node2vec

Chapter 0: The Problem

You're building a recommender system for a social network. You want to learn embeddings for each user such that users who are likely to interact are embedded nearby. Your tool: DeepWalk — uniform random walks on the network. You train embeddings, evaluate, and the system works. Users with common friends cluster together. Done.

Then your colleague runs the same system on a protein-protein interaction network to classify protein function. Results are terrible. DeepWalk, same algorithm, completely different performance. Why?

The answer is that these two tasks require fundamentally different notions of "similar node." In the social network, two users are similar because they live in the same community — they share friends, interests, geography. This is called homophily: birds of a feather flock together. Embeddings should capture community membership.

In the protein network, two proteins are similar because they play the same structural role in different parts of the network — both are "hub connectors" or both are "leaf proteins." This is called structural equivalence: nodes are similar if they have similar local network topology, regardless of where they are in the global graph.

The fundamental tension: Homophily wants embeddings that reflect community membership (close nodes embed together). Structural equivalence wants embeddings that reflect graph role (nodes with similar structure embed together, even if far apart). These require opposite exploration strategies. One embedding method cannot be optimal for both — unless you let the algorithm choose.

Homophily vs. Structural Equivalence

This graph has two communities (warm and teal) connected by a bridge. Nodes A and B are structurally equivalent (both are community hubs) but far apart. Click "Homophily" or "Structural" to see which nodes should embed nearby under each objective.

Select a similarity notion to highlight which nodes should embed nearby.

Two nodes are in completely different parts of a large graph, but both are "connector hubs" bridging two communities. Which similarity notion says they should embed nearby?

Homophily — they are in the same community Structural equivalence — they play the same structural role despite different locations Neither — if they're far apart they should always embed far apart

Chapter 1: Random Walk Framework

node2vec inherits the encoder-decoder framework from DeepWalk. Nodes are mapped to d-dimensional vectors. Similarity is measured by dot product. The objective: find embeddings such that nodes co-occurring in random walks have high dot products.

The Neighborhood N_S(u)

For a node u, define its network neighborhood N_S(u) as the multiset of nodes visited during random walks from u using strategy S. The objective is:

max_f ∑_{u ∈ V} log Pr(N_S(u) | f(u))

where f: V → ℝ^d is the embedding function. Two simplifying assumptions (both from Word2Vec):

Conditional independence: The likelihood of each neighbor node in N_S(u) is independent of the others given f(u). So the product factorizes over individual terms.
Symmetry in feature space: A node's embedding as source and as context (neighbor) are the same. So Pr(n_i|f(u)) is modeled symmetrically.

The conditional probability of visiting node v given source node u is modeled with a softmax:

Pr(v | f(u)) = exp(f(u) · f(v)) / ∑_{n ∈ V} exp(f(u) · f(n))

The strategy S is what makes node2vec different from DeepWalk. It controls HOW the random walk explores the graph — biased toward local or global structure. Different strategies produce different N_S(u), which changes what the optimization objective forces embeddings to encode.

Why Random Walks, Not BFS/DFS Directly?

You might ask: why not directly compute BFS neighborhoods or DFS neighborhoods and use those as N_S(u)? Two reasons. First, exact BFS/DFS neighborhoods can be huge — they scale with graph size. Random walks give a flexible, sampled approximation that's controllable via walk length. Second, pure BFS and DFS are the extremes. The power of node2vec is the ability to smoothly interpolate between them, finding the right balance for your task.

The key question: What walk strategy S produces a neighborhood N_S(u) that, when embedded, best captures the similarity we care about? DeepWalk uses S = uniform. node2vec uses S = biased with parameters p and q.

Why do the two assumptions (conditional independence and feature space symmetry) simplify the optimization?

They let you decompose the joint probability over N_S(u) into a sum of individual log-probabilities, one per visited node — making the loss tractable They eliminate the need for the softmax normalization They guarantee the embedding matrix converges to the adjacency matrix

Chapter 2: BFS vs. DFS Neighborhoods

Consider a source node u. Two classical graph exploration strategies give very different notions of "neighborhood":

BFS: Breadth-First Sampling

Breadth-first search explores all direct neighbors of u before moving to any node two hops away. The BFS neighborhood N_BFS(u) stays tightly local — it's essentially the immediate neighborhood of u, perhaps up to depth 2 or 3.

BFS neighborhoods capture homophily. Two nodes with overlapping BFS neighborhoods — sharing many neighbors — are likely in the same community. BFS-based embeddings encode community membership: users in the same social circle embed nearby.

DFS: Depth-First Sampling

Depth-first search follows one path as far as possible before backtracking. The DFS neighborhood N_DFS(u) rapidly moves away from u, visiting distant nodes.

DFS neighborhoods capture structural equivalence. A long DFS walk from u visits nodes across the whole graph, exposing structural patterns. Two nodes where DFS walks have similar statistical distributions (visiting similar types of nodes in similar proportions) are likely structurally equivalent — even if they're in completely different parts of the graph.

The paper's framing: The choice of S determines which network property gets preserved in embedding space. BFS → homophily. DFS → structural equivalence. node2vec's insight: don't choose — parameterize the walk to interpolate continuously between the two.

Why Pure BFS or DFS Fails

BFS limitation

The BFS neighborhood of u is exactly the same if u is the center of a star or the center of a clique — both have the same set of immediate neighbors. BFS cannot distinguish structural roles beyond degree.

DFS limitation

DFS is high-variance. A single depth-first walk from u might take completely different paths on different runs. It's inefficient for capturing local community structure, where you need to densely sample the neighborhood.

BFS vs. DFS Neighborhood Visualization

Click a node, then toggle BFS/DFS to see which nodes get included in its neighborhood (highlighted). Notice how BFS stays local and DFS reaches far.

Click a node, then select BFS or DFS to see its neighborhood.

For a node classification task where community membership determines the label (e.g., political affiliation in a social network), which exploration strategy should dominate?

BFS — it densely samples the local community, capturing shared neighbors and community membership DFS — it explores far and reveals structural patterns Neither — community membership can only be detected with spectral methods

Chapter 3: The p,q Parameters — Walk Explorer

node2vec introduces a second-order random walk. Unlike a standard random walk that only remembers the current node, node2vec's walk remembers the previous node too. This one extra bit of memory is enough to implement a continuous BFS–DFS tradeoff.

The Transition Probabilities

Suppose the walk has just traversed edge (t → v) — it came from node t and is now at node v. It must choose where to go next. For each neighbor x of v, the unnormalized transition weight α_pq(t, x) is:

α_pq(t, x) = 1/p if d_tx = 0    (x = t, returning)
1        if d_tx = 1    (x is neighbor of both t and v)
1/q if d_tx = 2    (x is farther from t than v was)

where d_tx is the shortest-path distance from t to x. These unnormalized weights are then divided by their sum to give proper transition probabilities.

Intuition for p (Return Parameter)

p controls the likelihood of revisiting a node you just came from. High p: returning to t is very unlikely (1/p is small) — the walk prefers to explore new territory. Low p: returning to t is very likely — the walk tends to backtrack, staying local. Think of p as the "backtracking penalty."

Intuition for q (In-Out Parameter)

q controls whether the walk explores inward (toward t) or outward (away from t). Low q: 1/q is large — moving farther from t is preferred — the walk explores outward, DFS-style. High q: 1/q is small — moving farther from t is discouraged — the walk prefers nodes near t, BFS-style. Think of q as "how adventurous the walk is."

The sweet spot: p=1, q=1 → uniform walk (same as DeepWalk). p<1, q>1 → BFS-like, captures homophily. p>1, q<1 → DFS-like, captures structural equivalence. The paper searches p,q in {0.25, 0.5, 1, 2, 4} and finds the best pair for each task via cross-validation.

node2vec Walk Explorer — p & q Sliders ★ SHOWCASE

Select a start node, adjust p and q, then run walks. The heatmap shows which nodes get visited most. Low q = DFS (explores far, blue halo). Low p = BFS (stays local, orange halo). The walk's "memory" of the previous node drives all the magic.

p (return) 1.0

q (in-out) 1.0

Select a start node (click on graph), adjust p and q, then run walks.

After traversing edge (t→v), node2vec considers moving to neighbor x with d(t,x) = 2. What transition weight is assigned?

1/p — the return parameter penalizes moving away from t 1 — the neutral weight for staying at the same distance from t 1/q — the in-out parameter controls how much to favor moving farther from t

Chapter 4: Skip-gram Objective

Once walks are generated, node2vec uses the skip-gram objective from Word2Vec. The walk is treated as a "sentence": a sequence of node IDs. The skip-gram objective asks the model to predict the context (surrounding nodes within a window) given a source node.

Formally

Given a walk w₁, w₂, ..., w_L, for each node w_i, predict the nodes w_i-k, ..., w_i-1, w_i+1, ..., w_i+k in a window of size k. The objective maximizes:

∑_i=1^L ∑_{-k ≤ j ≤ k, j≠0} log Pr(w_i+j | f(w_i))

The conditional probability uses the softmax:

Pr(w_i+j | f(w_i)) = exp(f(w_i) · f(w_i+j)) / ∑_{v ∈ V} exp(f(w_i) · f(v))

What Is Each Element?

Symbol	Meaning	Data Flow
f(u)	Embedding of node u	Column u of learnable matrix Z ∈ ℝ^d×\|V\|
f(u) · f(v)	Dot product similarity	ℝ^d · ℝ^d → scalar
Softmax denominator	Partition function over all nodes	\|V\| dot products per training step
Window k	Context size	Typically k = 10 for node2vec
Walk length L	Steps per random walk	Typically L = 80
Walks per node r	Walk count per source node	Typically r = 10

The Walk Generation Strategy

Before optimization, node2vec pre-generates all random walks. This is more efficient than sampling on-the-fly during training. The pipeline:

python
def node2vec_train(G, p, q, d=128, r=10, l=80, k=10):
    # Precompute transition probabilities (second-order Markov)
    probs = precompute_transition_probs(G, p, q)
    # probs[(t, v)] = dict {neighbor: weight} for each edge (t,v)

    # Generate walk corpus
    walks = []
    for _ in range(r):               # r walks per node
        for u in G.nodes():
            walk = biased_walk(G, probs, u, l)  # length l
            walks.append(walk)

    # Train skip-gram (Word2Vec) on walk corpus
    model = Word2Vec(walks, vector_size=d, window=k,
                      sg=1, min_count=0, workers=4)
    return {u: model.wv[u] for u in G.nodes()}
    # Returns dict: node_id -> np.array of shape (d,)

Precomputing transitions is the key optimization. The second-order Markov transitions depend only on the edge (t, v), not on the full walk history. So you can precompute, for every directed edge (t, v) in G, the transition probabilities from v given you came from t. This is O(|E|) space and time upfront — then sampling walks is fast.

With walk length L=80, walks per node r=10, and window k=10, how many training pairs does node2vec generate for a single node u?

r × (2k) × L = 10 × 20 × 80 = 16,000 pairs — each of 10 walks contributes ~20 context nodes per position, over 80 positions L = 80 pairs — one per step in the walk 2k = 20 pairs — only the window context matters

Chapter 5: Negative Sampling

The softmax denominator ∑_v∈V exp(f(u)·f(v)) is the core computational bottleneck. For a graph with a million nodes, computing this sum requires a million dot products — for every (u, context) pair in every training step. This is completely intractable.

The Noise-Contrastive Estimation (NCE) Idea

Instead of normalizing over all V nodes, reformulate as a binary classification problem: given node u and a candidate node v, is v a true context of u (from the walk), or is it noise (a random node)? This is noise-contrastive estimation.

Mikolov et al. showed that a simplified version — negative sampling — works well in practice. For each positive (u, v) pair from the walk, sample k negative nodes n₁, ..., n_k from a noise distribution P_n. The loss becomes:

L = −log σ(f(v) · f(u)) − ∑_i=1^k E_{n_i ~ P_n}[log σ(−f(n_i) · f(u))]

This has two terms:

−log σ(f(v)·f(u)): Maximize the dot product between true co-occurring pairs. Pull their embeddings together.
−log σ(−f(n)·f(u)): Minimize the dot product with negative samples. Push those embeddings apart.

The Noise Distribution

node2vec follows Word2Vec: sample negative nodes with probability proportional to degree^3/4. Why 3/4 and not uniform? Uniform would under-sample high-degree nodes (hubs), which are the most informative negatives — they appear everywhere in the graph. Degree^3/4 biases toward frequent nodes without making them completely dominate:

P_n(v) = degree(v)^3/4 / ∑_u degree(u)^3/4

Effect of k on Quality

k (negatives)	Training speed	Embedding quality	Typical use
1–2	Very fast	Lower	Huge graphs (>100M nodes)
5–10	Fast	Good	node2vec default
15–20	Moderate	Very good	Small graphs with sparse edges
>20	Slow	Diminishing returns	Rarely used

Why not exact softmax? Consider a graph with |V| = 10⁶ nodes. With k=10 negatives, each training step costs 11 dot products. With exact softmax, it's 10⁶ dot products. That's a 100,000x speedup. At 10⁸ training pairs, this is the difference between an afternoon and a decade.

python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(f_u, f_v, Z, degrees, k=10):
    # f_u, f_v: embeddings of source and context node, shape (d,)
    # Z: all embeddings, shape (|V|, d)
    # degrees: node degree array, shape (|V|,)

    # Positive term
    pos_score = f_u @ f_v                          # scalar
    pos_loss = -np.log(sigmoid(pos_score) + 1e-8)

    # Negative sampling distribution: degree^0.75
    probs = degrees ** 0.75
    probs /= probs.sum()
    neg_idx = np.random.choice(len(degrees), size=k, p=probs)

    # Negative terms
    neg_scores = Z[neg_idx] @ f_u                  # shape (k,)
    neg_loss = -np.sum(np.log(sigmoid(-neg_scores) + 1e-8))

    return pos_loss + neg_loss / k

Why does node2vec sample negatives with probability proportional to degree^3/4 rather than uniform?

Uniform would under-sample high-degree nodes (hubs), which are the most informative negatives; the 3/4 power biases toward frequent nodes without making them completely dominate Degree^3/4 minimizes the variance of the gradient estimator mathematically It prevents the model from ever placing a hub node in a negative sample

Chapter 6: Link Prediction with Embeddings

Node embeddings z_u were designed for nodes. But many tasks are about pairs of nodes — specifically, link prediction: given nodes u and v, does an edge exist between them (or will one form in the future)?

The trick: define an edge embedding using a binary operator on the two node embeddings. node2vec evaluates four operators:

Operator	Formula	Intuition
Hadamard	z_u ⊙ z_v	Element-wise product; captures feature co-occurrence
Average	(z_u + z_v) / 2	Midpoint in embedding space
Weighted-L1	\|z_u − z_v\|	Element-wise absolute difference
Weighted-L2	\|z_u − z_v\|²	Squared element-wise difference

Each operator produces a d-dimensional vector for each edge. This vector is then fed into a logistic regression classifier trained on known edges (positive examples) and non-edges (negative examples).

Training Setup

The link prediction setup requires careful handling to avoid data leakage:

Split edges into train/test sets. Remove test edges from the graph before generating walks.
Generate node embeddings using node2vec on the training graph only.
For each edge in train set and an equal number of non-edges, compute edge embedding using one of the four operators.
Train logistic regression on these edge embeddings.
Evaluate on test edges + equal non-edges using AUC-ROC.

The Hadamard operator wins. Across most datasets in the paper, the Hadamard product (element-wise multiplication) outperforms the other operators for link prediction. The intuition: if both z_u[i] and z_v[i] are large in dimension i, then dimension i represents a feature both nodes share — exactly what you'd expect for nodes that should be connected. The product amplifies agreement in each dimension.

python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np

def link_prediction(embeddings, train_edges, train_nonedges,
                    test_edges, test_nonedges, operator='hadamard'):
    def edge_emb(u, v):
        zu, zv = embeddings[u], embeddings[v]
        if operator == 'hadamard': return zu * zv
        if operator == 'avg':      return (zu + zv) / 2
        if operator == 'l1':       return np.abs(zu - zv)
        if operator == 'l2':       return (zu - zv) ** 2

    # Build training features and labels
    X_train = [edge_emb(u, v) for u, v in train_edges + train_nonedges]
    y_train = [1] * len(train_edges) + [0] * len(train_nonedges)

    clf = LogisticRegression().fit(X_train, y_train)

    # Evaluate on test set
    X_test = [edge_emb(u, v) for u, v in test_edges + test_nonedges]
    y_test  = [1] * len(test_edges) + [0] * len(test_nonedges)
    scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores)

When using node2vec embeddings for link prediction, why must you remove test edges from the graph BEFORE generating random walks?

Otherwise walks could traverse test edges, directly learning that those node pairs co-occur — which would be data leakage since co-occurrence drives the embedding objective Test edges would slow down the walk generation step Removing test edges makes the graph smaller, which is always better for training

Chapter 7: Results

The paper evaluates node2vec on two tasks: multi-label node classification and link prediction. The key result: node2vec outperforms all baselines on both tasks, on all datasets. And the p,q parameters are the reason.

Node Classification Datasets

Dataset	Nodes	Edges	Labels	Task
BlogCatalog	10,312	333,983	39	Social interest groups
Protein-Protein (PPI)	3,890	76,584	50	Biological states of proteins
Wikipedia	4,777	184,812	40	Part-of-speech labels

Node Classification Results (F1 Score)

Method	BlogCatalog	PPI	Wikipedia
Spectral Clustering	0.0405	0.0681	0.1164
DeepWalk	0.2110	0.1768	0.1839
LINE	0.0784	0.1447	0.1244
node2vec	0.2581	0.2033	0.1989

node2vec improves over DeepWalk by 22% on BlogCatalog (a social network — homophily) and 15% on PPI (a protein network — structural equivalence). The improvement comes from tuning p and q.

Link Prediction (AUC-ROC)

Dataset	Spectral	DeepWalk	LINE	node2vec
Facebook	0.9625	0.9680	0.9490	0.9980
PPI	0.9429	0.9568	0.9623	0.9760
arXiv (astro)	0.9607	0.9786	0.9730	0.9996
arXiv (hep)	0.9515	0.9761	0.9668	0.9905

On Facebook (social network), node2vec achieves 0.9980 AUC — near-perfect link prediction. On PPI (protein interaction), it reaches 0.9760. Across all four datasets, node2vec achieves the highest AUC, with the Hadamard operator performing best.

The ablation confirms it: The paper shows that removing the p,q parameterization (setting p=q=1, reducing to DeepWalk) consistently hurts performance. The gap is especially large on BlogCatalog (+22%) — a social network where homophily matters and low-p BFS walks give the right neighborhood definition.

Results Comparison

Bar chart of F1 scores for node classification across methods. node2vec consistently wins — by a larger margin on social networks (homophily dominant) than protein networks (structural equivalence dominant).

node2vec improves over DeepWalk by ~22% on BlogCatalog but only ~15% on PPI. Why might the social network benefit more from p,q tuning?

BlogCatalog is a social network where community membership (homophily) dominates. Low-p BFS walks densely sample communities, providing a much richer neighborhood signal than DeepWalk's uniform walks BlogCatalog has more edges, giving more training signal PPI labels are harder to predict so no method can improve much

Chapter 8: When to Use Which Settings

By now you understand the theory. But in practice, how do you choose p and q for a new graph? Here's a principled guide based on task type, graph structure, and what the literature shows.

Task-Based Guidelines

Task	Signal Type	Recommended	Intuition
Community detection	Homophily	Low p, high q (BFS-like)	Dense local sampling captures community membership
Role discovery	Structural equivalence	High p, low q (DFS-like)	Outward exploration exposes structural patterns
Link prediction (social)	Homophily	Low p, moderate q	Friends-of-friends are most likely future friends
Link prediction (biology)	Mixed	Grid search p,q ∈ {0.25, 0.5, 1, 2, 4}	Protein interactions can be structural or community
Unknown	Unknown	p=1, q=1 (DeepWalk)	Reasonable default; then tune if needed

Graph Structure Signals

You can diagnose what type of similarity is dominant before training by examining graph structure:

High clustering coefficient → tight communities → BFS-favoring (low p)
Many structural holes (bridge nodes) → DFS-favoring (low q)
Power-law degree distribution → hubs play special roles → DFS may help with role discovery
Regular structure (lattice, grid) → uniform walks work fine

Hyperparameter Search Protocol

The paper uses a simple grid search: p ∈ {0.25, 0.5, 1, 2, 4}, q ∈ {0.25, 0.5, 1, 2, 4}. That's 25 combinations. Train node2vec once for each combination (walks only need to be regenerated, not the optimization from scratch for each). Evaluate on a 10% held-out node validation set. Pick the best p, q.

Practical tip: Walk generation is the bottleneck for large graphs, not training. If you have |V| = 10⁶ nodes, generating walks 25 times is expensive. Instead, generate a large walk corpus once and then train with each p,q by reweighting the samples — though this only works approximately. For medium graphs (<100k nodes), full grid search is fast (<30 minutes total).

When NOT to Use node2vec

New nodes (inductive setting): node2vec is a lookup table — new nodes have no embedding. Use GraphSAGE or a GNN instead.
Rich node features: node2vec ignores node attributes. If nodes have feature vectors, GNNs that aggregate them will perform much better.
Dynamic graphs: Node2vec requires retraining for any graph change. For evolving graphs, use CTDNE or streaming methods.
Very small graphs (<100 nodes): the training corpus is tiny; random walk sampling is noisy. Spectral methods or direct GNNs are more reliable.

You're studying protein-protein interaction networks and want to find proteins that play the same structural role (e.g., "hub connector") across different organisms. Which p,q setting should you use?

Low p, high q — BFS-like walks stay in the local community High p, low q — DFS-like walks explore far, exposing structural role patterns across the network p=1, q=1 — structural equivalence only works with uniform walks

Chapter 9: Connections & Legacy

node2vec sits at the intersection of random walk theory, Word2Vec, and network science. Understanding its place in the literature explains both why it worked and where to go next.

Predecessors

Spectral Embeddings (2001–2003)

Belkin & Niyogi (Laplacian Eigenmaps), Roweis & Saul (LLE). Embed by preserving local geometry. Matrix factorization of graph Laplacian. Transductive, no features, expensive.

↓ replaced by stochastic sampling

DeepWalk (2014)

Perozzi et al. Random walks + skip-gram. First to show NLP tools (Word2Vec) work on graphs. Uniform walks — no control over exploration.

↓ extended by biased walks

node2vec (2016)

Grover & Leskovec. Add p, q parameters. Smoothly interpolate BFS–DFS. Enable both homophily and structural equivalence capture.

↓ superseded by learned aggregation

GNNs: GCN, GraphSAGE, GAT (2016–2018)

Kipf & Welling, Hamilton et al., Veličković et al. Replace lookup tables with learned aggregators. Inductive, feature-aware, generalizes to unseen nodes.

What node2vec Got Right (That's Still True)

BFS vs. DFS tradeoff is real. Even modern GNNs must choose their receptive field — local (shallow GNN) or global (deep GNN). The tension between homophily and structural equivalence is a fundamental graph property.
Walk-based neighborhood sampling is efficient. PinSage (Pinterest, 2018) uses random walk sampling inside a GNN framework to avoid expensive full neighborhood aggregation on billion-node graphs.
Simple baselines often win. On many industry benchmarks, node2vec with tuned p,q outperforms complex GNNs when node features are absent or low-quality.

What node2vec Got Wrong (Or Left Unresolved)

No node features. Real nodes have attributes — user demographics, protein sequences, word embeddings. Ignoring them leaves signal on the table.
Transductive. Can't embed new nodes. Unsuitable for streaming or growing graphs.
O(|V|·d) parameter count. For 100M-node graphs, the embedding matrix itself is a memory bottleneck.
No expressive power theory. GNNs have the Weisfeiler-Leman hierarchy; node2vec has no equivalent characterization of what graph structures it can distinguish.

Citation Impact

node2vec has over 14,000 citations (as of 2024). It is one of the most cited graph ML papers ever, and serves as the standard baseline for any node classification or link prediction task. "Did you beat node2vec?" is the minimum bar in graph ML papers.

Method	Year	Key Advance Over node2vec
GraphSAGE	2017	Inductive; aggregates neighbor features not just IDs
GAT	2018	Attention-weighted aggregation; learns importance of neighbors
Graph Transformer	2020+	Global attention; no local neighborhood assumption
SEAL	2018	Subgraph-based link prediction; beats node2vec on all link tasks
METIS + node2vec	Various	Cluster-then-embed for scalability on billion-node graphs

"The key is to be able to control the kind of communities you're trying to learn — and that's exactly what p and q give you." — Jure Leskovec

Paper: Grover & Leskovec, "node2vec: Scalable Feature Learning for Networks," KDD 2016.
arXiv: 1607.00653 | Code: github.com/aditya-grover/node2vec