How to learn low-dimensional vectors that encode who is close to whom — DeepWalk, node2vec, and the matrix factorization connection.
Suppose you want to classify nodes in a social network — spam accounts vs. real users. Your first instinct: compute some features for each node and feed them to a classifier. You choose degree (how many friends), clustering coefficient (how tight-knit the neighborhood), PageRank (how influential).
These features work, somewhat. But they're designed by hand. If you're doing a new task — say, link prediction instead of classification — you might need completely different features. Degree might not matter; triangle closure might. You're back to engineering features from scratch.
Worse, all of these features are summaries. They compress the node's neighborhood into a single number, losing enormous amounts of structural information. Two nodes that play completely different roles in the graph might happen to have the same degree and the same clustering coefficient.
Map every node v to a vector zv in d-dimensional space (d = 64, 128, 256...) such that nodes that are similar in the graph end up nearby in vector space. "Similar" will be defined precisely later — but the intuition is: if u and v are close in the graph, then zu and zv should be close in Euclidean space.
The dot product zvT zu measures how aligned two vectors are. High dot product = nearby in embedding space = similar in the graph. We want this to hold automatically, without hand-crafting anything.
Nodes in a graph (left) are mapped to vectors in 2D embedding space (right). Similar nodes — same color cluster — land close together. Click "Embed" to see the mapping.
We need a unified language for talking about node embeddings. The encoder-decoder framework gives us exactly that. Every embedding method can be described by: (1) how it maps nodes to vectors (the encoder), and (2) what it's trying to reconstruct from those vectors (the decoder).
The encoder ENC(v) takes a node and returns its embedding vector:
The simplest possible encoder: an embedding lookup table. Store a d-dimensional vector for every node in the graph. Looking up node v means returning column v of a matrix Z of size d × |V|. Or equivalently: multiply Z by a one-hot vector that has a 1 only at position v.
Think of Z as a learnable parameter matrix. Each column is one node's embedding. Training means finding the Z that makes embeddings useful.
The decoder takes two embedding vectors and produces a score measuring how similar those nodes should be:
Just the dot product. Simple, differentiable, captures alignment between vectors. If two vectors point in the same direction, their dot product is large and positive — they're "similar." If they're perpendicular, the dot product is zero — no relationship. If they point opposite directions, the dot product is negative.
Learn the encoder (Z) such that the decoder output (dot product) approximates node similarity in the graph:
The key question becomes: what is "similarity"? That's what distinguishes different embedding methods. We defer that question to the next chapter.
Click two nodes to see how their embeddings are computed (encoder) and how their similarity score is produced (decoder).
We said we want the dot product zuT zv to approximate "node similarity." But what IS node similarity? This turns out to be the most important design choice in any embedding method.
The most obvious choice: u and v are similar if and only if there is an edge between them. Then:
where A is the adjacency matrix. This means we want ZTZ ≈ A. We want the matrix of all pairwise dot products to approximate the adjacency matrix. This is literally matrix factorization of the adjacency matrix — and it's a classical method called Laplacian Eigenmaps.
Consider two nodes at the ends of a long chain: A—B—C—D—E. Nodes A and E are not adjacent, so under adjacency-based similarity, they're "dissimilar" — their embeddings should be far apart. But structurally, A and E play the same role. They're both leaves at the end of paths. Ideally, they should embed nearby.
More importantly, if u and v have many common friends but no direct edge, we might still want them to be similar. Two users who both talk to the same 50 people are probably in the same community, even if they've never interacted directly.
Instead of asking "are u and v connected?", we should ask "do u and v have similar neighborhoods?" This captures community structure, common friends, and structural equivalence. The key insight of random-walk embeddings is that you can measure neighborhood overlap without computing it exactly — just sample paths and see which nodes co-occur.
| Similarity Definition | What It Captures | Method |
|---|---|---|
| A[u,v] = 1 (edge exists) | Direct connection only | Laplacian Eigenmaps |
| Common neighbors: |N(u) ∩ N(v)| | Local community | Spectral clustering |
| Random walk co-occurrence PR(v|u) | Multi-hop neighborhood structure | DeepWalk, node2vec |
| Graph kernel similarity | Subgraph patterns | Graph kernel methods |
Here's the big idea. Instead of defining similarity as "are u and v adjacent?", we define it as: how likely are u and v to co-occur on a short random walk?
A random walk starts at some node, then repeatedly moves to a randomly chosen neighbor. After L steps, you've visited L+1 nodes. Run many such walks from every node in the graph. If nodes u and v both appear in many of the same walks, they are "similar" — they share structural context.
Let NR(u) be the multiset of nodes visited during random walks starting from u. We want to find embeddings that maximize the probability of observing NR(u) given u's embedding:
The conditional probability P(v|zu) is modeled as a softmax over dot products:
The softmax says: node v is likely given u if zu · zv is large compared to all other dot products. The denominator normalizes over ALL nodes — every node competes to explain the observation.
For each node u, we're asking: "Given that I'm at u, what nodes should I expect to see on a random walk from u?" We're training embeddings to answer this question correctly. Nodes that u visits often (nearby neighbors, frequent co-travelers) will have large dot products with zu. Nodes u never visits will have small (or negative) dot products.
DeepWalk (Perozzi et al., KDD 2014) is the algorithm that made random walk embeddings practical. The idea is beautifully simple: use uniform random walks. At each step, choose a neighbor uniformly at random. No bias, no preference — pure diffusion.
At each step, the walk moves to a randomly chosen neighbor with equal probability. On a node with degree k, each neighbor has probability 1/k of being chosen. Over many steps, this is equivalent to a random surfer exploring the graph without any preference for direction.
python def deepwalk(G, walk_length=80, walks_per_node=10, d=128, window=5): # Step 1: Generate random walk corpus walks = [] for node in G.nodes(): for _ in range(walks_per_node): walk = [node] for _ in range(walk_length - 1): neighbors = list(G.neighbors(walk[-1])) if not neighbors: break walk.append(random.choice(neighbors)) # uniform walks.append(walk) # Step 2: Train skip-gram on walks (same as Word2Vec) model = Word2Vec(walks, vector_size=d, window=window, min_count=0, sg=1) # sg=1 means skip-gram # Step 3: Extract embeddings embeddings = {node: model.wv[node] for node in G.nodes()} return embeddings # dict: node -> np.array of shape (d,)
| Stage | Input | Output | Shape |
|---|---|---|---|
| Walk generation | Graph G with |V| nodes | Walk corpus | |V|×T walks of length L |
| Skip-gram training | Walk corpus | Embedding matrix Z | |V| × d |
| Node lookup | Node ID v (int) | Embedding zv | d-dimensional vector |
Click a node to start a random walk from it. Watch the walk propagate across the graph. Co-occurrence counts build up as you run more walks — nodes that co-occur often will end up with similar embeddings.
DeepWalk's uniform walks have a hidden assumption: all neighbors are equally relevant. But in practice, there are two very different things you might want your embeddings to capture:
Grover and Leskovec (KDD 2016) showed that these two objectives require fundamentally different walk strategies. Homophily requires walks that stay local — BFS-style. Structural equivalence requires walks that explore far — DFS-style.
After traversing edge (t → v) and arriving at node v, the next step's transition probability from v depends on the distance to the previous node t. Let w be a neighbor of v we're considering moving to:
| Distance from t to w | Transition weight | Interpretation |
|---|---|---|
| dist(t, w) = 0 (w = t, return) | 1/p | Return to where we came from |
| dist(t, w) = 1 (w is neighbor of both t and v) | 1 | Stay at same distance from t |
| dist(t, w) = 2 (w is farther from t) | 1/q | Move away from t |
These unnormalized weights are then softmaxed over all neighbors of v to get proper transition probabilities.
Weight 1/q is LARGE for moving away from t. The walk keeps exploring outward, visiting nodes farther and farther from the origin. Captures structural equivalence — the walk sees the large-scale structure.
Weight 1/p is LARGE for returning to t. The walk stays tightly around the origin, revisiting local nodes. Captures homophily — the walk samples the local community densely.
Adjust p and q to see how walk behavior changes. Low q = DFS (explores far, teal trail). Low p = BFS (stays local, warm trail). The heatmap shows which nodes get visited most from the selected start node.
We have our objective: maximize Σu Σv ∈ NR(u) log P(v|zu). The problem: the softmax denominator sums over ALL nodes.
For a graph with |V| = 1,000,000 nodes, computing this denominator requires one million dot products — for every (u, v) pair in the training data. This is completely intractable.
The fix: instead of normalizing over all nodes, use negative sampling. For each positive pair (u, v) that co-occurs in a walk, sample k random "negative" nodes that probably did NOT co-occur. The loss becomes:
Breaking this down:
Instead of comparing against all |V| nodes, we compare against just k negative samples. With k = 5 to 20, we get a good approximation at a tiny fraction of the cost.
How should we choose negative samples? Naively, uniform sampling over all nodes. But this over-samples high-degree nodes (hubs appear everywhere). The standard practice: sample nodes with probability proportional to degree3/4. This keeps frequent nodes represented but reduces their dominance:
python import numpy as np def negative_sampling_loss(z_u, z_v, Z, degrees, k=10): # z_u, z_v: positive pair embeddings, shape (d,) # Z: all embeddings, shape (|V|, d) # degrees: array of node degrees, shape (|V|,) # Positive term: log sigma(z_u · z_v) pos = np.log(sigmoid(z_u @ z_v)) # Sample k negative nodes (proportional to degree^0.75) probs = degrees ** 0.75 probs /= probs.sum() neg_idx = np.random.choice(len(degrees), size=k, p=probs) # Negative terms: sum of log sigma(-z_u · z_n) neg = np.mean(np.log(sigmoid(-Z[neg_idx] @ z_u))) return -(pos + neg) # minimize negative loss
Here's a surprising result: DeepWalk is not actually doing something fundamentally new. Under the hood, it's factorizing a very specific matrix derived from the graph. This connection was proved by Levy & Goldberg (2014) for Word2Vec and extended to graphs by Qiu et al. (2018) for NetMF.
For a graph with adjacency matrix A and degree matrix D = diag(degree(v)), define the transition matrix M = D−1A. Then Mr[u, v] is the probability of going from u to v in exactly r steps. The r-step random walk co-occurrence matrix is the average over all walk lengths from 1 to T:
Qiu et al. proved that DeepWalk with T-length walks implicitly factorizes the matrix:
where vol(G) = ∑v degree(v) is the total degree sum and b is the number of negative samples. This matrix — a shifted log of the average transition probability matrix — is called the PMI (Pointwise Mutual Information) matrix of the random walk co-occurrences.
Before DeepWalk, spectral embeddings were standard. The idea: compute the top-d eigenvectors of the normalized Laplacian L = I − D−1/2AD−1/2. This gives the same kind of result — nearby nodes embed nearby — but is limited to linear structure and doesn't generalize well to unseen nodes.
| Method | Objective | Complexity | Inductive? |
|---|---|---|---|
| Laplacian Eigenmaps | ZTZ ≈ A | O(|V|2) | No |
| DeepWalk | PMI factorization, random walks | O(|V| · T · L) | No |
| node2vec | Biased PMI factorization | O(|V| · T · L) | No |
| GNNs (Lec 4+) | Learned aggregation | O(|E| · d) | Yes |
So far, we've embedded individual nodes. But sometimes you need to embed an entire graph as a single vector — to classify a molecule as toxic or safe, to compare two protein interaction networks, to detect communities. How do you go from node embeddings to a graph embedding?
The simplest idea: once you have node embeddings, just add them all up (or take their mean):
This is permutation-invariant (doesn't depend on node ordering) and simple. But it's a very coarse summary — it collapses all structural information into a fixed-size vector by pure averaging. Two completely different graphs can have the same sum of node embeddings.
Add a virtual "super-node" connected to every real node. Train embeddings on the augmented graph. The virtual node's embedding, because it's connected to everyone, must encode the global structure. After training, the virtual node's embedding IS the graph embedding.
The most principled approach: hierarchical clustering then pooling. DiffPool (Ying et al., NeurIPS 2018) simultaneously learns node embeddings and cluster assignments. At each level, it:
The single remaining node's embedding is the graph embedding. This captures hierarchical structure — which clusters exist, how they connect — rather than just averaging everything.
Three strategies for obtaining a graph-level embedding from node embeddings. Click each method to see the pooling operation illustrated.
Node embeddings are powerful, but they have a fundamental limitation: the embedding matrix Z is just a lookup table. You can only embed nodes you've seen during training. Add a new node to the graph? You'd have to retrain from scratch. This is called a transductive setting — the model doesn't generalize to unseen inputs.
Graph Neural Networks (Lectures 4+) replace the lookup table encoder with a learned aggregation function. Instead of storing an embedding for each node, a GNN computes a node's embedding on-the-fly by aggregating its neighbors' features. New nodes can be embedded immediately — just run the GNN on their neighborhood. This is the inductive setting.
Pinterest's PinSage (Ying et al., KDD 2018) combines random walk ideas with GNNs. It uses random walks to define the neighborhood (who to aggregate from) rather than using all direct neighbors — this is crucial for efficiency on billion-node graphs where full neighborhood aggregation is infeasible.
| Limitation | Impact | Solution |
|---|---|---|
| Transductive | Cannot embed new nodes | GNNs (inductive) |
| Ignores node features | Misses attribute information | Feature-based GNNs |
| Parameter-heavy | O(|V|×d) parameters | GNN weights scale with depth, not nodes |
| No structural reasoning | Can't generalize patterns | Graph Transformers, GNN with structural PE |
| Concept | Key Idea |
|---|---|
| Encoder-Decoder | ENC(v) = zv, DEC(zu, zv) = dot product ≈ similarity |
| Similarity | Adjacency → random walk co-occurrence → richer structure |
| DeepWalk | Uniform walks + skip-gram = PMI matrix factorization |
| node2vec | p controls return, q controls exploration; BFS vs. DFS |
| Negative Sampling | Replace full softmax with k negative examples (k = 5–20) |
| Graph-level | Sum/virtual node/hierarchical pooling of node embeddings |
"The goal is not to have a technique but to solve problems." — Jure Leskovec
Next up: Lecture 3 covers Graph Neural Networks, where the encoder becomes a deep network aggregating neighbors rather than a lookup table. The decoder stays the same.