CS224W Lecture 2

Node Embeddings

How to learn low-dimensional vectors that encode who is close to whom — DeepWalk, node2vec, and the matrix factorization connection.

Prerequisites: Graph basics (from Lec 1) + dot products + logarithms. That's it.
10
Chapters
4+
Simulations
0
Assumed Knowledge

Chapter 0: The Problem

Suppose you want to classify nodes in a social network — spam accounts vs. real users. Your first instinct: compute some features for each node and feed them to a classifier. You choose degree (how many friends), clustering coefficient (how tight-knit the neighborhood), PageRank (how influential).

These features work, somewhat. But they're designed by hand. If you're doing a new task — say, link prediction instead of classification — you might need completely different features. Degree might not matter; triangle closure might. You're back to engineering features from scratch.

Worse, all of these features are summaries. They compress the node's neighborhood into a single number, losing enormous amounts of structural information. Two nodes that play completely different roles in the graph might happen to have the same degree and the same clustering coefficient.

The core question: Can we learn features automatically — features that encode graph structure without us having to decide what to measure? Yes. That's node embedding.

The Goal

Map every node v to a vector zv in d-dimensional space (d = 64, 128, 256...) such that nodes that are similar in the graph end up nearby in vector space. "Similar" will be defined precisely later — but the intuition is: if u and v are close in the graph, then zu and zv should be close in Euclidean space.

similarity(u, v) ≈ zvT zu

The dot product zvT zu measures how aligned two vectors are. High dot product = nearby in embedding space = similar in the graph. We want this to hold automatically, without hand-crafting anything.

The Embedding Idea

Nodes in a graph (left) are mapped to vectors in 2D embedding space (right). Similar nodes — same color cluster — land close together. Click "Embed" to see the mapping.

Graph view. Click Embed to project nodes into 2D vector space.
What is the main limitation of hand-designed graph features like degree or clustering coefficient?

Chapter 1: Encoder-Decoder Framework

We need a unified language for talking about node embeddings. The encoder-decoder framework gives us exactly that. Every embedding method can be described by: (1) how it maps nodes to vectors (the encoder), and (2) what it's trying to reconstruct from those vectors (the decoder).

The Encoder

The encoder ENC(v) takes a node and returns its embedding vector:

ENC(v) = zv ∈ ℝd

The simplest possible encoder: an embedding lookup table. Store a d-dimensional vector for every node in the graph. Looking up node v means returning column v of a matrix Z of size d × |V|. Or equivalently: multiply Z by a one-hot vector that has a 1 only at position v.

ENC(v) = Z · v     (Z ∈ ℝd × |V|, v is one-hot)

Think of Z as a learnable parameter matrix. Each column is one node's embedding. Training means finding the Z that makes embeddings useful.

The Decoder

The decoder takes two embedding vectors and produces a score measuring how similar those nodes should be:

DEC(zu, zv) = zuT zv

Just the dot product. Simple, differentiable, captures alignment between vectors. If two vectors point in the same direction, their dot product is large and positive — they're "similar." If they're perpendicular, the dot product is zero — no relationship. If they point opposite directions, the dot product is negative.

The Objective

Learn the encoder (Z) such that the decoder output (dot product) approximates node similarity in the graph:

DEC(zu, zv) = zuT zv ≈ similarity(u, v)

The key question becomes: what is "similarity"? That's what distinguishes different embedding methods. We defer that question to the next chapter.

What we're NOT doing yet: The encoder here is just a lookup table — no neural network, no aggregation from neighbors. Chapters 3–5 add that. GNNs (Lectures 4+) replace the lookup encoder with a deep network that computes each node's embedding from its neighborhood structure. But the encoder-decoder framework stays the same.
Encoder-Decoder Diagram

Click two nodes to see how their embeddings are computed (encoder) and how their similarity score is produced (decoder).

Click two nodes to compute their similarity.
In the lookup-table encoder ENC(v) = Z·v, what does each column of Z represent?

Chapter 2: What Is "Similarity"?

We said we want the dot product zuT zv to approximate "node similarity." But what IS node similarity? This turns out to be the most important design choice in any embedding method.

The Simplest Definition: Adjacency

The most obvious choice: u and v are similar if and only if there is an edge between them. Then:

zuT zv ≈ A[u, v]

where A is the adjacency matrix. This means we want ZTZ ≈ A. We want the matrix of all pairwise dot products to approximate the adjacency matrix. This is literally matrix factorization of the adjacency matrix — and it's a classical method called Laplacian Eigenmaps.

Matrix factorization view: Finding Z such that ZTZ ≈ A is equivalent to computing the top-d eigenvectors of A. The resulting embeddings preserve "direct connection" similarity — connected nodes embed nearby, unconnected nodes embed far apart.

Why Adjacency Alone Is Too Simple

Consider two nodes at the ends of a long chain: A—B—C—D—E. Nodes A and E are not adjacent, so under adjacency-based similarity, they're "dissimilar" — their embeddings should be far apart. But structurally, A and E play the same role. They're both leaves at the end of paths. Ideally, they should embed nearby.

More importantly, if u and v have many common friends but no direct edge, we might still want them to be similar. Two users who both talk to the same 50 people are probably in the same community, even if they've never interacted directly.

Richer Similarity: Multi-Hop Neighborhoods

Instead of asking "are u and v connected?", we should ask "do u and v have similar neighborhoods?" This captures community structure, common friends, and structural equivalence. The key insight of random-walk embeddings is that you can measure neighborhood overlap without computing it exactly — just sample paths and see which nodes co-occur.

The right question: Not "are u and v adjacent?" but "if I start a random walk from u, how likely am I to visit v?" This multi-hop, probabilistic notion of similarity is richer, more robust, and captures long-range structure that adjacency ignores.
Similarity DefinitionWhat It CapturesMethod
A[u,v] = 1 (edge exists)Direct connection onlyLaplacian Eigenmaps
Common neighbors: |N(u) ∩ N(v)|Local communitySpectral clustering
Random walk co-occurrence PR(v|u)Multi-hop neighborhood structureDeepWalk, node2vec
Graph kernel similaritySubgraph patternsGraph kernel methods
If we define similarity(u,v) = A[u,v] (adjacency), then training Z to minimize reconstruction error is equivalent to what classical operation on A?

Chapter 3: Random Walk Embeddings

Here's the big idea. Instead of defining similarity as "are u and v adjacent?", we define it as: how likely are u and v to co-occur on a short random walk?

A random walk starts at some node, then repeatedly moves to a randomly chosen neighbor. After L steps, you've visited L+1 nodes. Run many such walks from every node in the graph. If nodes u and v both appear in many of the same walks, they are "similar" — they share structural context.

The Word2Vec analogy: This is exactly how Word2Vec works on text. Replace "words in a sentence" with "nodes in a random walk." A sentence is a sequence of words that co-occur. A random walk is a sequence of nodes that co-occur. Node embeddings trained this way inherit all the nice properties of word embeddings — analogies, clustering, linear interpolation.

The Random Walk Objective

Let NR(u) be the multiset of nodes visited during random walks starting from u. We want to find embeddings that maximize the probability of observing NR(u) given u's embedding:

maxZu ∈ Vv ∈ NR(u) log P(v | zu)

The conditional probability P(v|zu) is modeled as a softmax over dot products:

P(v | zu) = exp(zvT zu) / ∑n ∈ V exp(znT zu)

The softmax says: node v is likely given u if zu · zv is large compared to all other dot products. The denominator normalizes over ALL nodes — every node competes to explain the observation.

Why This Objective Makes Sense

For each node u, we're asking: "Given that I'm at u, what nodes should I expect to see on a random walk from u?" We're training embeddings to answer this question correctly. Nodes that u visits often (nearby neighbors, frequent co-travelers) will have large dot products with zu. Nodes u never visits will have small (or negative) dot products.

Step 1
Run T short random walks of length L from each node u
Step 2
For each u, collect NR(u) = all nodes visited across all walks from u
Step 3
Optimize Z to maximize Σu Σv ∈ NR(u) log P(v|zu)
Result
Z = embedding matrix. Column v = zv = embedding of node v
In the random walk objective, what does NR(u) represent?

Chapter 4: DeepWalk

DeepWalk (Perozzi et al., KDD 2014) is the algorithm that made random walk embeddings practical. The idea is beautifully simple: use uniform random walks. At each step, choose a neighbor uniformly at random. No bias, no preference — pure diffusion.

At each step, the walk moves to a randomly chosen neighbor with equal probability. On a node with degree k, each neighbor has probability 1/k of being chosen. Over many steps, this is equivalent to a random surfer exploring the graph without any preference for direction.

The Word2Vec connection: Once you have your random walk sequences, you run the skip-gram algorithm on them — the same algorithm used to train Word2Vec on text. The "words" are nodes; the "sentences" are random walks. The skip-gram objective is: given a node (word), predict the nodes (words) in a fixed window around it in the walk (sentence).

The DeepWalk Algorithm

python
def deepwalk(G, walk_length=80, walks_per_node=10, d=128, window=5):
    # Step 1: Generate random walk corpus
    walks = []
    for node in G.nodes():
        for _ in range(walks_per_node):
            walk = [node]
            for _ in range(walk_length - 1):
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors: break
                walk.append(random.choice(neighbors))  # uniform
            walks.append(walk)

    # Step 2: Train skip-gram on walks (same as Word2Vec)
    model = Word2Vec(walks, vector_size=d, window=window,
                      min_count=0, sg=1)  # sg=1 means skip-gram

    # Step 3: Extract embeddings
    embeddings = {node: model.wv[node] for node in G.nodes()}
    return embeddings  # dict: node -> np.array of shape (d,)

Data Flow

StageInputOutputShape
Walk generationGraph G with |V| nodesWalk corpus|V|×T walks of length L
Skip-gram trainingWalk corpusEmbedding matrix Z|V| × d
Node lookupNode ID v (int)Embedding zvd-dimensional vector
DeepWalk: Live Random Walk Explorer

Click a node to start a random walk from it. Watch the walk propagate across the graph. Co-occurrence counts build up as you run more walks — nodes that co-occur often will end up with similar embeddings.

Click a node to start, then run walks. Brighter edges = more co-occurrences.
What makes DeepWalk's random walks "uniform"?

Chapter 5: node2vec — Biased Random Walks

DeepWalk's uniform walks have a hidden assumption: all neighbors are equally relevant. But in practice, there are two very different things you might want your embeddings to capture:

  1. Homophily: Nodes in the same community should embed nearby. User A and User B who are in the same friend group have similar embeddings, even if they've never interacted.
  2. Structural equivalence: Nodes that play the same role in the graph should embed nearby. The central hub of each star-shaped cluster is "structurally equivalent," even if the clusters are far apart in the graph.

Grover and Leskovec (KDD 2016) showed that these two objectives require fundamentally different walk strategies. Homophily requires walks that stay local — BFS-style. Structural equivalence requires walks that explore far — DFS-style.

The key insight of node2vec: Add two parameters p and q that smoothly control whether the random walk explores like BFS (stays in local neighborhood) or DFS (ventures farther out). By tuning p and q, you can optimize for either homophily or structural equivalence.

The Biased Walk: Second-Order Markov

After traversing edge (t → v) and arriving at node v, the next step's transition probability from v depends on the distance to the previous node t. Let w be a neighbor of v we're considering moving to:

Distance from t to wTransition weightInterpretation
dist(t, w) = 0 (w = t, return)1/pReturn to where we came from
dist(t, w) = 1 (w is neighbor of both t and v)1Stay at same distance from t
dist(t, w) = 2 (w is farther from t)1/qMove away from t

These unnormalized weights are then softmaxed over all neighbors of v to get proper transition probabilities.

BFS vs. DFS: How p and q Control It

Low q (q < 1), any p → DFS-like

Weight 1/q is LARGE for moving away from t. The walk keeps exploring outward, visiting nodes farther and farther from the origin. Captures structural equivalence — the walk sees the large-scale structure.

Low p (p < 1), any q → BFS-like

Weight 1/p is LARGE for returning to t. The walk stays tightly around the origin, revisiting local nodes. Captures homophily — the walk samples the local community densely.

node2vec: BFS vs. DFS Walk Explorer ★ SHOWCASE

Adjust p and q to see how walk behavior changes. Low q = DFS (explores far, teal trail). Low p = BFS (stays local, warm trail). The heatmap shows which nodes get visited most from the selected start node.

p (return) 1.0
q (in-out) 1.0
Adjust p and q, then run walks. Node brightness = visit frequency.
You want embeddings where nodes in the same community cluster together (homophily). Which p,q setting should you use?

Chapter 6: Optimization

We have our objective: maximize Σu Σv ∈ NR(u) log P(v|zu). The problem: the softmax denominator sums over ALL nodes.

P(v | zu) = exp(zvT zu) / n ∈ V exp(znT zu)

For a graph with |V| = 1,000,000 nodes, computing this denominator requires one million dot products — for every (u, v) pair in the training data. This is completely intractable.

Negative Sampling

The fix: instead of normalizing over all nodes, use negative sampling. For each positive pair (u, v) that co-occurs in a walk, sample k random "negative" nodes that probably did NOT co-occur. The loss becomes:

log σ(zvT zu) + ∑i=1k Eni ~ PV[log σ(−zniT zu)]

Breaking this down:

Instead of comparing against all |V| nodes, we compare against just k negative samples. With k = 5 to 20, we get a good approximation at a tiny fraction of the cost.

The Negative Sampling Distribution

How should we choose negative samples? Naively, uniform sampling over all nodes. But this over-samples high-degree nodes (hubs appear everywhere). The standard practice: sample nodes with probability proportional to degree3/4. This keeps frequent nodes represented but reduces their dominance:

PV(n) ∝ degree(n)3/4
python
import numpy as np

def negative_sampling_loss(z_u, z_v, Z, degrees, k=10):
    # z_u, z_v: positive pair embeddings, shape (d,)
    # Z: all embeddings, shape (|V|, d)
    # degrees: array of node degrees, shape (|V|,)

    # Positive term: log sigma(z_u · z_v)
    pos = np.log(sigmoid(z_u @ z_v))

    # Sample k negative nodes (proportional to degree^0.75)
    probs = degrees ** 0.75
    probs /= probs.sum()
    neg_idx = np.random.choice(len(degrees), size=k, p=probs)

    # Negative terms: sum of log sigma(-z_u · z_n)
    neg = np.mean(np.log(sigmoid(-Z[neg_idx] @ z_u)))

    return -(pos + neg)  # minimize negative loss
k = 5 to 20 works well in practice. The original Word2Vec paper recommends k = 5 for small datasets and k = 2–5 for large ones. DeepWalk and node2vec both use k = 5–20. With k = 10, you compute 10+1 = 11 dot products instead of |V| — orders of magnitude faster.
Why is the exact softmax normalization ∑n exp(znT zu) intractable for large graphs?

Chapter 7: Matrix Factorization View

Here's a surprising result: DeepWalk is not actually doing something fundamentally new. Under the hood, it's factorizing a very specific matrix derived from the graph. This connection was proved by Levy & Goldberg (2014) for Word2Vec and extended to graphs by Qiu et al. (2018) for NetMF.

What Matrix Does DeepWalk Factorize?

For a graph with adjacency matrix A and degree matrix D = diag(degree(v)), define the transition matrix M = D−1A. Then Mr[u, v] is the probability of going from u to v in exactly r steps. The r-step random walk co-occurrence matrix is the average over all walk lengths from 1 to T:

(1/T) ∑r=1T Mr

Qiu et al. proved that DeepWalk with T-length walks implicitly factorizes the matrix:

ZTZ ≈ log&left;(vol(G) / T · ∑r=1T (D−1A)r · D−1&right;) − log b

where vol(G) = ∑v degree(v) is the total degree sum and b is the number of negative samples. This matrix — a shifted log of the average transition probability matrix — is called the PMI (Pointwise Mutual Information) matrix of the random walk co-occurrences.

Why this matters: Matrix factorization methods (SVD, eigendecomposition) are classical, well-understood, and efficiently computable. Knowing that DeepWalk approximates matrix factorization tells us when it will fail (when the PMI matrix has low rank), when it will succeed, and how to improve it. It also unifies random walk embeddings with the older spectral graph theory literature.

Spectral Methods: The Original Approach

Before DeepWalk, spectral embeddings were standard. The idea: compute the top-d eigenvectors of the normalized Laplacian L = I − D−1/2AD−1/2. This gives the same kind of result — nearby nodes embed nearby — but is limited to linear structure and doesn't generalize well to unseen nodes.

MethodObjectiveComplexityInductive?
Laplacian EigenmapsZTZ ≈ AO(|V|2)No
DeepWalkPMI factorization, random walksO(|V| · T · L)No
node2vecBiased PMI factorizationO(|V| · T · L)No
GNNs (Lec 4+)Learned aggregationO(|E| · d)Yes
DeepWalk implicitly factorizes a matrix related to what quantity?

Chapter 8: Graph-Level Embeddings

So far, we've embedded individual nodes. But sometimes you need to embed an entire graph as a single vector — to classify a molecule as toxic or safe, to compare two protein interaction networks, to detect communities. How do you go from node embeddings to a graph embedding?

Approach 1: Sum / Average Pooling

The simplest idea: once you have node embeddings, just add them all up (or take their mean):

zG = ∑v ∈ V zv     or     zG = (1/|V|) ∑v ∈ V zv

This is permutation-invariant (doesn't depend on node ordering) and simple. But it's a very coarse summary — it collapses all structural information into a fixed-size vector by pure averaging. Two completely different graphs can have the same sum of node embeddings.

Approach 2: Virtual Node

Add a virtual "super-node" connected to every real node. Train embeddings on the augmented graph. The virtual node's embedding, because it's connected to everyone, must encode the global structure. After training, the virtual node's embedding IS the graph embedding.

Analogy: The virtual node is like a committee chair who listens to all members. Its embedding summarizes the "consensus" of the whole graph. Unlike simple averaging, the committee chair can weight different members differently through the random walk process.

Approach 3: Hierarchical Pooling (DiffPool)

The most principled approach: hierarchical clustering then pooling. DiffPool (Ying et al., NeurIPS 2018) simultaneously learns node embeddings and cluster assignments. At each level, it:

  1. Assigns each node to a cluster using a GNN
  2. Computes cluster embeddings by summing node embeddings in each cluster
  3. Builds a new "coarsened" graph where clusters are nodes
  4. Repeats until the graph collapses to a single node

The single remaining node's embedding is the graph embedding. This captures hierarchical structure — which clusters exist, how they connect — rather than just averaging everything.

Graph Pooling Strategies

Three strategies for obtaining a graph-level embedding from node embeddings. Click each method to see the pooling operation illustrated.

Select a pooling strategy above.
What is the key advantage of the "virtual node" approach over simple mean pooling for graph-level embeddings?

Chapter 9: Connections & What's Next

Node embeddings are powerful, but they have a fundamental limitation: the embedding matrix Z is just a lookup table. You can only embed nodes you've seen during training. Add a new node to the graph? You'd have to retrain from scratch. This is called a transductive setting — the model doesn't generalize to unseen inputs.

The Next Step: Inductive Embeddings with GNNs

Graph Neural Networks (Lectures 4+) replace the lookup table encoder with a learned aggregation function. Instead of storing an embedding for each node, a GNN computes a node's embedding on-the-fly by aggregating its neighbors' features. New nodes can be embedded immediately — just run the GNN on their neighborhood. This is the inductive setting.

The key limitation of this lecture's methods: They are transductive — embeddings can only be computed for nodes seen at training time. GNNs solve this. But random walk embeddings remain useful as simple, scalable baselines and as pretraining methods before GNN fine-tuning.

PinSage: Random Walks at Scale

Pinterest's PinSage (Ying et al., KDD 2018) combines random walk ideas with GNNs. It uses random walks to define the neighborhood (who to aggregate from) rather than using all direct neighbors — this is crucial for efficiency on billion-node graphs where full neighborhood aggregation is infeasible.

Limitations and When These Methods Fail

LimitationImpactSolution
TransductiveCannot embed new nodesGNNs (inductive)
Ignores node featuresMisses attribute informationFeature-based GNNs
Parameter-heavyO(|V|×d) parametersGNN weights scale with depth, not nodes
No structural reasoningCan't generalize patternsGraph Transformers, GNN with structural PE

Related Papers

Perozzi et al., KDD 2014. The original random walk embedding paper.
↓ extends
Grover & Leskovec, KDD 2016. Biased walks with p,q parameters.
↓ unified by
NetMF
Qiu et al., WSDM 2018. Proves DeepWalk = matrix factorization of PMI.
↓ superseded by
GNNs (Lec 4+)
Kipf & Welling 2016; Hamilton et al. 2017. Inductive, feature-aware, no lookup tables.

What This Lecture Covered

ConceptKey Idea
Encoder-DecoderENC(v) = zv, DEC(zu, zv) = dot product ≈ similarity
SimilarityAdjacency → random walk co-occurrence → richer structure
DeepWalkUniform walks + skip-gram = PMI matrix factorization
node2vecp controls return, q controls exploration; BFS vs. DFS
Negative SamplingReplace full softmax with k negative examples (k = 5–20)
Graph-levelSum/virtual node/hierarchical pooling of node embeddings
"The goal is not to have a technique but to solve problems." — Jure Leskovec

Next up: Lecture 3 covers Graph Neural Networks, where the encoder becomes a deep network aggregating neighbors rather than a lookup table. The decoder stays the same.