CS224W Lecture 2 — Node Embeddings

Chapter 0: The Problem

Suppose you want to classify nodes in a social network — spam accounts vs. real users. Your first instinct: compute some features for each node and feed them to a classifier. You choose degree (how many friends), clustering coefficient (how tight-knit the neighborhood), PageRank (how influential).

These features work, somewhat. But they're designed by hand. If you're doing a new task — say, link prediction instead of classification — you might need completely different features. Degree might not matter; triangle closure might. You're back to engineering features from scratch.

Worse, all of these features are summaries. They compress the node's neighborhood into a single number, losing enormous amounts of structural information. Two nodes that play completely different roles in the graph might happen to have the same degree and the same clustering coefficient.

The core question: Can we learn features automatically — features that encode graph structure without us having to decide what to measure? Yes. That's node embedding.

The Goal

Map every node v to a vector z_v in d-dimensional space (d = 64, 128, 256...) such that nodes that are similar in the graph end up nearby in vector space. "Similar" will be defined precisely later — but the intuition is: if u and v are close in the graph, then z_u and z_v should be close in Euclidean space.

similarity(u, v) ≈ z_v^T z_u

The dot product z_v^T z_u measures how aligned two vectors are. High dot product = nearby in embedding space = similar in the graph. We want this to hold automatically, without hand-crafting anything.

The Embedding Idea

Nodes in a graph (left) are mapped to vectors in 2D embedding space (right). Similar nodes — same color cluster — land close together. Click "Embed" to see the mapping.

Graph view. Click Embed to project nodes into 2D vector space.

What is the main limitation of hand-designed graph features like degree or clustering coefficient?

They are computationally expensive to compute They must be redesigned for each task, and compress rich structure into single numbers — losing information They only work for directed graphs

Chapter 1: Encoder-Decoder Framework

We need a unified language for talking about node embeddings. The encoder-decoder framework gives us exactly that. Every embedding method can be described by: (1) how it maps nodes to vectors (the encoder), and (2) what it's trying to reconstruct from those vectors (the decoder).

The Encoder

The encoder ENC(v) takes a node and returns its embedding vector:

ENC(v) = z_v ∈ ℝ^d

The simplest possible encoder: an embedding lookup table. Store a d-dimensional vector for every node in the graph. Looking up node v means returning column v of a matrix Z of size d × |V|. Or equivalently: multiply Z by a one-hot vector that has a 1 only at position v.

ENC(v) = Z · v (Z ∈ ℝ^{d × |V|}, v is one-hot)

Think of Z as a learnable parameter matrix. Each column is one node's embedding. Training means finding the Z that makes embeddings useful.

The Decoder

The decoder takes two embedding vectors and produces a score measuring how similar those nodes should be:

DEC(z_u, z_v) = z_u^T z_v

Just the dot product. Simple, differentiable, captures alignment between vectors. If two vectors point in the same direction, their dot product is large and positive — they're "similar." If they're perpendicular, the dot product is zero — no relationship. If they point opposite directions, the dot product is negative.

The Objective

Learn the encoder (Z) such that the decoder output (dot product) approximates node similarity in the graph:

DEC(z_u, z_v) = z_u^T z_v ≈ similarity(u, v)

The key question becomes: what is "similarity"? That's what distinguishes different embedding methods. We defer that question to the next chapter.

What we're NOT doing yet: The encoder here is just a lookup table — no neural network, no aggregation from neighbors. Chapters 3–5 add that. GNNs (Lectures 4+) replace the lookup encoder with a deep network that computes each node's embedding from its neighborhood structure. But the encoder-decoder framework stays the same.

Encoder-Decoder Diagram

Click two nodes to see how their embeddings are computed (encoder) and how their similarity score is produced (decoder).

Click two nodes to compute their similarity.

In the lookup-table encoder ENC(v) = Z·v, what does each column of Z represent?

A feature dimension, shared across all nodes The d-dimensional embedding vector for one specific node A graph convolutional filter

Chapter 2: What Is "Similarity"?

We said we want the dot product z_u^T z_v to approximate "node similarity." But what IS node similarity? This turns out to be the most important design choice in any embedding method.

The Simplest Definition: Adjacency

The most obvious choice: u and v are similar if and only if there is an edge between them. Then:

z_u^T z_v ≈ A[u, v]

where A is the adjacency matrix. This means we want Z^TZ ≈ A. We want the matrix of all pairwise dot products to approximate the adjacency matrix. This is literally matrix factorization of the adjacency matrix — and it's a classical method called Laplacian Eigenmaps.

Matrix factorization view: Finding Z such that Z^TZ ≈ A is equivalent to computing the top-d eigenvectors of A. The resulting embeddings preserve "direct connection" similarity — connected nodes embed nearby, unconnected nodes embed far apart.

Why Adjacency Alone Is Too Simple

Consider two nodes at the ends of a long chain: A—B—C—D—E. Nodes A and E are not adjacent, so under adjacency-based similarity, they're "dissimilar" — their embeddings should be far apart. But structurally, A and E play the same role. They're both leaves at the end of paths. Ideally, they should embed nearby.

More importantly, if u and v have many common friends but no direct edge, we might still want them to be similar. Two users who both talk to the same 50 people are probably in the same community, even if they've never interacted directly.

Richer Similarity: Multi-Hop Neighborhoods

Instead of asking "are u and v connected?", we should ask "do u and v have similar neighborhoods?" This captures community structure, common friends, and structural equivalence. The key insight of random-walk embeddings is that you can measure neighborhood overlap without computing it exactly — just sample paths and see which nodes co-occur.

The right question: Not "are u and v adjacent?" but "if I start a random walk from u, how likely am I to visit v?" This multi-hop, probabilistic notion of similarity is richer, more robust, and captures long-range structure that adjacency ignores.

Similarity Definition	What It Captures	Method
A[u,v] = 1 (edge exists)	Direct connection only	Laplacian Eigenmaps
Common neighbors: \|N(u) ∩ N(v)\|	Local community	Spectral clustering
Random walk co-occurrence P_R(v\|u)	Multi-hop neighborhood structure	DeepWalk, node2vec
Graph kernel similarity	Subgraph patterns	Graph kernel methods

If we define similarity(u,v) = A[u,v] (adjacency), then training Z to minimize reconstruction error is equivalent to what classical operation on A?

Inverting the adjacency matrix Matrix factorization: finding Z such that Z^TZ ≈ A Computing the degree matrix D = diag(degree(v))

Chapter 3: Random Walk Embeddings

Here's the big idea. Instead of defining similarity as "are u and v adjacent?", we define it as: how likely are u and v to co-occur on a short random walk?

A random walk starts at some node, then repeatedly moves to a randomly chosen neighbor. After L steps, you've visited L+1 nodes. Run many such walks from every node in the graph. If nodes u and v both appear in many of the same walks, they are "similar" — they share structural context.

The Word2Vec analogy: This is exactly how Word2Vec works on text. Replace "words in a sentence" with "nodes in a random walk." A sentence is a sequence of words that co-occur. A random walk is a sequence of nodes that co-occur. Node embeddings trained this way inherit all the nice properties of word embeddings — analogies, clustering, linear interpolation.

The Random Walk Objective

Let N_R(u) be the multiset of nodes visited during random walks starting from u. We want to find embeddings that maximize the probability of observing N_R(u) given u's embedding:

max_Z ∑_{u ∈ V} ∑_{v ∈ N_R(u)} log P(v | z_u)

The conditional probability P(v|z_u) is modeled as a softmax over dot products:

P(v | z_u) = exp(z_v^T z_u) / ∑_{n ∈ V} exp(z_n^T z_u)

The softmax says: node v is likely given u if z_u · z_v is large compared to all other dot products. The denominator normalizes over ALL nodes — every node competes to explain the observation.

Why This Objective Makes Sense

For each node u, we're asking: "Given that I'm at u, what nodes should I expect to see on a random walk from u?" We're training embeddings to answer this question correctly. Nodes that u visits often (nearby neighbors, frequent co-travelers) will have large dot products with z_u. Nodes u never visits will have small (or negative) dot products.

Step 1

Run T short random walks of length L from each node u

↓

Step 2

For each u, collect N_R(u) = all nodes visited across all walks from u

↓

Step 3

Optimize Z to maximize Σ_u Σ_{v ∈ N_R(u)} log P(v|z_u)

↓

Result

Z = embedding matrix. Column v = z_v = embedding of node v

In the random walk objective, what does N_R(u) represent?

The set of nodes directly adjacent to u in the graph The top-k nodes with the highest dot product with z_u The multiset of nodes visited during random walks starting from u

Chapter 4: DeepWalk

DeepWalk (Perozzi et al., KDD 2014) is the algorithm that made random walk embeddings practical. The idea is beautifully simple: use uniform random walks. At each step, choose a neighbor uniformly at random. No bias, no preference — pure diffusion.

At each step, the walk moves to a randomly chosen neighbor with equal probability. On a node with degree k, each neighbor has probability 1/k of being chosen. Over many steps, this is equivalent to a random surfer exploring the graph without any preference for direction.

The Word2Vec connection: Once you have your random walk sequences, you run the skip-gram algorithm on them — the same algorithm used to train Word2Vec on text. The "words" are nodes; the "sentences" are random walks. The skip-gram objective is: given a node (word), predict the nodes (words) in a fixed window around it in the walk (sentence).

The DeepWalk Algorithm

python
def deepwalk(G, walk_length=80, walks_per_node=10, d=128, window=5):
    # Step 1: Generate random walk corpus
    walks = []
    for node in G.nodes():
        for _ in range(walks_per_node):
            walk = [node]
            for _ in range(walk_length - 1):
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors: break
                walk.append(random.choice(neighbors))  # uniform
            walks.append(walk)

    # Step 2: Train skip-gram on walks (same as Word2Vec)
    model = Word2Vec(walks, vector_size=d, window=window,
                      min_count=0, sg=1)  # sg=1 means skip-gram

    # Step 3: Extract embeddings
    embeddings = {node: model.wv[node] for node in G.nodes()}
    return embeddings  # dict: node -> np.array of shape (d,)

Data Flow

Stage	Input	Output	Shape
Walk generation	Graph G with \|V\| nodes	Walk corpus	\|V\|×T walks of length L
Skip-gram training	Walk corpus	Embedding matrix Z	\|V\| × d
Node lookup	Node ID v (int)	Embedding z_v	d-dimensional vector

DeepWalk: Live Random Walk Explorer

Click a node to start a random walk from it. Watch the walk propagate across the graph. Co-occurrence counts build up as you run more walks — nodes that co-occur often will end up with similar embeddings.

Click a node to start, then run walks. Brighter edges = more co-occurrences.

What makes DeepWalk's random walks "uniform"?

It samples the same number of walks from every node At each step, it chooses among neighbors with equal probability — no bias toward any direction It always uses walks of the same length L

Chapter 5: node2vec — Biased Random Walks

DeepWalk's uniform walks have a hidden assumption: all neighbors are equally relevant. But in practice, there are two very different things you might want your embeddings to capture:

Homophily: Nodes in the same community should embed nearby. User A and User B who are in the same friend group have similar embeddings, even if they've never interacted.
Structural equivalence: Nodes that play the same role in the graph should embed nearby. The central hub of each star-shaped cluster is "structurally equivalent," even if the clusters are far apart in the graph.

Grover and Leskovec (KDD 2016) showed that these two objectives require fundamentally different walk strategies. Homophily requires walks that stay local — BFS-style. Structural equivalence requires walks that explore far — DFS-style.

The key insight of node2vec: Add two parameters p and q that smoothly control whether the random walk explores like BFS (stays in local neighborhood) or DFS (ventures farther out). By tuning p and q, you can optimize for either homophily or structural equivalence.

The Biased Walk: Second-Order Markov

After traversing edge (t → v) and arriving at node v, the next step's transition probability from v depends on the distance to the previous node t. Let w be a neighbor of v we're considering moving to:

Distance from t to w	Transition weight	Interpretation
dist(t, w) = 0 (w = t, return)	1/p	Return to where we came from
dist(t, w) = 1 (w is neighbor of both t and v)	1	Stay at same distance from t
dist(t, w) = 2 (w is farther from t)	1/q	Move away from t

These unnormalized weights are then softmaxed over all neighbors of v to get proper transition probabilities.

BFS vs. DFS: How p and q Control It

Low q (q < 1), any p → DFS-like

Weight 1/q is LARGE for moving away from t. The walk keeps exploring outward, visiting nodes farther and farther from the origin. Captures structural equivalence — the walk sees the large-scale structure.

Low p (p < 1), any q → BFS-like

Weight 1/p is LARGE for returning to t. The walk stays tightly around the origin, revisiting local nodes. Captures homophily — the walk samples the local community densely.

node2vec: BFS vs. DFS Walk Explorer ★ SHOWCASE

Adjust p and q to see how walk behavior changes. Low q = DFS (explores far, teal trail). Low p = BFS (stays local, warm trail). The heatmap shows which nodes get visited most from the selected start node.

p (return) 1.0

q (in-out) 1.0

Adjust p and q, then run walks. Node brightness = visit frequency.

You want embeddings where nodes in the same community cluster together (homophily). Which p,q setting should you use?

Low p (encourages return to local neighborhood) — BFS-like walks that sample the community densely Low q (encourages moving away) — DFS-like walks that explore far High p and high q — uniform walks (same as DeepWalk)

Chapter 6: Optimization

We have our objective: maximize Σ_u Σ_{v ∈ N_R(u)} log P(v|z_u). The problem: the softmax denominator sums over ALL nodes.

P(v | z_u) = exp(z_v^T z_u) / ∑_{n ∈ V} exp(z_n^T z_u)

For a graph with |V| = 1,000,000 nodes, computing this denominator requires one million dot products — for every (u, v) pair in the training data. This is completely intractable.

Negative Sampling

The fix: instead of normalizing over all nodes, use negative sampling. For each positive pair (u, v) that co-occurs in a walk, sample k random "negative" nodes that probably did NOT co-occur. The loss becomes:

log σ(z_v^T z_u) + ∑_i=1^k E_{n_i ~ P_V}[log σ(−z_{n_i}^T z_u)]

Breaking this down:

log σ(z_v^T z_u): Maximize the dot product between positive pairs (push them together)
log σ(−z_n^T z_u): Minimize the dot product with negative samples (push them apart)

Instead of comparing against all |V| nodes, we compare against just k negative samples. With k = 5 to 20, we get a good approximation at a tiny fraction of the cost.

The Negative Sampling Distribution

How should we choose negative samples? Naively, uniform sampling over all nodes. But this over-samples high-degree nodes (hubs appear everywhere). The standard practice: sample nodes with probability proportional to degree^3/4. This keeps frequent nodes represented but reduces their dominance:

P_V(n) ∝ degree(n)^3/4

python
import numpy as np

def negative_sampling_loss(z_u, z_v, Z, degrees, k=10):
    # z_u, z_v: positive pair embeddings, shape (d,)
    # Z: all embeddings, shape (|V|, d)
    # degrees: array of node degrees, shape (|V|,)

    # Positive term: log sigma(z_u · z_v)
    pos = np.log(sigmoid(z_u @ z_v))

    # Sample k negative nodes (proportional to degree^0.75)
    probs = degrees ** 0.75
    probs /= probs.sum()
    neg_idx = np.random.choice(len(degrees), size=k, p=probs)

    # Negative terms: sum of log sigma(-z_u · z_n)
    neg = np.mean(np.log(sigmoid(-Z[neg_idx] @ z_u)))

    return -(pos + neg)  # minimize negative loss

k = 5 to 20 works well in practice. The original Word2Vec paper recommends k = 5 for small datasets and k = 2–5 for large ones. DeepWalk and node2vec both use k = 5–20. With k = 10, you compute 10+1 = 11 dot products instead of |V| — orders of magnitude faster.

Why is the exact softmax normalization ∑_n exp(z_n^T z_u) intractable for large graphs?

It requires summing over all |V| nodes — a million dot products per training step The exponential function overflows for large dot products It requires inverting the adjacency matrix

Chapter 7: Matrix Factorization View

Here's a surprising result: DeepWalk is not actually doing something fundamentally new. Under the hood, it's factorizing a very specific matrix derived from the graph. This connection was proved by Levy & Goldberg (2014) for Word2Vec and extended to graphs by Qiu et al. (2018) for NetMF.

What Matrix Does DeepWalk Factorize?

For a graph with adjacency matrix A and degree matrix D = diag(degree(v)), define the transition matrix M = D⁻¹A. Then M^r[u, v] is the probability of going from u to v in exactly r steps. The r-step random walk co-occurrence matrix is the average over all walk lengths from 1 to T:

(1/T) ∑_r=1^T M^r

Qiu et al. proved that DeepWalk with T-length walks implicitly factorizes the matrix:

Z^TZ ≈ log((vol(G) / T · ∑_r=1^T (D⁻¹A)^r · D⁻¹)) − log b

where vol(G) = ∑_v degree(v) is the total degree sum and b is the number of negative samples. This matrix — a shifted log of the average transition probability matrix — is called the PMI (Pointwise Mutual Information) matrix of the random walk co-occurrences.

Why this matters: Matrix factorization methods (SVD, eigendecomposition) are classical, well-understood, and efficiently computable. Knowing that DeepWalk approximates matrix factorization tells us when it will fail (when the PMI matrix has low rank), when it will succeed, and how to improve it. It also unifies random walk embeddings with the older spectral graph theory literature.

Spectral Methods: The Original Approach

Before DeepWalk, spectral embeddings were standard. The idea: compute the top-d eigenvectors of the normalized Laplacian L = I − D^−1/2AD^−1/2. This gives the same kind of result — nearby nodes embed nearby — but is limited to linear structure and doesn't generalize well to unseen nodes.

Method	Objective	Complexity	Inductive?
Laplacian Eigenmaps	Z^TZ ≈ A	O(\|V\|²)	No
DeepWalk	PMI factorization, random walks	O(\|V\| · T · L)	No
node2vec	Biased PMI factorization	O(\|V\| · T · L)	No
GNNs (Lec 4+)	Learned aggregation	O(\|E\| · d)	Yes

DeepWalk implicitly factorizes a matrix related to what quantity?

The raw adjacency matrix A The log of the average multi-step transition probability matrix (the PMI of random walk co-occurrences) The graph Laplacian L = D − A

Chapter 8: Graph-Level Embeddings

So far, we've embedded individual nodes. But sometimes you need to embed an entire graph as a single vector — to classify a molecule as toxic or safe, to compare two protein interaction networks, to detect communities. How do you go from node embeddings to a graph embedding?

Approach 1: Sum / Average Pooling

The simplest idea: once you have node embeddings, just add them all up (or take their mean):

z_G = ∑_{v ∈ V} z_v or z_G = (1/|V|) ∑_{v ∈ V} z_v

This is permutation-invariant (doesn't depend on node ordering) and simple. But it's a very coarse summary — it collapses all structural information into a fixed-size vector by pure averaging. Two completely different graphs can have the same sum of node embeddings.

Approach 2: Virtual Node

Add a virtual "super-node" connected to every real node. Train embeddings on the augmented graph. The virtual node's embedding, because it's connected to everyone, must encode the global structure. After training, the virtual node's embedding IS the graph embedding.

Analogy: The virtual node is like a committee chair who listens to all members. Its embedding summarizes the "consensus" of the whole graph. Unlike simple averaging, the committee chair can weight different members differently through the random walk process.

Approach 3: Hierarchical Pooling (DiffPool)

The most principled approach: hierarchical clustering then pooling. DiffPool (Ying et al., NeurIPS 2018) simultaneously learns node embeddings and cluster assignments. At each level, it:

Assigns each node to a cluster using a GNN
Computes cluster embeddings by summing node embeddings in each cluster
Builds a new "coarsened" graph where clusters are nodes
Repeats until the graph collapses to a single node

The single remaining node's embedding is the graph embedding. This captures hierarchical structure — which clusters exist, how they connect — rather than just averaging everything.

Graph Pooling Strategies

Three strategies for obtaining a graph-level embedding from node embeddings. Click each method to see the pooling operation illustrated.

Select a pooling strategy above.

What is the key advantage of the "virtual node" approach over simple mean pooling for graph-level embeddings?

The virtual node can implicitly weight different nodes differently through the random walk process, rather than treating all nodes equally It reduces the number of embeddings that need to be stored It is faster to train than mean pooling

Chapter 9: Connections & What's Next

Node embeddings are powerful, but they have a fundamental limitation: the embedding matrix Z is just a lookup table. You can only embed nodes you've seen during training. Add a new node to the graph? You'd have to retrain from scratch. This is called a transductive setting — the model doesn't generalize to unseen inputs.

The Next Step: Inductive Embeddings with GNNs

Graph Neural Networks (Lectures 4+) replace the lookup table encoder with a learned aggregation function. Instead of storing an embedding for each node, a GNN computes a node's embedding on-the-fly by aggregating its neighbors' features. New nodes can be embedded immediately — just run the GNN on their neighborhood. This is the inductive setting.

The key limitation of this lecture's methods: They are transductive — embeddings can only be computed for nodes seen at training time. GNNs solve this. But random walk embeddings remain useful as simple, scalable baselines and as pretraining methods before GNN fine-tuning.

PinSage: Random Walks at Scale

Pinterest's PinSage (Ying et al., KDD 2018) combines random walk ideas with GNNs. It uses random walks to define the neighborhood (who to aggregate from) rather than using all direct neighbors — this is crucial for efficiency on billion-node graphs where full neighborhood aggregation is infeasible.

Limitations and When These Methods Fail

Limitation	Impact	Solution
Transductive	Cannot embed new nodes	GNNs (inductive)
Ignores node features	Misses attribute information	Feature-based GNNs
Parameter-heavy	O(\|V\|×d) parameters	GNN weights scale with depth, not nodes
No structural reasoning	Can't generalize patterns	Graph Transformers, GNN with structural PE

Related Papers

DeepWalk

Perozzi et al., KDD 2014. The original random walk embedding paper.

↓ extends

node2vec

Grover & Leskovec, KDD 2016. Biased walks with p,q parameters.

↓ unified by

NetMF

Qiu et al., WSDM 2018. Proves DeepWalk = matrix factorization of PMI.

↓ superseded by

GNNs (Lec 4+)

Kipf & Welling 2016; Hamilton et al. 2017. Inductive, feature-aware, no lookup tables.

What This Lecture Covered

Concept	Key Idea
Encoder-Decoder	ENC(v) = z_v, DEC(z_u, z_v) = dot product ≈ similarity
Similarity	Adjacency → random walk co-occurrence → richer structure
DeepWalk	Uniform walks + skip-gram = PMI matrix factorization
node2vec	p controls return, q controls exploration; BFS vs. DFS
Negative Sampling	Replace full softmax with k negative examples (k = 5–20)
Graph-level	Sum/virtual node/hierarchical pooling of node embeddings

"The goal is not to have a technique but to solve problems." — Jure Leskovec

Next up: Lecture 3 covers Graph Neural Networks, where the encoder becomes a deep network aggregating neighbors rather than a lookup table. The decoder stays the same.