DeepWalk — Veanors

Chapter 0: The Problem with Graph Features

You're classifying blog posts. Each blog belongs to one (or more) topics: politics, sports, tech, food. You have a social graph — blogs link to each other. Your intuition: blogs that link to each other are probably about the same topic. Nearby structure tells you something useful.

The standard approach in 2013: hand-engineer features. Count the number of common neighbors. Compute clustering coefficients. Calculate degree centrality. Write ten lines of domain-specific logic for each feature. Spend weeks. Get mediocre results. Then add a few more hand-crafted features. Repeat.

The deeper problem: hand-crafted features require you to know in advance what structure matters. But different classification tasks care about different structural signals. There's no single set of hand-crafted features that's universally good. You're essentially doing supervised feature engineering on top of an unsupervised problem.

The core question DeepWalk asks: Can we learn general-purpose node representations — embeddings — directly from graph structure, without any labels, so that these embeddings perform well across many downstream classification tasks?

DeepWalk's answer: yes, and the method comes from an unexpected place — natural language processing. Specifically, from Word2Vec, a technique for learning word embeddings from text corpora. The bridge between words and nodes is a beautiful insight that we'll build up step by step.

But first, let's feel the problem directly. The canvas below shows a small social graph. Nodes are colored by their true community. Your task: can you design features by hand that would let a classifier separate these communities? Notice how some nodes are deeply embedded in one community, while others sit on the boundary.

The Feature Engineering Problem

A small social graph with two communities (warm = community A, teal = community B). Click any node to see its hand-crafted features. Notice how boundary nodes are ambiguous.

Click a node to inspect its features.

Why are hand-crafted graph features problematic for general-purpose node classification?

Graph features are always noisy and unreliable It is impossible to compute graph statistics from an adjacency matrix Different tasks require different structural signals, so no fixed set of features generalizes well across tasks

Chapter 1: The Random Walk ↔ Language Analogy

Here's the core insight of the paper, stated plainly: a random walk on a graph is a sentence. Each node is a word. The sequence of nodes visited is the corpus.

Take a social network. Start at node 3. Walk randomly: 3 → 7 → 2 → 7 → 9 → 4. That's a "sentence": [3, 7, 2, 7, 9, 4]. Run thousands of such walks. You now have a text corpus made of node IDs. Feed it to Word2Vec. Get node embeddings.

This seems almost absurd. Why would a language model technique work on a graph? The paper's answer is empirical and then theoretical. Empirically: it works better than hand-crafted features. Theoretically: random walks on graphs obey the same statistical distribution as words in natural language — both follow a power law. (We'll explore this in Chapter 3.)

The word analogy: In language, words that appear in similar contexts have similar meanings ("cat" and "dog" both appear near "pet," "animal," "fur"). In a graph, nodes that appear in similar random walk contexts have similar graph roles. Word2Vec captures contextual similarity in language — it captures structural similarity in graphs.

What Word2Vec Actually Learns

Word2Vec (specifically the skip-gram variant) takes a word and tries to predict the words surrounding it in a window. "The cat sat on the mat" with window 2 around "sat": predict {cat, on} from "sat." After training on millions of sentences, the resulting vectors encode semantic relationships — famous king − man + woman ≈ queen.

Applied to graphs: given a node v seen in a random walk, predict the nodes that appear near v within a window of size w. After training on thousands of walks, the resulting vectors encode structural relationships — nodes that appear in similar walk contexts embed nearby.

Random Walk as Sentence

Watch a random walk unfold on the graph. Each visited node becomes a "word" in the walk-sentence shown below. The walk builds a corpus, just like reading a text document builds a word sequence.

Walk: (press Step or Run)

In the DeepWalk analogy, what plays the role of a "word" in the language model?

An edge between two nodes A single node visited during a random walk The entire graph adjacency matrix

Chapter 2: Skip-Gram on Graphs

The skip-gram objective: given a node v at position i in a walk, predict all nodes within a window of size w around it. For walk [3, 7, 2, 7, 9, 4] with w=1 and center node at position 2 (node 2), the context nodes are {7, 7} — the immediate neighbors in the walk sequence.

Formally, for each walk of length t and each node v_i in that walk, we want to maximize:

∑_i=1^t ∑_{-w ≤ j ≤ w, j ≠ 0} log Pr(v_i+j | v_i)

This is summed over all walks. Each term asks: how likely is the context node v_i+j given the center node v_i? We model this likelihood with a softmax over all nodes in the graph.

The Softmax

The probability of observing node u as context given center node v is:

Pr(u | v) = exp(e_u · e_v) / ∑_{k ∈ V} exp(e_k · e_v)

Here e_v is the embedding vector for node v. The numerator rewards high dot product between embeddings of co-occurring nodes. The denominator normalizes over the entire graph — which is expensive when the graph has millions of nodes. (That's why DeepWalk uses hierarchical softmax, covered in Chapter 4.)

What gradient descent does: Every time nodes u and v co-occur in a walk window, their embeddings are nudged closer together. Nodes that never co-occur drift apart. After thousands of walks, the geometry of the embedding space mirrors the co-occurrence structure of the graph.

The Showcase: Walks Become Context Pairs

The simulation below shows the full pipeline: random walk → sliding window → (center, context) training pairs → skip-gram update. Watch how a single walk generates many training examples, and how repeated updates pull co-occurring node embeddings closer.

DeepWalk Pipeline — Walks to Embeddings

Click "Generate Walk" to sample a random walk. Then "Extract Pairs" to see the skip-gram training pairs from that walk. Then "Update Embeddings" to see the embedding space shift. Run multiple times to watch embeddings converge.

Window w 2

With walk [A, B, C, D, E] and window size w=2, what are the context nodes for center node C?

Only B and D (immediate neighbors in the walk) A, B, D, E (all nodes within 2 steps of C in the walk) All other nodes in the graph

Chapter 3: The Power Law Connection

Why does a language model technique work on graphs? The paper's theoretical justification is elegant: the two domains share the same statistical structure.

In natural language, word frequencies follow Zipf's law: the most common word appears roughly twice as often as the second most common, three times as often as the third, and so on. If you plot word frequency vs. rank on a log-log scale, you get a straight line — a power law. "The" appears millions of times; "ephemeral" appears rarely.

Now look at random walks on a social network. How often does each node appear across all random walks? The popular, well-connected nodes appear frequently — they're hard to avoid. The peripheral nodes appear rarely. The authors of DeepWalk measure this empirically on several social networks and find: node visit frequency in random walks also follows a power law.

Why this matters: Word2Vec was designed and tuned to work on power-law distributed data. Its tricks (hierarchical softmax, negative sampling, subsampling of frequent words) all assume this distribution. Since node frequencies in random walks match this distribution, all of Word2Vec's machinery transfers directly. This isn't lucky — it's the theoretical reason why the approach works.

What a Power Law Means in Practice

A power law distribution P(x) ∝ x^-α has a "heavy tail." Most nodes appear rarely. A small number of nodes (high-degree hubs) appear constantly. This is exactly the social network phenomenon known as the Pareto principle or the "80-20 rule": 20% of nodes do 80% of the connecting.

For the skip-gram objective, this means: hub nodes accumulate enormous training signal because they appear in many walk windows. Peripheral nodes get sparse signal. Word2Vec handles this gracefully — it was designed for a similar imbalance in language.

Power Law: Node Visit Frequency

Simulate random walks on a graph and observe node visit frequencies. The log-log plot shows whether the distribution follows a power law (straight line). Click "Run Walks" to accumulate walk statistics.

Run walks to see frequency distribution emerge.

Why does the power-law distribution of node visit frequencies justify using Word2Vec on graph random walks?

Because power-law distributions are always easier to optimize than uniform distributions Because Word2Vec requires exactly equal frequency for every token Because Word2Vec's optimizations (hierarchical softmax, subsampling) were designed for power-law distributed data, which random walk frequencies match

Chapter 4: Algorithm Details

Let's be precise about what DeepWalk actually does. The full algorithm has three hyperparameters: γ (walks per node), t (walk length), and w (window size). Each node gets exactly γ random walks started from it, each of length t. Every walk feeds (center, context) pairs into the skip-gram objective.

Input

Graph G = (V, E), embedding dimension d, walks per node γ, walk length t, window w

↓

Initialize

Random embedding matrix Φ ∈ ℝ^{|V| × d} — one row per node

↓

Outer Loop

Repeat γ times (one pass per walk per node)

↓

Shuffle nodes

Random ordering O of all nodes V (so we don't always start from node 1)

↓

Inner Loop

For each v in O: generate walk W_v of length t from v; call SkipGram(Φ, W_v, w)

↓

Output

Embedding matrix Φ — row Φ[v] is node v's d-dimensional representation

Hierarchical Softmax

The raw softmax over all |V| nodes is expensive — O(|V|) per training step. DeepWalk uses hierarchical softmax, a trick from language modeling. Build a binary tree (a Huffman tree, where frequent nodes get shorter codes). The probability of a node is computed as a product of probabilities along its path from root to leaf — O(log |V|) per step instead of O(|V|).

Each internal tree node has a learnable parameter vector. For a leaf node u with root-to-leaf path b₀, b₁, ..., b_log|V|:

Pr(u | Φ[v]) = ∏_l=1^log|V| σ(⁻b_l · Φ[v] · Ψ[n_l])

where σ is sigmoid, b_l ∈ {±1} is the branching direction, and Ψ[n_l] is the parameter of internal node n_l. Frequent nodes (high degree) get short paths — fewer multiplications, faster training.

Streaming and parallelism: Because each walk is independent, DeepWalk is embarrassingly parallelizable. You can generate walks and run SkipGram updates across multiple CPU cores simultaneously. The paper also notes the algorithm is "online" — new nodes and edges can be added without retraining from scratch, just run new walks from the new node.

The Code

python
import numpy as np
import random
from collections import defaultdict

def random_walk(graph, start, length):
    """Single random walk from `start` of given `length`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph[walk[-1]])
        if not neighbors:
            break  # dead end
        walk.append(random.choice(neighbors))
    return walk

def generate_walks(graph, gamma, t):
    """Generate gamma walks of length t from every node."""
    walks = []
    nodes = list(graph.keys())
    for _ in range(gamma):
        random.shuffle(nodes)         # shuffle order each epoch
        for v in nodes:
            walks.append(random_walk(graph, v, t))
    return walks                      # list of lists of node IDs

def deepwalk(graph, d=128, gamma=10, t=40, w=5):
    """
    graph: dict mapping node -> list of neighbors
    d:     embedding dimension
    gamma: walks per node
    t:     walk length
    w:     skip-gram window size
    """
    walks = generate_walks(graph, gamma, t)

    # Convert walks to strings for gensim Word2Vec
    str_walks = [[str(n) for n in walk] for walk in walks]

    from gensim.models import Word2Vec
    model = Word2Vec(
        str_walks,
        vector_size=d,
        window=w,
        min_count=0,       # keep all nodes even if rare
        sg=1,              # sg=1 means skip-gram (not CBOW)
        hs=1,              # hs=1 means hierarchical softmax
        workers=4,
        epochs=1           # gensim iterates; we already loop gamma times
    )
    # Return embedding matrix indexed by node ID
    embeddings = {int(k): model.wv[k] for k in model.wv.key_to_index}
    return embeddings

Data flow: graph (adjacency list) → walks (list of node-ID sequences, shape [γ|V|, t]) → string walks (gensim expects strings) → Word2Vec training → embeddings (dict: node_id → ℝ^d). The downstream classifier then takes these d-dimensional vectors as features.

Hierarchical softmax speeds up DeepWalk training from O(|V|) to O(log |V|) per step. How does it achieve this?

By only training on a random subset of nodes each step By organizing nodes in a binary tree and computing probability as a product along the path from root to leaf By reducing the embedding dimension d at each training step

Chapter 5: Results — Multi-Label Node Classification

The paper evaluates on three large social networks with ground-truth community labels. The task is multi-label node classification: each node can belong to multiple communities simultaneously. The evaluation protocol is standard: train a one-vs-rest logistic regression on a fraction of labeled nodes, test on the rest.

Datasets

Dataset	Nodes	Edges	Labels	Label type
BlogCatalog	10,312	333,983	39	Blogger interests
Flickr	80,513	5,899,882	195	Photo interest groups
YouTube	1,138,499	2,990,443	47	Video genre groups

These are real, messy social networks. BlogCatalog is bloggers linking to each other; labels are the topics they write about. Flickr is photo-sharing users; labels are interest groups. YouTube is video creators; labels are genre communities. None of this structure is trivially decodable from degree statistics alone.

Baselines

DeepWalk is compared to several feature engineering baselines:

SpectralClustering: Uses eigenvectors of the graph Laplacian as features. This is the classic approach to "graph structure as features" — but computing eigenvectors is O(|V|³), infeasible at YouTube scale.
Modularity: Uses community detection to assign nodes to clusters, then encodes cluster membership.
EdgeCluster: Assigns edge labels via clustering, uses them as node features.
wvRN (weighted-vote Relational Neighbor): A label-propagation method that uses labeled neighbors directly — this is "cheating" in the sense that it needs labeled data; DeepWalk needs none.

Key Numbers

On BlogCatalog with 10% labeled nodes:

Method	Macro-F1	Micro-F1
SpectralClustering	14.05	17.22
Modularity	16.03	21.16
EdgeCluster	18.65	23.81
wvRN	18.76	24.17
DeepWalk	22.14	26.65

DeepWalk beats all baselines including the semi-supervised wvRN which uses the actual labels. With only 1% labeled data, the gap is even larger — DeepWalk doesn't need labels at training time. The unsupervised embeddings transfer to the classification task with minimal labeled examples needed.

Scaling win: SpectralClustering cannot run on YouTube (too expensive). DeepWalk can — and achieves 18.18 Micro-F1 where the next best computable baseline gets 14.19. This is not a marginal improvement; it's the only method that scales to millions of nodes.

Sensitivity to Training Fraction

A critical result: as labeled training data decreases from 90% to 1%, DeepWalk's relative advantage increases. With 90% labels, supervised methods can get close. With 1% labels, DeepWalk wins decisively. This is the characteristic signature of good unsupervised representation learning — it's most valuable when labeled data is scarce.

Performance vs. Labeled Data Fraction (BlogCatalog)

Reconstructed from Figure 3 of the paper. As labeled data decreases, DeepWalk's advantage grows. Hover (or tap) a curve to highlight it.

DeepWalk's relative advantage over baselines is largest when there is ___ labeled training data.

90% — with almost all nodes labeled, the embedding quality matters most Very little (1–10%) — unsupervised representations are most valuable when labeled examples are scarce Exactly 50% — DeepWalk is calibrated for balanced training sets

Chapter 6: Why This Paper Matters

DeepWalk was published at KDD 2014. By 2024 it had over 8,000 citations. To understand why, you need to understand the moment it arrived in.

In 2014, graph machine learning meant one of two things: (1) hand-engineer features from graph statistics and feed them to a classifier, or (2) use graph kernels that computed structural similarity directly but were computationally expensive and didn't generalize. Neither approach scaled, neither was learnable end-to-end, and neither transferred across tasks.

DeepWalk changed the paradigm. It said: don't engineer features, learn representations. This is the same shift that ImageNet + ConvNets caused for vision in 2012, and that Word2Vec caused for NLP in 2013. DeepWalk brought that shift to graphs.

The paradigm shift: Before DeepWalk: "graphs are special, require graph-specific algorithms." After DeepWalk: "graphs are sequences of nodes; NLP techniques apply directly." This opened the door to applying the entire deep learning ecosystem — transformers, contrastive learning, self-supervised pre-training — to graph-structured data.

What It Got Right

The paper's enduring contributions aren't just the algorithm — it's the framing. By showing that random walks produce power-law-distributed sequences, the authors gave a theoretical reason to believe NLP-style training would work on graphs. This principled justification made the approach more than a hack — it was a bridge between two fields.

DeepWalk also established the benchmark protocol that subsequent graph representation learning papers still use: multi-label classification on BlogCatalog, Flickr, and YouTube. Papers like LINE (2015), node2vec (2016), GraphSAGE (2017), and GCN (2017) all compare against DeepWalk as the baseline. You cannot read graph ML papers without understanding DeepWalk.

What It Got Wrong (Or Didn't Address)

Uniform random walks capture local co-occurrence structure but treat all neighbors equally. They can't distinguish between homophily (community membership) and structural equivalence (role similarity) — that's exactly the problem node2vec was designed to solve. DeepWalk also doesn't use node or edge features — it works purely from topology. GCN and GraphSAGE extend this to feature-rich graphs. DeepWalk is also transductive: you can't embed new nodes without running new walks through the entire training procedure.

The template DeepWalk established: (1) Sample node sequences from the graph. (2) Apply an NLP-style embedding model. (3) Use the learned embeddings for downstream tasks. Every graph embedding method from 2014 to 2018 follows this template. The differences are in step 1 (how you sample) and step 2 (which model you use).

Which aspect of DeepWalk most directly inspired node2vec (the next major paper in the field)?

The use of hierarchical softmax for efficient training The multi-label classification evaluation protocol The use of random walks as sentences — node2vec extends this with biased walks to control the type of structure captured

Chapter 7: Connections & What Came Next

DeepWalk sits at the root of a family tree of graph representation learning methods. Understanding where each successor improves on DeepWalk clarifies the field's development.

Method	Year	Key Change from DeepWalk	Strength
DeepWalk	2014	—	Baseline; scales to millions of nodes
LINE	2015	Replaces random walks with explicit 1st and 2nd order proximity objectives	Works on directed, weighted graphs
node2vec	2016	Biased walks with p,q parameters	Tunable: homophily vs structural equivalence
GraphSAGE	2017	Inductive: samples neighbors, aggregates features	Works on unseen nodes; uses node features
GCN	2017	Spectral graph convolution; end-to-end with labels	Best supervised performance on citation networks
GAT	2018	Attention over neighbors instead of uniform aggregation	Adaptive neighbor weighting
Graph Transformers	2020+	Full attention over all node pairs (with positional encodings)	Long-range dependencies; SOTA on many benchmarks

The Crucial Limitations DeepWalk Exposed

Transductive only
DeepWalk learns one embedding per node, stored in a lookup table. New nodes? No embedding. GraphSAGE solved this by learning an aggregation function instead of per-node vectors.

Topology only
Node features (age, text, category) are ignored. GCN and variants incorporate features by propagating them through the graph structure.

Uniform walks
All neighbors treated equally. node2vec adds biased walks; GAT adds learned attention weights over neighbors.

Undirected only
Random walks don't respect edge direction well. LINE explicitly models directed graphs with in-degree and out-degree embeddings.

The Connection to Self-Supervised Learning

DeepWalk is, in modern terms, a self-supervised pre-training method. It creates a proxy task (predict context nodes from center nodes) from unlabeled data, and the representations learned for the proxy task transfer to downstream tasks. This is exactly what BERT does for text and SimCLR does for images — both from 2019, five years later.

DeepWalk in production: Pinterest's PinSage (2018) adapts DeepWalk's random-walk idea to a bipartite pin-board graph with 3 billion nodes. It powers Pinterest's recommendation system for hundreds of millions of users. DeepWalk's core idea — random walks + skip-gram — scaled to the largest recommendation systems in the world.

Related Veanors Lessons

To go deeper:

node2vec — the direct extension with biased walks and p,q parameters

"If you want to understand graphs, walk on them." — DeepWalk's implicit philosophy, now the foundation of graph deep learning.

What is the key limitation of DeepWalk that GraphSAGE was specifically designed to fix?

DeepWalk cannot handle multi-label classification DeepWalk uses too much memory for the skip-gram window DeepWalk is transductive — it cannot generate embeddings for nodes not seen during training