Tang, Qu, Wang, Zhang, Yan, Mei (Microsoft) — WWW 2015

LINE: Large-scale
Network Embedding

Networks have two kinds of similarity: direct connections (first-order) and shared neighborhoods (second-order). LINE preserves both simultaneously and scales to millions of nodes via edge sampling — making it one of the first network embedding methods that actually runs on social networks at internet scale.

Prerequisites: graph basics + probability
8
Chapters
4+
Simulations

Chapter 0: The Problem

You run a social network with 100 million users. You want to recommend friends, detect communities, and classify users into categories — all tasks that benefit from dense vector representations (embeddings) of each user. But how do you define "similarity" in a network?

Consider three users on Twitter: Alice and Bob follow each other. Bob and Carol never interact, but they both follow the same 200 accounts — same news sources, same celebrities, same communities. Are Alice-Bob more similar or Bob-Carol?

The answer depends on your task. For friend recommendation: Alice-Bob (direct connection). For interest prediction: possibly Bob-Carol (same consumption pattern). A useful embedding method should capture both types of similarity. Yet prior methods like DeepWalk only ran random walks — they implicitly captured some of both but couldn't control the balance, and they didn't scale beyond a few million nodes.

Scale as a first-class constraint: LINE was designed for a specific graph — Microsoft's LinkedIn network with 5 million nodes and 30 million edges. At that scale, methods that require full adjacency matrix operations or eigendecompositions fail. LINE's design decisions (edge sampling, negative sampling, SGD) all flow from this scale requirement.
What are the two kinds of similarity that LINE aims to preserve in network embeddings?

Chapter 1: First-Order Proximity

First-order proximity is the simplest notion of similarity: two nodes are similar if there is a direct edge between them. If Alice and Bob are connected, their embeddings should be close in vector space. Nodes with no edge between them can be anywhere.

LINE defines a joint probability distribution over observed pairs and a predicted distribution from the embeddings. The first-order objective minimizes the KL divergence between them:

p1(i,j) = 11 + exp(−uiT · uj)

This is just the sigmoid function applied to the dot product of the two embedding vectors. The predicted probability that node i and node j are connected is high when their embeddings point in the same direction (positive dot product) and low when they don't.

The empirical distribution is:

&hat;p1(i,j) = wij / W

Where wij is the edge weight (1 for unweighted graphs) and W = Σ wij over all edges. The objective minimizes KL divergence: O₁ = −Σ wij log p₁(i,j).

Why KL divergence? We want the predicted joint distribution p₁(i,j) to match the empirical distribution p̂₁(i,j). KL(p̂ || p) measures how well p approximates p̂. Minimizing it means: for edge (i,j) with weight wij, make the embeddings ui and uj similar enough that the sigmoid of their dot product equals the empirical edge weight proportion.
First-Order: Edge Weights → Embedding Proximity

Edge thickness = weight. First-order proximity pulls directly-connected nodes together.

In LINE's first-order objective, what does a high predicted probability p₁(i,j) require?

Chapter 2: Second-Order Proximity

Second-order proximity captures shared context: two nodes are similar if they connect to similar neighbors. Bob and Carol might never interact, but if they both follow the same 200 accounts, they're contextually similar. This is exactly the distributional hypothesis from NLP — words that appear in similar contexts have similar meanings.

LINE adopts the Word2Vec skip-gram intuition: for each node v, predict its neighbors' identities from its embedding. Specifically, define two embedding vectors per node: a vertex embedding uv (v as a source) and a context embedding u'v (v as a neighbor of others).

p2(j | i) = exp(u'jT · ui) / ∑k exp(u'kT · ui)

This is a softmax over all nodes. Given node i's embedding ui, the probability that node j is a neighbor equals how well ui predicts u'j relative to all other context embeddings. The empirical distribution is again the edge-weight proportion: p̂₂(j|i) = wij / di, where di is the out-degree (or weighted sum) of node i.

Two embedding vectors per node: Every node has a vertex embedding (when it's the "source" making predictions) and a context embedding (when it's the "target" being predicted). Only the vertex embeddings are kept after training. The context embeddings are auxiliary — they help train the vertex embeddings but are discarded. This is identical to how Word2Vec uses center vs context word vectors.
Second-Order: Shared Neighbors → Similar Embeddings

Nodes B and C never connect, but share neighbors (shown orange). Second-order proximity pulls them together. Click to highlight shared context.

Why does LINE maintain two separate embedding vectors per node for the second-order objective?

Chapter 3: Combined Objective

LINE trains two separate models — one optimizing the first-order objective, one optimizing the second-order objective — then concatenates the resulting embeddings. If each model produces d/2-dimensional embeddings, the final embedding is d-dimensional.

Why not jointly optimize both? The authors tried joint optimization and found it didn't outperform separate training with concatenation. The two objectives capture orthogonal information, so keeping them separate avoids interference. Concatenation preserves both signals in a way that a weighted sum of the two losses might not.

The separate-then-concatenate approach also has a practical benefit: you can train each model independently, possibly on different hardware, and merge the results. This is important at scale — if you have 128 machines, you can assign 64 to first-order and 64 to second-order training in parallel.

Raw Graph
Nodes + weighted edges
↓ parallel training
LINE(1st)
d/2-dim embedding
captures direct connections
LINE(2nd)
d/2-dim embedding
captures shared context
↓ concatenate
LINE embedding
d-dim: both proximity types preserved

The paper uses d=128 total (64+64). In ablation studies, LINE(1st)+LINE(2nd) concatenated outperforms either alone by significant margins on node classification tasks, confirming that the two objectives capture genuinely complementary information.

python
import numpy as np

# After training two separate LINE models:
emb_first  = train_line_first_order(G, dim=64)   # [N, 64]
emb_second = train_line_second_order(G, dim=64)  # [N, 64]

# Final embedding: concatenate
emb_line = np.concatenate([emb_first, emb_second], axis=1)  # [N, 128]

# Normalize (optional but common)
emb_line = emb_line / np.linalg.norm(emb_line, axis=1, keepdims=True)
How does LINE combine the first-order and second-order objectives in the final embedding?

Chapter 4: Negative Sampling — Showcase

The second-order softmax has a fatal flaw: the denominator sums over all nodes. For a 5-million-node graph, each gradient step requires computing e^(u'_k · u_i) for all 5 million k values. This is O(|V|) per update — completely intractable.

Negative sampling replaces the full softmax with a simpler binary classification. Instead of predicting which node is a neighbor, learn to distinguish real neighbors from random "noise" nodes:

O2 = −log σ(u'jT ui) − ∑n=1K En~Pn[log σ(−u'nT ui)]

The first term says: for the real neighbor j, maximize the sigmoid of u'_j · u_i (make real pairs score high). The second term says: for K randomly sampled "noise" nodes n, maximize the sigmoid of −u'_n · u_i (make fake pairs score low). Together, this is binary cross-entropy between real edges (label=1) and sampled fake edges (label=0).

With K=5 noise samples per real edge, you've replaced an O(|V|) denominator with O(K) = O(5) work. The gradient computation becomes thousands of times faster, enabling training on billion-edge graphs.

Noise distribution: LINE samples negative nodes proportional to degree^(3/4). This is the same power-law trick as Word2Vec's negative sampling. It down-weights very high-degree nodes (they'd be sampled too often if proportional to raw degree) while still preferring common nodes over rare ones. Empirically, 3/4 outperforms uniform sampling and degree-proportional sampling.
Negative Sampling — Interactive Showcase

Simulate one gradient step. Green = real neighbor (push embedding close). Red = negative samples (push embedding apart). Adjust K to see the tradeoff.

K negatives 5
What problem does negative sampling solve in LINE's second-order objective?

Chapter 5: Scalability

LINE is designed around edge sampling: instead of iterating over nodes and their neighbors, iterate over edges. Sample an edge (i,j) proportional to its weight, apply a gradient update. This is a crucial distinction for weighted graphs where high-weight edges should contribute more updates.

For unweighted graphs, edge sampling is equivalent to uniform edge selection. But for weighted graphs (like trust networks where edge weight = interaction frequency), sampling proportional to weight wij naturally gives more training signal to strong connections. The expected number of times edge (i,j) is sampled equals its weight proportion — which matches what the objective is trying to optimize.

Alias sampling in O(1): To sample edges proportional to weights without O(|E|) search, LINE uses the alias method — a classic trick that preprocesses the weight distribution into a two-column table. Each draw costs exactly 2 random numbers + 1 table lookup: O(1) per sample after O(|E|) preprocessing. This is essential for high-throughput mini-batch training.

The complete training loop for LINE(2nd) is:

python
for step in range(total_steps):
    # 1. Sample an edge proportional to weight (alias method: O(1))
    i, j = alias_sample(edges, weights)

    # 2. Sample K negative nodes proportional to degree^(3/4)
    negatives = alias_sample(nodes, degree_pow_075, K)

    # 3. Compute gradient of binary cross-entropy loss
    grad = compute_neg_sampling_gradient(u[i], u_ctx[j], u_ctx[negatives])

    # 4. Update only the involved vectors (sparse update)
    u[i] -= lr * grad.vertex
    u_ctx[j] -= lr * grad.context_pos
    for n in negatives:
        u_ctx[n] -= lr * grad.context_neg[n]

Notice step 4: only 1 vertex + (K+1) context vectors are updated per step. No matrix multiply, no batch normalization, no full forward pass. This sparse update is why LINE can train on a single machine with 50 threads and embed 5 million nodes in a few hours.

Why does LINE sample edges proportional to their weights rather than sampling nodes uniformly?

Chapter 6: Results

LINE is evaluated on three tasks across very different graph types: multi-label classification on social networks (Flickr, YouTube), sentiment classification on a word-word co-occurrence network (DBLP), and link prediction. The graphs range from 7K to 1.7M nodes.

MethodFlickr (Macro-F1)YouTube (Macro-F1)DBLP (Macro-F1)
LINE(1st)26.3%29.1%49.7%
LINE(2nd)31.6%30.8%56.1%
LINE(1st+2nd)35.8%34.4%58.4%
DeepWalk27.2%26.7%43.2%
Graph Factorization17.7%13.0%52.3%

LINE(1st+2nd) outperforms all baselines on all datasets. Second-order is consistently stronger than first-order alone — shared neighborhood captures richer signal than direct connection for classification tasks. The combination dominates both.

DeepWalk comparison: LINE consistently outperforms DeepWalk by 4-8 percentage points on social network classification. More practically: DeepWalk on the 1.7M-node YouTube graph required running on a cluster. LINE ran on a single 50-thread server in comparable time. This efficiency gap was the industrial motivation.

On link prediction, LINE achieves AUC of 91.4% on directed networks where edge direction matters — significantly outperforming methods that ignore direction. LINE handles directed graphs naturally by using separate source and target embeddings, while random-walk methods implicitly treat graphs as undirected.

In LINE's experiments, which consistently outperforms: LINE(1st), LINE(2nd), or their combination?

Chapter 7: Connections

LINE established the vocabulary of first- and second-order proximity that subsequent methods built on. Understanding where it fits helps locate what each method improved.

MethodWhat it capturesScalabilityKey limitation
DeepWalk (2014)Random walk contexts (≈2nd order)MediumImplicit, uncontrolled tradeoff
LINE (2015)Explicit 1st + 2nd orderHigh (edge sampling)Shallow (2 hops max)
node2vec (2016)Biased random walks (BFS+DFS)MediumWalk hyperparameters sensitive
GraphSAGE (2017)K-hop neighborhood featuresHigh (mini-batch)Requires node features
SDNE (2016)Deep autoencoder of adjacency rowsLowFull adjacency matrix needed

LINE's biggest limitation is that it's shallow: first-order looks 1 hop away, second-order looks at the distribution over 1-hop neighbors. It never captures 3-hop or longer patterns. Methods like node2vec and GraphSAGE extend this to longer ranges, but at higher cost.

LINE also doesn't use node features — only graph structure. For networks where nodes have rich attributes (text descriptions, images, metadata), GraphSAGE and GAT can leverage features directly in the aggregation. LINE treats all nodes as featureless ids.

LINE's lasting contribution: The explicit decomposition into first-order and second-order proximity wasn't just an algorithm — it was a conceptual framework. It clarified what "node similarity" means and gave practitioners a vocabulary to reason about what their embedding method preserves. Every subsequent paper now specifies which order of proximity it targets.

Related lessons

  • GraphSAGE — Inductive GNN
  • LightGCN — GCN for recommendation
  • NGCF — Neural collaborative filtering

Key papers

  • Tang et al., WWW 2015 (LINE)
  • Perozzi et al., KDD 2014 (DeepWalk)
  • Grover & Leskovec, KDD 2016 (node2vec)
"The key to network embedding is to preserve the network structure and properties in a low-dimensional space."
— Tang et al. (2015)