Networks have two kinds of similarity: direct connections (first-order) and shared neighborhoods (second-order). LINE preserves both simultaneously and scales to millions of nodes via edge sampling — making it one of the first network embedding methods that actually runs on social networks at internet scale.
You run a social network with 100 million users. You want to recommend friends, detect communities, and classify users into categories — all tasks that benefit from dense vector representations (embeddings) of each user. But how do you define "similarity" in a network?
Consider three users on Twitter: Alice and Bob follow each other. Bob and Carol never interact, but they both follow the same 200 accounts — same news sources, same celebrities, same communities. Are Alice-Bob more similar or Bob-Carol?
The answer depends on your task. For friend recommendation: Alice-Bob (direct connection). For interest prediction: possibly Bob-Carol (same consumption pattern). A useful embedding method should capture both types of similarity. Yet prior methods like DeepWalk only ran random walks — they implicitly captured some of both but couldn't control the balance, and they didn't scale beyond a few million nodes.
First-order proximity is the simplest notion of similarity: two nodes are similar if there is a direct edge between them. If Alice and Bob are connected, their embeddings should be close in vector space. Nodes with no edge between them can be anywhere.
LINE defines a joint probability distribution over observed pairs and a predicted distribution from the embeddings. The first-order objective minimizes the KL divergence between them:
This is just the sigmoid function applied to the dot product of the two embedding vectors. The predicted probability that node i and node j are connected is high when their embeddings point in the same direction (positive dot product) and low when they don't.
The empirical distribution is:
Where wij is the edge weight (1 for unweighted graphs) and W = Σ wij over all edges. The objective minimizes KL divergence: O₁ = −Σ wij log p₁(i,j).
Edge thickness = weight. First-order proximity pulls directly-connected nodes together.
Second-order proximity captures shared context: two nodes are similar if they connect to similar neighbors. Bob and Carol might never interact, but if they both follow the same 200 accounts, they're contextually similar. This is exactly the distributional hypothesis from NLP — words that appear in similar contexts have similar meanings.
LINE adopts the Word2Vec skip-gram intuition: for each node v, predict its neighbors' identities from its embedding. Specifically, define two embedding vectors per node: a vertex embedding uv (v as a source) and a context embedding u'v (v as a neighbor of others).
This is a softmax over all nodes. Given node i's embedding ui, the probability that node j is a neighbor equals how well ui predicts u'j relative to all other context embeddings. The empirical distribution is again the edge-weight proportion: p̂₂(j|i) = wij / di, where di is the out-degree (or weighted sum) of node i.
Nodes B and C never connect, but share neighbors (shown orange). Second-order proximity pulls them together. Click to highlight shared context.
LINE trains two separate models — one optimizing the first-order objective, one optimizing the second-order objective — then concatenates the resulting embeddings. If each model produces d/2-dimensional embeddings, the final embedding is d-dimensional.
The separate-then-concatenate approach also has a practical benefit: you can train each model independently, possibly on different hardware, and merge the results. This is important at scale — if you have 128 machines, you can assign 64 to first-order and 64 to second-order training in parallel.
The paper uses d=128 total (64+64). In ablation studies, LINE(1st)+LINE(2nd) concatenated outperforms either alone by significant margins on node classification tasks, confirming that the two objectives capture genuinely complementary information.
python import numpy as np # After training two separate LINE models: emb_first = train_line_first_order(G, dim=64) # [N, 64] emb_second = train_line_second_order(G, dim=64) # [N, 64] # Final embedding: concatenate emb_line = np.concatenate([emb_first, emb_second], axis=1) # [N, 128] # Normalize (optional but common) emb_line = emb_line / np.linalg.norm(emb_line, axis=1, keepdims=True)
The second-order softmax has a fatal flaw: the denominator sums over all nodes. For a 5-million-node graph, each gradient step requires computing e^(u'_k · u_i) for all 5 million k values. This is O(|V|) per update — completely intractable.
Negative sampling replaces the full softmax with a simpler binary classification. Instead of predicting which node is a neighbor, learn to distinguish real neighbors from random "noise" nodes:
The first term says: for the real neighbor j, maximize the sigmoid of u'_j · u_i (make real pairs score high). The second term says: for K randomly sampled "noise" nodes n, maximize the sigmoid of −u'_n · u_i (make fake pairs score low). Together, this is binary cross-entropy between real edges (label=1) and sampled fake edges (label=0).
With K=5 noise samples per real edge, you've replaced an O(|V|) denominator with O(K) = O(5) work. The gradient computation becomes thousands of times faster, enabling training on billion-edge graphs.
Simulate one gradient step. Green = real neighbor (push embedding close). Red = negative samples (push embedding apart). Adjust K to see the tradeoff.
LINE is designed around edge sampling: instead of iterating over nodes and their neighbors, iterate over edges. Sample an edge (i,j) proportional to its weight, apply a gradient update. This is a crucial distinction for weighted graphs where high-weight edges should contribute more updates.
For unweighted graphs, edge sampling is equivalent to uniform edge selection. But for weighted graphs (like trust networks where edge weight = interaction frequency), sampling proportional to weight wij naturally gives more training signal to strong connections. The expected number of times edge (i,j) is sampled equals its weight proportion — which matches what the objective is trying to optimize.
The complete training loop for LINE(2nd) is:
python for step in range(total_steps): # 1. Sample an edge proportional to weight (alias method: O(1)) i, j = alias_sample(edges, weights) # 2. Sample K negative nodes proportional to degree^(3/4) negatives = alias_sample(nodes, degree_pow_075, K) # 3. Compute gradient of binary cross-entropy loss grad = compute_neg_sampling_gradient(u[i], u_ctx[j], u_ctx[negatives]) # 4. Update only the involved vectors (sparse update) u[i] -= lr * grad.vertex u_ctx[j] -= lr * grad.context_pos for n in negatives: u_ctx[n] -= lr * grad.context_neg[n]
Notice step 4: only 1 vertex + (K+1) context vectors are updated per step. No matrix multiply, no batch normalization, no full forward pass. This sparse update is why LINE can train on a single machine with 50 threads and embed 5 million nodes in a few hours.
LINE is evaluated on three tasks across very different graph types: multi-label classification on social networks (Flickr, YouTube), sentiment classification on a word-word co-occurrence network (DBLP), and link prediction. The graphs range from 7K to 1.7M nodes.
| Method | Flickr (Macro-F1) | YouTube (Macro-F1) | DBLP (Macro-F1) |
|---|---|---|---|
| LINE(1st) | 26.3% | 29.1% | 49.7% |
| LINE(2nd) | 31.6% | 30.8% | 56.1% |
| LINE(1st+2nd) | 35.8% | 34.4% | 58.4% |
| DeepWalk | 27.2% | 26.7% | 43.2% |
| Graph Factorization | 17.7% | 13.0% | 52.3% |
LINE(1st+2nd) outperforms all baselines on all datasets. Second-order is consistently stronger than first-order alone — shared neighborhood captures richer signal than direct connection for classification tasks. The combination dominates both.
On link prediction, LINE achieves AUC of 91.4% on directed networks where edge direction matters — significantly outperforming methods that ignore direction. LINE handles directed graphs naturally by using separate source and target embeddings, while random-walk methods implicitly treat graphs as undirected.
LINE established the vocabulary of first- and second-order proximity that subsequent methods built on. Understanding where it fits helps locate what each method improved.
| Method | What it captures | Scalability | Key limitation |
|---|---|---|---|
| DeepWalk (2014) | Random walk contexts (≈2nd order) | Medium | Implicit, uncontrolled tradeoff |
| LINE (2015) | Explicit 1st + 2nd order | High (edge sampling) | Shallow (2 hops max) |
| node2vec (2016) | Biased random walks (BFS+DFS) | Medium | Walk hyperparameters sensitive |
| GraphSAGE (2017) | K-hop neighborhood features | High (mini-batch) | Requires node features |
| SDNE (2016) | Deep autoencoder of adjacency rows | Low | Full adjacency matrix needed |
LINE's biggest limitation is that it's shallow: first-order looks 1 hop away, second-order looks at the distribution over 1-hop neighbors. It never captures 3-hop or longer patterns. Methods like node2vec and GraphSAGE extend this to longer ranges, but at higher cost.
LINE also doesn't use node features — only graph structure. For networks where nodes have rich attributes (text descriptions, images, metadata), GraphSAGE and GAT can leverage features directly in the aggregation. LINE treats all nodes as featureless ids.
Related lessons
Key papers
"The key to network embedding is to preserve the network structure and properties in a low-dimensional space."
— Tang et al. (2015)