Improving Distributional Similarity

Chapter 0: The Debate

By 2013, word embeddings had taken NLP by storm. Word2Vec's Skip-gram model produced vectors where king − man + woman ≈ queen. It seemed like magic, and the prevailing narrative was clear: neural methods have fundamentally surpassed traditional count-based approaches.

But Levy, Goldberg, and Dagan asked an uncomfortable question: is it really the neural architecture that's responsible for these gains, or is it something else entirely?

The traditional pipeline was simple. You count how often words co-occur in a corpus, build a big matrix, maybe apply some weighting scheme like Pointwise Mutual Information (PMI), and reduce dimensions with Singular Value Decomposition (SVD). This approach dominated for decades. Then Word2Vec arrived, and everyone assumed the improvement came from the prediction-based neural training procedure.

The authors noticed something suspicious. Word2Vec didn't just change the algorithm — it also introduced a suite of clever engineering tricks: dynamic context windows, subsampling of frequent words, negative sampling with a smoothed unigram distribution. These tricks were baked into the implementation and never tested independently.

Consider the timeline. In 2003, Bengio published neural language models. In 2008, Collobert and Weston showed learned word vectors could help NLP tasks. In 2013, Mikolov et al. released Word2Vec with Skip-gram and CBOW. The prevailing story was a clean arc: count → predict, and the predict paradigm won.

But the comparisons were never fair. Word2Vec's implementation included: (a) subsampling frequent words with probability √(t/f(w)), (b) using a dynamic context window sampled from 1 to L rather than a fixed L, (c) raising the noise distribution to the 0.75 power, (d) using a specific number of negative samples as an implicit threshold. Traditional count-based papers used none of these tricks. They were comparing an optimized race car to a stock sedan.

The central question: If you take a plain old co-occurrence matrix + SVD pipeline and give it the same engineering tricks that Word2Vec uses, does it match neural embeddings? Spoiler: yes. The "neural revolution" in word similarity was mostly a hyperparameter revolution.

The paper systematically transfers each trick from Word2Vec to count-based methods and measures the impact. The conclusion is stark: there is no consistent winner among the algorithms. The hyperparameters matter far more than whether you use neural prediction or matrix factorization.

The experimental design

What makes this paper rigorous is its systematic factorial experimental design. Instead of testing each method with its "default" settings (which would conflate algorithm choice with configuration choice), the authors test every combination of:

Factor	Values tested	Combinations
Algorithm	PPMI, SVD, SGNS, GloVe	4
Subsampling	None, t=10⁻⁵	2
Context window	Fixed-2, Fixed-5, Dynamic-2, Dynamic-5	4
CDS (α)	1.0 (none), 0.75	2
Dimensions	100, 200, 300, 500	4
Vector type	W only, W + C	2

That's 4 × 2 × 4 × 2 × 4 × 2 = 512 configurations per benchmark. This comprehensive grid allows the paper to isolate the effect of each factor while averaging over all others. When they say "subsampling helps more than algorithm choice," this is backed by 512 data points, not a cherry-picked comparison of 2 configurations.

The analysis uses ANOVA-style decomposition of the total variance. For each benchmark, the paper computes: how much variance is explained by the algorithm factor alone? How much by subsampling? By window type? By CDS? The answer: algorithm choice explains about 10–15% of variance, while subsampling explains 25–35%. Combined, the hyperparameter factors explain 4–5x more variance than the algorithm factor.

The scale of this experiment was unprecedented in the word embedding literature. Most papers compare 2–3 configurations. Levy et al. compare 512. This thoroughness is what makes the conclusions so convincing.

How different comparisons led to different conclusions

Before this paper, several studies compared count-based and prediction-based methods and reached different conclusions. The reason: they used different defaults. Here's a reconstruction of how this happened:

Study	Count-based config	Neural config	Winner	Why
Baroni et al., 2014	Standard PPMI+SVD, no subsampling	Word2Vec with all tricks	Neural	Unfair: tricks not transferred
Pennington et al., 2014	W2V without W+C	GloVe with W+C	GloVe	Unfair: W+C not transferred
Levy et al., 2015	PPMI+SVD with ALL tricks	W2V with ALL tricks	Tie	Fair: all confounders controlled

Each prior study reached a different conclusion because it compared different things. Levy et al. resolved the contradiction by controlling everything. The lesson: when two papers disagree about which method is better, the first question to ask is "are they comparing the same configurations?"

Baroni et al. (2014) revisited: Baroni et al. published a well-cited paper titled "Don't count, predict!" arguing that prediction-based methods systematically outperform count-based ones. Levy et al. directly challenged this conclusion by showing the gap disappears when hyperparameters are equalized. The title of Levy et al.'s paper — "Improving Distributional Similarity with Lessons Learned from Word Embeddings" — is itself a subtle rebuttal: the "lessons" are engineering tricks, not algorithmic insights.

Count-Based vs. Prediction-Based: The Narrative

Toggle to see the prevailing narrative vs. what Levy et al. actually found. Click a method to see its components.

What did the NLP community widely believe about the superiority of Word2Vec over count-based methods before this paper?

Count-based methods used a different corpus The neural prediction-based architecture was the source of improvement, when in fact engineering tricks (subsampling, negative sampling distribution, etc.) were the main contributors Word2Vec was slower but more accurate

Chapter 1: Count-Based Methods

Before we can compare methods fairly, we need to understand the count-based pipeline in detail. It starts with the simplest possible idea: you shall know a word by the company it keeps (Firth, 1957).

Step 1: The co-occurrence matrix

Given a corpus, we build a matrix M where M_ij = how many times word i appears near word j within a context window of size L. "Near" means within L words to the left or right.

For a vocabulary of V words, this gives us a V × V matrix. Each row is a V-dimensional vector representing one word. Words that appear in similar contexts will have similar rows.

The distributional hypothesis: Words with similar meanings tend to appear in similar linguistic contexts. "Cat sat on the mat" and "dog sat on the mat" tell us that cat and dog are semantically related — they share context words like "sat," "on," "the," "mat."

Step 2: Raw counts are terrible

Raw co-occurrence counts have a fatal flaw: frequent words dominate everything. The word "the" co-occurs with practically every word in the vocabulary. If you compute cosine similarity between raw count vectors, every word looks similar because they all co-occur with "the," "a," "is," etc.

The solution: Pointwise Mutual Information (PMI). Instead of asking "how often do i and j co-occur?", we ask "how much more do i and j co-occur than we'd expect if they were independent?"

PMI(i, j) = log( P(i, j) / (P(i) · P(j)) )

Where P(i, j) is the probability of seeing i and j together, and P(i) · P(j) is what we'd expect by chance. If PMI is positive, the words co-occur more than chance. If negative, they avoid each other.

Step 3: Dimension reduction

The PMI matrix is V × V, which for a 100K vocabulary means 10 billion entries. Most are zero (sparse), but it's still unwieldy. We use SVD to compress it to V × d where d might be 300. This captures the most important structure in a dense, low-dimensional representation.

The complete traditional pipeline

1. Corpus

Raw text: billions of words from Wikipedia, news, books

↓

2. Co-occurrence Matrix

M_ij = count of word i near word j (window size L)

↓

3. PMI Transform

PPMI(i,j) = max(0, log(M_ij · D / (M_i* · M_*j)))

↓

4. SVD

PPMI ≈ U_d Σ_d V_d^T → word vectors = U_d Σ_d^0.5

This pipeline was the state of the art for over a decade. Latent Semantic Analysis (Deerwester et al., 1990) used a term-document matrix variant. Hyperspace Analogue to Language (Lund & Burgess, 1996) used word-word co-occurrence. The specific combination of PPMI + truncated SVD was explored by Bullinaria and Levy (2007, 2012) and Turney and Pantel (2010).

The pipeline has two major advantages: it's interpretable (you know exactly what the matrix entries mean) and it's reproducible (SVD is deterministic given the same input). Its disadvantage, pre-Levy-et-al., was supposed to be performance. But that disadvantage turns out to be a myth.

The Count-Based Pipeline

Click each stage to highlight it and see its input/output shapes for a real corpus (V = 100K vocabulary, d = 300 embedding dimensions).

Building a Co-occurrence Matrix

Watch how a context window slides over a sentence to build co-occurrence counts. Adjust the window size to see the effect.

Window size2

python
import numpy as np
from collections import Counter, defaultdict

def build_cooccurrence(corpus, vocab, window=5, weighted=False):
    """Build V x V co-occurrence matrix from tokenized corpus."""
    word2idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    M = np.zeros((V, V))

    for i, word in enumerate(corpus):
        if word not in word2idx: continue
        wi = word2idx[word]
        for d in range(1, window + 1):
            if i + d < len(corpus) and corpus[i + d] in word2idx:
                ci = word2idx[corpus[i + d]]
                # Dynamic window: weight = (L - d + 1) / L
                weight = (window - d + 1) / window if weighted else 1.0
                M[wi, ci] += weight
                M[ci, wi] += weight
    return M

def subsample(corpus, freqs, threshold=1e-5):
    """Subsample frequent words (Word2Vec's preprocessing trick)."""
    total = sum(freqs.values())
    return [
        w for w in corpus
        if np.random.random() < min(1.0, np.sqrt(threshold / (freqs[w] / total)))
    ]

Why are raw co-occurrence counts a poor basis for word similarity?

Frequent function words like "the" and "is" co-occur with everything, making all words look similar — PMI corrects this by measuring co-occurrence relative to chance The matrix is too small Raw counts only work for verbs, not nouns

Chapter 2: The PPMI Matrix

PMI has a problem: when two words never co-occur, PMI = log(0) = −∞. And even for rare but genuine co-occurrences, PMI gives unreliably high values (two rare words that co-occur once get a very high PMI just because their marginal probabilities are tiny).

Positive PMI (PPMI)

The standard fix: clamp negative values to zero.

PPMI(i, j) = max(0, PMI(i, j))

This is called Positive PMI or PPMI. It says: "only keep pairs that co-occur more than expected. If they co-occur less than expected or not at all, just record zero." This turns out to work very well in practice.

Why PPMI works better than PMI

Negative PMI values tell you "these words co-occur less than expected." In principle, that's information. But in practice, it's noise. The co-occurrence matrix is extremely sparse — most word pairs never co-occur at all, not because they avoid each other, but because we haven't seen enough data. Setting negative PMI to zero acknowledges this sparsity problem and focuses on the positive signal.

There's a deeper reason too. When we use these vectors for cosine similarity, negative entries create a problem: they can make unrelated words appear similar by giving them both negative values in the same dimensions. Clamping to zero avoids this artifact.

The full derivation

Let's be precise. Given a corpus with co-occurrence counts #(w, c) for word w and context c within a window of size L:

PMI(w, c) = log( #(w, c) · D / (#(w) · #(c)) )

Where D = Σ_w,c #(w, c) is the total number of (word, context) pairs, and #(w) = Σ_c #(w, c) is the marginal count of word w. In words: we compare the observed joint frequency to what we'd expect under independence.

Let's work through a concrete example. Suppose our corpus has D = 1,000,000 word-context pairs. The word "cat" appears in 500 pairs total, "mat" appears in 200 pairs total, and "cat" and "mat" co-occur 10 times.

PMI(cat, mat) = log(10 · 1000000 / (500 · 200)) = log(100) = 4.6

That's a strong positive association. Now consider "cat" and "the" (a function word). If #(the) = 50,000 and #(cat, the) = 25:

PMI(cat, the) = log(25 · 1000000 / (500 · 50000)) = log(1.0) = 0.0

"Cat" and "the" co-occur exactly as much as expected by chance — PMI is zero. If we used raw counts, "the" would dominate the representation of "cat." PMI correctly identifies this as uninformative.

Shifted PPMI: the role of negative samples

Word2Vec with k negative samples introduces an implicit threshold. Words whose PMI is below log(k) get treated as noise. The count-based equivalent is Shifted PPMI:

SPPMI(w, c) = max(0, PMI(w, c) − log(k))

With k = 5 negative samples (a common default), the shift is log(5) ≈ 1.6. Only associations with PMI above 1.6 survive. This is more aggressive than plain PPMI (which only removes negative PMI) and helps eliminate weak, noisy associations.

Context distribution smoothing

One of the key insights from Levy et al. is that Word2Vec's negative sampling implicitly modifies the PMI computation. Specifically, negative sampling with the smoothed unigram distribution P(c)^0.75 is equivalent to computing a modified PMI:

PMI_k,α(w, c) = log( #(w, c) · D_α / (#(w) · #(c)^α) )

where α = 0.75 is the smoothing exponent. This raises rare context words' probabilities, making them more likely to be sampled as negatives. The effect: rare-word associations get penalized less, which helps significantly.

PMI vs. PPMI: Effect on Word Pairs

See how PMI and PPMI scores change for different word pairs. Adjust the smoothing exponent α to see how context distribution smoothing (CDS) affects scores.

α (CDS)0.75

The equivalence insight: Levy and Goldberg (2014, in a prior paper) proved that Skip-gram with negative sampling implicitly factorizes a shifted PMI matrix: W · C^T = PMI − log(k), where k is the number of negative samples. This means Word2Vec isn't doing something fundamentally different from count-based methods — it's just factorizing a particular variant of the PMI matrix.

Working through a concrete PPMI calculation

Let's compute PPMI for the sentence "the cat sat on the mat the dog sat on the rug" with a window of 2. First, the counts:

	the	cat	sat	on	mat	dog	rug
the	0	2	1	2	2	2	2
cat	2	0	1	0	0	0	0
sat	1	1	0	2	0	1	0

Total co-occurrences D = sum of all entries. For "cat" and "sat": #(cat,sat) = 1, #(cat) = 3 (total count for cat row), #(sat) = 5 (total for sat column). PMI(cat, sat) = log(1 · D / (3 · 5)). With D ≈ 30: PMI ≈ log(30/(15)) = log(2) = 0.69. Positive — they co-occur more than chance.

For "the" and "cat": #(the,cat) = 2, #(the) = 11, #(cat) = 3. PMI = log(2 · 30 / (11 · 3)) = log(60/33) = log(1.82) = 0.60. Also positive, but this high value for "the" is misleading — it's driven by "the" being very frequent. With CDS (α=0.75), the effective count for "the" would be raised less, reducing this spurious association.

What does context distribution smoothing (α = 0.75) do, and why does it help?

It raises the effective probability of rare context words, penalizing rare-word associations less, which prevents PMI from giving unreliably high scores to rare co-occurrences It removes all stop words from the matrix It doubles the vocabulary size

Chapter 3: SVD Factorization

We have a sparse PPMI matrix of size V × V. We need to compress it to V × d where d ≈ 300. Truncated Singular Value Decomposition is the classic tool for this.

How SVD works

SVD decomposes any matrix M into three matrices:

M = U · Σ · V^T

Where U is V × V (left singular vectors), Σ is a diagonal matrix of singular values (sorted in decreasing order), and V^T is V × V (right singular vectors).

Truncated SVD keeps only the top d singular values and their associated vectors:

M ≈ U_d · Σ_d · V_d^T

This is the best rank-d approximation of M in the Frobenius norm sense. No other rank-d matrix gets closer to M.

What to use as word vectors?

Here's a subtle but important choice. After SVD, we have U_d and Σ_d. The word vectors could be:

Choice	Formula	Effect
W = U_d	Just the left singular vectors	All dimensions weighted equally
W = U_d · Σ_d	Singular vectors scaled by singular values	Top dimensions dominate
W = U_d · Σ_d^p	Power-weighted singular values	Tunable: p=0 is U, p=1 is full weighting

Levy et al. found that p = 0.5 works best — take the square root of the singular values. This is a well-known trick in information retrieval (Latent Semantic Analysis), but it had never been systematically tested against Word2Vec.

Eigenvalue weighting: why p = 0.5?

The intuition is balance. With p = 0, all dimensions contribute equally — this ignores the fact that top singular values capture more important structure. With p = 1, the top few dimensions dominate so strongly that the remaining dimensions are nearly useless. p = 0.5 provides the Goldilocks balance: respect the importance ordering but don't let any dimension overwhelm the rest.

SVD Eigenvalue Weighting

Adjust the power parameter p to see how singular value weighting changes the effective contribution of each dimension. The bars show the relative weight of the top 20 singular values.

p (eigenvalue power)0.50

python
import numpy as np
from scipy.sparse.linalg import svds

# Build PPMI matrix (sparse V x V)
ppmi = build_ppmi_matrix(corpus, window=5)

# Truncated SVD to d dimensions
d = 300
U, S, Vt = svds(ppmi, k=d)

# Eigenvalue weighting with p=0.5
p = 0.5
word_vectors = U @ np.diag(S ** p)

# Now word_vectors is (V, d) — each row is a word embedding
# Normalize for cosine similarity
norms = np.linalg.norm(word_vectors, axis=1, keepdims=True)
word_vectors = word_vectors / (norms + 1e-8)

Implementation note: For large vocabularies, use scipy's sparse SVD (svds) which works directly on sparse matrices without materializing the full dense V × V matrix. This makes the count-based pipeline practical for vocabularies of 100K+ words.

Why not more dimensions?

You might think: if 300 is good, 3000 should be better. More dimensions = more information preserved. But there's a tradeoff. Low singular values capture noise, not signal. Including them hurts similarity computations by adding noise dimensions that dilute the meaningful ones. The optimal d is usually between 200 and 500 — a point at which the singular values are still meaningfully above the noise floor.

Levy et al. tested d ∈ {100, 200, 300, 500, 1000} and found 300–500 to be the sweet spot across most benchmarks. Going beyond 500 rarely helps and sometimes hurts. This matches the Word2Vec default of 300 — another "hyperparameter" that was well-chosen but rarely questioned.

Adding word and context vectors

After SVD, we have word vectors W = U_dΣ_d^p and, if we want them, context vectors C = V_dΣ_d^p. Should we use just W, or W + C?

Pennington et al. (GloVe, 2014) found that ŵ = w + c consistently outperforms using w alone. The intuition: in a symmetric matrix like PPMI, words and their contexts play symmetric roles. The word "cat" is both a target and a context. Using only the word vector discards half the information. Adding the context vector folds it back in.

Levy et al. confirmed this for SVD too: W + C vectors consistently improve similarity scores by 2–5% across benchmarks. This was another trick hiding in plain sight — applicable to all methods but only used by GloVe.

Why does p = 0.5 (square root of singular values) work better than p = 0 or p = 1 for weighting SVD dimensions?

It makes computation faster It removes noise from the matrix It balances between equal weighting (p=0 ignores importance) and full weighting (p=1 lets top dimensions overwhelm), giving each dimension appropriate influence

Chapter 4: The Skip-gram Link

Now comes the key theoretical connection. Levy and Goldberg (2014) proved that Skip-gram with Negative Sampling (SGNS) implicitly factorizes a shifted PMI matrix. This is the foundation of the entire paper's argument.

What Skip-gram optimizes

Skip-gram trains two sets of vectors: word vectors W and context vectors C. For each observed word-context pair (w, c), it maximizes:

J = Σ_{(w,c) ∈ D} log σ(w · c) + k · E_{c' ~ P_n}[log σ(−w · c')]

Where σ is the sigmoid, k is the number of negative samples, and P_n is the noise distribution (typically P(c)^0.75).

The first term pushes observed (w, c) pairs to have high dot products. The second term pushes random (w, c') pairs to have low dot products. Together: real co-occurrences get positive scores, random pairs get negative scores.

The implicit matrix

Levy and Goldberg showed that when the embedding dimension is large enough, the optimal solution satisfies:

w · c = PMI(w, c) − log(k)

The word-context dot product converges to the PMI minus log(k), where k is the number of negative samples. In matrix form:

W · C^T = M^PMI − log(k) · 1

This is a shifted PPMI matrix (since negative values get clamped in practice by the limited-capacity embedding). The shift −log(k) acts like a threshold: only associations stronger than log(k) survive.

Why this equivalence matters: If Skip-gram is just factorizing a particular PMI matrix, then the "neural vs. count-based" distinction dissolves. Both methods start from the same underlying information (co-occurrence statistics). They differ only in how they perform the factorization: SVD does it exactly (for the given rank), while Skip-gram does it approximately via stochastic gradient descent. The question then becomes: does the factorization method matter, or do the pre-processing choices (what goes into the matrix) matter more?

Deriving the connection

The derivation is elegant. Fix w and c. The objective for this specific pair is:

J(w,c) = #(w,c) · log σ(w · c) + k · #(w) · P_n(c) · log σ(−w · c)

Take the derivative with respect to x = w · c and set to zero:

∂J/∂x = #(w,c) · σ(−x) − k · #(w) · P_n(c) · σ(x) = 0

Solving for x:

#(w,c) · (1 − σ(x)) = k · #(w) · P_n(c) · σ(x)

#(w,c) / (k · #(w) · P_n(c)) = σ(x) / (1 − σ(x)) = e^x

x = log(#(w,c) / (k · #(w) · P_n(c)))

w · c = log(#(w,c) · D / (#(w) · #(c))) − log(k) = PMI(w,c) − log(k)

(where we used P_n(c) = #(c)/D for the unsmoothed case).

The punchline: Skip-gram is not doing something fundamentally new. It is performing an approximate, stochastic factorization of a PMI matrix. The neural architecture is just a particular way to factorize the same information that count-based methods start with. The question becomes: is it the factorization method (neural vs SVD) that matters, or the details of what matrix gets factorized (hyperparameters)?

SVD vs SGD: two ways to factorize

Once we know both methods factorize PMI-like matrices, we can compare the factorization procedures directly:

Property	SVD	Skip-gram (SGD)
Optimality	Optimal rank-d approximation (Frobenius norm)	Approximate (stochastic, non-convex)
Determinism	Deterministic	Stochastic (varies by run)
Scalability	O(V²d) for sparse SVD	O(Cd) where C = corpus size
Regularization	None (exact factorization)	Implicit (SGD noise, early stopping)
Online updates	No (batch only)	Yes (process corpus in one pass)

These differences explain some of the performance variations. SVD is optimal but can overfit to noise in the PMI estimates. SGD's noise acts as implicit regularization, sometimes helping. SVD is batch (you need the full matrix), while Skip-gram is online (you process the corpus in one pass). For very large corpora, Skip-gram's online nature is a major practical advantage.

The GloVe question

GloVe (Global Vectors for Word Representation) by Pennington et al. (2014) was another contender. GloVe explicitly factorizes a log co-occurrence matrix using weighted least squares:

J = Σ_i,j f(X_ij) (w_i · ŵ_j + b_i + b̂_j − log X_ij)²

where f(X_ij) is a weighting function that downweights rare co-occurrences and caps very frequent ones. GloVe was claimed to outperform Word2Vec, but Levy et al. showed this was largely because GloVe used W + C vectors (word + context) while Word2Vec used only W. When both use W + C, the gap nearly vanishes.

In the paper's comprehensive comparison, GloVe performed similarly to SGNS and SVD when given the same hyperparameters. It was not significantly better or worse. This further supports the thesis: the hyperparameters matter, the algorithm doesn't.

What about context representation?

One hyperparameter the paper doesn't explore deeply is the choice of context representation. All methods in the paper use bag-of-words context: the context of a word is the set of words within a window, ignoring order. But you could also use:

Positional context: "cat_left-2" and "cat_right-2" are different context items. This captures word order.
Dependency context: Use syntactic dependencies instead of linear windows. "sat" in "cat sat on mat" would have context {nsubj_cat, prep_on}.
Character n-gram context: Include subword information (later exploited by fastText).

Levy et al. focus on the simplest bag-of-words context because it's the standard in both count-based and prediction-based methods. But the context representation choice is another axis of variation that can affect results — and it's orthogonal to the algorithm choice.

The negative sampling distribution in detail

Word2Vec's negative sampling draws "noise" words from P_n(w) = P(w)^0.75 / Z, where P(w) is the unigram frequency and Z normalizes. Why 0.75 specifically?

The exponent α = 0.75 smooths the distribution. With α = 1 (no smoothing), very frequent words dominate the noise distribution — "the" appears as a negative sample far too often, while rare words like "scepter" almost never appear. With α = 0 (uniform), every word is equally likely as a negative — but this wastes many samples on extremely rare words that are uninformative negatives.

α = 0.75 is a compromise. Mikolov et al. reported tuning this value on a validation set. Levy et al. show that applying the same α = 0.75 to count-based CDS gives nearly identical improvements. The "magic number" is not about the neural training procedure — it's about the statistical properties of natural language frequency distributions.

python
# Computing smoothed PPMI with context distribution smoothing
def compute_smoothed_ppmi(cooccurrence, alpha=0.75):
    """PPMI with context distribution smoothing (CDS)."""
    V = cooccurrence.shape[0]
    row_sums = cooccurrence.sum(axis=1)  # #(w)
    col_sums = cooccurrence.sum(axis=0)  # #(c)
    total = cooccurrence.sum()            # D

    # Smooth context distribution: #(c)^alpha
    smoothed_cols = col_sums ** alpha
    smoothed_total = smoothed_cols.sum()

    # PMI_alpha(w, c) = log(#(w,c) * D_alpha / (#(w) * #(c)^alpha))
    ppmi = np.zeros((V, V))
    for i in range(V):
        for j in range(V):
            if cooccurrence[i, j] > 0:
                pmi = np.log(
                    cooccurrence[i, j] * smoothed_total
                    / (row_sums[i] * smoothed_cols[j])
                )
                ppmi[i, j] = max(0, pmi)
    return ppmi

The complete recipe in one sentence: Subsample frequent words from the corpus (√(t/p(w))), count co-occurrences with harmonic distance weighting, compute PPMI with context distribution smoothing (α=0.75), factorize with truncated SVD (d=300) using Σ^0.5 weighting, add word and context vectors. This matches or beats Word2Vec on every benchmark tested.

Dirty vs. clean subsampling

The paper also distinguishes between "dirty" and "clean" subsampling. In Word2Vec's implementation, subsampling is applied during corpus processing. This means the context window can span across removed words. If "the" is removed from "the cat sat," then "cat" and "sat" are now at distance 1 instead of distance 2. This dirty subsampling effectively enlarges the context window for content words.

Clean subsampling removes words first, then builds the co-occurrence matrix on the remaining corpus. The original positional distances are preserved. Surprisingly, both approaches work similarly well. The paper tests both and finds dirty subsampling is slightly better for analogy tasks and clean subsampling is slightly better for similarity — but the difference is small.

This is another instance of the paper's central theme: a design choice that seems fundamental (dirty vs. clean) turns out to matter much less than the bigger decisions (subsampling vs. no subsampling).

Window size: the most underappreciated hyperparameter

Window size L controls what kind of similarity the embeddings capture. This is one of the paper's most practically useful findings:

Small window (L = 2):

Captures paradigmatic relations
"Cat" is near "dog" (substitutable)
Best for word similarity tasks
Narrow, focused neighborhoods

Large window (L = 5–10):

Captures syntagmatic relations
"Cat" is near "mouse" (co-occurring)
Best for analogy and topic tasks
Broad, topical neighborhoods

The linguistic distinction between paradigmatic and syntagmatic relations dates back to Saussure (1916). Paradigmatic relations hold between words that can substitute for each other ("I like cats/dogs/birds"). Syntagmatic relations hold between words that co-occur ("cat chased mouse"). These are fundamentally different types of semantic relationship, and the window size determines which type dominates your embeddings.

This finding has direct practical implications: if you need embeddings for semantic similarity (paraphrase detection, synonym expansion), use L = 2. If you need embeddings for topic modeling or information retrieval, use L = 5+. There is no universally optimal window size; it depends on your application.

Interaction effects between hyperparameters

The paper also reports important interaction effects between hyperparameters. Some combinations work better than expected, and others worse:

Subsampling + CDS: These two tricks are partially redundant. Both address the frequency imbalance problem (subsampling removes frequent word tokens; CDS smooths the frequency distribution in PMI). Using both helps, but the combined improvement is less than the sum of individual improvements.
Large window + Subsampling: These interact positively. Subsampling removes function words, effectively making the window larger. A window of L=2 with subsampling behaves more like L=4 without subsampling, because the intervening function words are gone.
Dimensionality + Eigenvalue weighting: For low d (100), eigenvalue weighting with p=0.5 helps a lot. For high d (500), it helps less because the additional dimensions already spread the variance more evenly.
SVD + W+C: Adding context vectors (W+C) helps SVD more than it helps SGNS. This makes sense: SVD produces clean, separated W and C matrices, so adding them combines genuinely different information. SGNS's word and context vectors are entangled by the stochastic optimization, so adding them is noisier.

These interactions mean you cannot simply pick the "best" value for each hyperparameter independently. You need to consider the full configuration. This is why the 512-configuration factorial design was necessary.

But Levy et al.'s key finding is that these differences are small compared to the hyperparameter effects. Whether you factorize optimally (SVD) or approximately (SGD) matters less than what you factorize (which hyperparameters you use to construct the matrix).

Skip-gram as Implicit Matrix Factorization

The left shows the PMI matrix. Adjust k (negative samples) to see how the shifted PMI matrix changes — this is what Skip-gram implicitly factorizes.

k (neg samples)5

What matrix does Skip-gram with Negative Sampling implicitly factorize?

The raw co-occurrence count matrix A shifted PMI matrix: W · C^T = PMI(w,c) − log(k), where k is the number of negative samples The identity matrix

Chapter 5: Hyperparameters

This is where the paper makes its most impactful contribution. Levy et al. identified the specific hyperparameters that drive performance, independent of whether you use a neural or count-based method. Each hyperparameter, when transferred from Word2Vec to count-based methods, closes the performance gap.

1. Pre-processing: Subsampling frequent words

Word2Vec aggressively subsamples frequent words before building any co-occurrence statistics. Words appearing with probability p(w) are kept with probability:

P_keep(w) = min(1, √(t / p(w)))

where t is a threshold (typically 10⁻⁵). This removes a huge fraction of "the," "a," "is" tokens. The effect: the effective context window grows (because function words that separated content words are removed), and content words get more direct neighbors.

Why this matters for count-based methods: Nobody was doing subsampling before counting co-occurrences. They built the co-occurrence matrix on the raw corpus. But if you apply Word2Vec's subsampling to the corpus before counting, count-based methods improve dramatically. It's the same trick — just applied before step 1 instead of being baked into step 2.

Let's work through the math. Consider the word "the" which appears with probability p("the") = 0.05 in English. With threshold t = 10⁻⁵:

P_keep("the") = √(10⁻⁵ / 0.05) = √(0.0002) ≈ 0.014

Only 1.4% of "the" tokens survive! For a content word like "algorithm" with p("algorithm") = 10⁻⁶:

P_keep("algorithm") = min(1, √(10⁻⁵ / 10⁻⁶)) = min(1, √10) = 1.0

All instances of "algorithm" survive. The effect on the corpus is dramatic: 98.6% of "the" tokens are removed, leaving content words with direct contact to each other. The effective context window for content words nearly doubles, because the function words that used to separate them are gone.

This single preprocessing step typically improves word similarity scores by 5–10% across all methods. It's the single most impactful hyperparameter in the entire study, and it was never applied to count-based methods before this paper.

Subsampling Survival Rates

Adjust the threshold t to see which words survive subsampling. Function words (the, a, is) are aggressively removed; content words survive intact.

Threshold (t)1e-5

2. Dynamic context windows

Standard count-based methods use a fixed context window of size L: all context words within L positions get equal weight. Word2Vec uses a dynamic window: for each word, it samples a window size uniformly from 1 to L. Closer words are always counted; distant words are counted only sometimes.

This is equivalent to distance-weighted co-occurrence counting: a word at distance d gets weight (L − d + 1) / L. Nearby words matter more than far-away words — which makes linguistic sense.

To see why, consider the phrase "the big red ball bounced." With a fixed window of 5, "the" and "bounced" get the same co-occurrence weight as "big" and "red." But "big red" is a genuine collocation; "the...bounced" is coincidental proximity. The dynamic window, by sometimes using a size of 1 or 2, gives more weight to immediate neighbors.

The math: if the window is sampled uniformly from {1, 2, ..., L}, then a word at distance d is included whenever the sampled window ≥ d. The probability is (L − d + 1)/L. So for L = 5: distance 1 gets weight 1.0, distance 2 gets 0.8, distance 3 gets 0.6, distance 4 gets 0.4, distance 5 gets 0.2. This harmonic weighting emerges naturally from the uniform sampling.

3. Context distribution smoothing (α = 0.75)

We saw this in Chapter 2. Word2Vec's negative sampling uses P(c)^0.75 instead of P(c). This is equivalent to modifying the PMI computation. Applying the same smoothing to count-based PMI gives a substantial boost.

4. Shifted PMI (negative samples as threshold)

Skip-gram with k negative samples implicitly shifts PMI by −log(k). This acts as a threshold: weak associations (PMI < log(k)) become negative and get clamped to zero by the limited-capacity embedding. For count-based methods, you can achieve the same effect explicitly:

SPPMI(w, c) = max(0, PMI(w, c) − log(k))

5. Vector dimensionality and adding context vectors

Two more tricks that help: (a) using d = 300–500 dimensions (not too few, not too many), and (b) after factorization, adding the word vector and the context vector: ŵ = w + c. Word2Vec typically discards context vectors, but Pennington et al. (GloVe, 2014) showed that adding them helps. This works for SVD too.

Hyperparameter Impact Analysis

Toggle each hyperparameter on/off to see its marginal impact on word similarity performance (SimLex-999). Start with all off (baseline count-based) and add them one by one.

Which hyperparameter modification has the most surprising effect when transferred from Word2Vec to count-based methods?

Subsampling frequent words before building the co-occurrence matrix — a preprocessing step that nobody applied to count-based methods, yet it dramatically improves them by effectively growing the context window around content words Using a larger vocabulary Training for more epochs

Chapter 6: The Showdown

Now for the big comparison. Levy et al. tested all methods on multiple word similarity and analogy benchmarks, with all hyperparameters systematically controlled. Here are the methods:

Method	Type	What it does
PPMI	Count	PPMI matrix, full-dimensional (no SVD)
SVD	Count	PPMI + truncated SVD to d dimensions
SGNS	Predict	Skip-gram with negative sampling (Word2Vec)
GloVe	Hybrid	Weighted least-squares on log co-occurrence

Each method was tested with every combination of the hyperparameters from Chapter 5: subsampling (yes/no), dynamic window (yes/no), CDS (yes/no), dirty/clean subsampling, and various dimensionalities.

The results

The headline finding: No single method consistently outperforms the others across all tasks. When given the same hyperparameter tuning, PPMI+SVD matches or beats SGNS on most word similarity benchmarks. The choice of method matters less than the choice of hyperparameters.

Specific findings across benchmarks:

Word similarity (WS-353, SimLex-999): SVD with optimal hyperparameters matches SGNS. PPMI (without SVD) is competitive too.
Analogy tasks (Google, MSR): SGNS has a slight edge on syntactic analogies, but SVD wins on semantic analogies when properly tuned.
Across all tasks: The variance due to hyperparameter choice is much larger than the variance due to algorithm choice. Picking the right hyperparameters matters 5-10x more than picking the right algorithm.

Task-specific observations

The results are nuanced — different tasks favor different configurations:

Word similarity tasks favor small context windows (L = 2). Narrow windows capture paradigmatic relations (words that substitute for each other: dog/cat/horse). These are what similarity benchmarks test.
Analogy tasks favor larger context windows (L = 5–10). Wider windows capture syntagmatic relations (words that co-occur: doctor/hospital). Analogy tasks need both types.
SVD outperforms on similarity when properly tuned, because the best rank-d approximation preserves the most important structure. SGNS slightly wins on syntactic analogies, perhaps because SGD's noise acts as a useful regularizer.

What each hyperparameter does (the leaderboard)

The paper provides a detailed ranking of hyperparameter impact:

Subsampling — the biggest single factor. Improves all methods.
Context distribution smoothing — large impact, especially for count-based methods.
Dynamic context window — moderate impact, helps word similarity more than analogies.
Vector addition (W+C) — helps consistently across all methods.
Eigenvalue weighting (p=0.5) — SVD-specific, but substantial when applicable.

Method Comparison Across Benchmarks

Compare methods with matching hyperparameters. Toggle between benchmarks. The key insight: with the same tuning, methods converge.

python
# The full pipeline that matches Word2Vec, from Levy et al.

def levy_pipeline(corpus, d=300, window=5, alpha=0.75, p=0.5, sub_t=1e-5):
    # Step 1: Subsample frequent words (like Word2Vec)
    corpus = subsample(corpus, threshold=sub_t)

    # Step 2: Count co-occurrences with dynamic window
    counts = count_cooccurrences(corpus, window=window, weighted=True)

    # Step 3: Compute PPMI with context distribution smoothing
    ppmi = compute_ppmi(counts, alpha=alpha)

    # Step 4: Truncated SVD
    U, S, Vt = svds(ppmi, k=d)

    # Step 5: Eigenvalue weighting p=0.5
    W = U @ np.diag(S ** p)
    C = Vt.T @ np.diag(S ** p)

    # Step 6: Add word + context vectors
    vectors = W + C
    return vectors

What is the paper's central finding about algorithm choice vs. hyperparameter choice?

The variance due to hyperparameter choice is much larger than the variance due to algorithm choice — picking the right hyperparameters matters 5-10x more than picking SGNS vs SVD vs PPMI Neural methods always win Count-based methods always win

Chapter 7: Connections

What this paper built on

Distributional Semantics (Firth, 1957; Harris, 1954): The foundational idea that meaning is determined by context. "You shall know a word by the company it keeps."

Latent Semantic Analysis (Deerwester et al., 1990): Applied SVD to term-document matrices for information retrieval. Levy et al. extended this to word-context matrices with modern tricks.

Word2Vec (Mikolov et al., 2013): The Skip-gram and CBOW models that triggered the neural embeddings revolution. Levy et al. showed its innovations were portable to count-based methods.

Neural Word Embedding as Implicit Matrix Factorization (Levy & Goldberg, 2014): The authors' own prior paper proving W · C^T = PMI − log(k). This theoretical result motivated the current paper's experiments.

What this paper enabled

GloVe re-evaluation: Pennington et al. (2014) claimed GloVe outperformed Word2Vec. Levy et al. showed this was largely due to GloVe using W+C vectors while Word2Vec used only W. When controlled, the gap shrinks to near zero. This is a case study in the paper's meta-lesson: the "improvement" was an implementation detail, not an algorithmic advance.

fastText (Bojanowski et al., 2017): Incorporated subword information into Word2Vec. The hyperparameter awareness from Levy et al. influenced the careful ablation studies in fastText. Bojanowski et al. systematically varied window size, dimensionality, and subsampling — a direct inheritance from the Levy et al. methodology.

Evaluation methodology: This paper set the standard for fair comparison in NLP: control all confounders, test all combinations, report variance. It changed how the community evaluates embedding methods.

Contextual word representations: The paper's findings indirectly contributed to the shift toward contextual representations (ELMo, BERT). If static word vectors' quality depends mostly on hyperparameters rather than the method, and if no method consistently wins, perhaps the entire paradigm of static word vectors has reached a ceiling. The natural next step: make representations context-dependent. Within three years of this paper, ELMo (2018) and BERT (2018) made static word embeddings largely obsolete for downstream tasks.

The paper's impact by the numbers

As of 2025, this paper has been cited over 3,000 times. It fundamentally changed the NLP community's approach to:

Fair comparison: Papers now routinely control hyperparameters when comparing methods
Ablation studies: The practice of systematically ablating design choices became standard
Reproducibility: The paper's publicly available code enabled others to replicate and build on the findings
Skepticism toward claims: "X beats Y" papers are now expected to control for all confounders

The paper's influence extends beyond NLP. In computer vision, the same lesson applies: is ResNet better than VGG because of skip connections, or because of batch normalization, learning rate scheduling, and data augmentation? In RL, is PPO better than TRPO because of clipping, or because of larger batch sizes? Levy et al. taught the ML community to ask these questions systematically.

Modern relevance

As of 2025, the paper's central insight — that engineering choices often matter more than algorithms — has been vindicated repeatedly across the ML landscape:

Scaling laws (Kaplan et al., 2020): Model architecture matters less than scale (data, compute, parameters). The specific transformer variant matters less than making it bigger.
Data quality (Raffel et al., 2020): Careful data curation (deduplication, filtering, balancing) often improves performance more than algorithmic innovations.
Training recipes (Touvron et al., 2023): Llama's improvements over prior models came largely from training recipe optimizations (cosine learning rate, specific batch size schedules) rather than architectural changes.
Hyperparameter transfer (Yang et al., 2022): The muP parameterization enables hyperparameter transfer across model scales. The hyperparameters matter so much that having a principled way to set them is itself a major contribution.

Levy et al.'s 2015 paper was ahead of its time. The word embedding community learned the lesson in 2015. The broader ML community is still learning it in 2025.

Closing thought

Omer Levy and Yoav Goldberg wrote in their earlier paper: "Word2Vec is not magic." This follow-up paper proved it. The magic was never in the neural network. It was in the data processing pipeline — in decisions about which words to count, how to weight co-occurrences, and how to handle the overwhelming dominance of frequent words. These decisions were made by talented engineers at Google, tested on internal benchmarks, and bundled into a software release. The NLP community then attributed the improvements to the neural architecture, because that was the novel and exciting part.

The paper's enduring contribution is a methodology and a mindset: when comparing methods, compare fairly. When claiming improvements, isolate the source. When building systems, tune the hyperparameters at least as carefully as you design the algorithm. In machine learning, as in life, the boring details often matter more than the flashy architecture.

Practical recommendations from the paper

The paper concludes with concrete guidelines for practitioners:

Decision	Recommendation	Why
Algorithm	Either SGNS or SVD	No consistent winner — pick based on infrastructure
Preprocessing	Always subsample frequent words	Biggest single improvement for all methods
Window	L=2 for similarity, L=5 for analogies	Narrow captures substitutability; wide captures association
Dimensions	d=300–500	Sweet spot; more adds noise, fewer loses signal
Context smoothing	α = 0.75	Matches Word2Vec's implicit choice; robust across tasks
SVD weighting	p = 0.5	Balance between equal and full eigenvalue weighting
Vector type	Use W + C (word + context vectors)	Consistent 2–5% improvement on all tasks

The lasting lesson

Beyond word embeddings: The paper's deepest lesson transcends word vectors. In ML research, when Method B beats Method A, the improvement might come from: (1) the algorithm itself, (2) better hyperparameters, (3) engineering tricks baked into the implementation, or (4) more compute. Levy et al. showed that for word embeddings, the answer was (2) and (3), not (1). This cautionary tale applies to every "X beats Y" claim in ML. Always ask: are we comparing methods or implementations?

This insight has aged remarkably well. In the deep learning era, we see the same pattern repeat: BERT vs. GPT-2 vs. XLNet differences shrink when you control for compute and data. Scaling laws show that architecture matters less than scale. The lesson from 2015 word embeddings generalizes to 2020s language models: it's almost never the architecture.

A timeline of the debate

Year	Paper	Claim
2003	Bengio et al.	Neural language models learn useful word representations
2010	Turney & Pantel	Survey of count-based distributional methods (VSMs)
2013	Mikolov et al.	Word2Vec: prediction-based methods dominate count-based
2014	Pennington et al.	GloVe: hybrid method outperforms both
2014	Levy & Goldberg	Skip-gram implicitly factorizes PMI — theoretical link
2015	Levy et al.	It's the hyperparameters, not the algorithm
2016	Arora et al.	Random walk theory explains why PMI-based embeddings work
2017	Bojanowski et al.	fastText: subword embeddings, careful ablation following Levy

The paper sits at a pivotal point in this timeline. Before it, the field was caught in a "neural vs. traditional" narrative. After it, the field understood that the distinction is largely meaningless — both approaches manipulate the same underlying co-occurrence statistics.

The Great Convergence

As hyperparameters are equalized between methods, performance converges. Watch how adding each trick to count-based methods closes the gap with SGNS.

Cheat sheet

Core finding

PPMI+SVD matches Word2Vec when given the same hyperparameter tuning

Key equation

W · C^T = PMI(w,c) − log(k) — Skip-gram implicitly factorizes shifted PMI

Top hyperparameters

Subsampling > CDS (α=0.75) > Dynamic window > W+C addition > Eigenvalue weighting (p=0.5)

Best count-based recipe

Subsample → Dynamic window counts → Smoothed PPMI → SVD (d=300) → Σ^0.5 weighting → W+C

Meta-lesson

When comparing methods, control hyperparameters. Variance from tuning dwarfs variance from algorithm choice.

What is the most important meta-lesson from this paper for ML research in general?

When Method B beats Method A, the improvement may come from hyperparameters and implementation tricks rather than the algorithm itself — always ask whether you're comparing methods or implementations Neural methods are always better SVD should be used everywhere