Most gains attributed to neural word embeddings come from hyperparameter tuning, not the neural architecture. Traditional count-based methods match Word2Vec when given the same treatment.
By 2013, word embeddings had taken NLP by storm. Word2Vec's Skip-gram model produced vectors where king − man + woman ≈ queen. It seemed like magic, and the prevailing narrative was clear: neural methods have fundamentally surpassed traditional count-based approaches.
But Levy, Goldberg, and Dagan asked an uncomfortable question: is it really the neural architecture that's responsible for these gains, or is it something else entirely?
The traditional pipeline was simple. You count how often words co-occur in a corpus, build a big matrix, maybe apply some weighting scheme like Pointwise Mutual Information (PMI), and reduce dimensions with Singular Value Decomposition (SVD). This approach dominated for decades. Then Word2Vec arrived, and everyone assumed the improvement came from the prediction-based neural training procedure.
The authors noticed something suspicious. Word2Vec didn't just change the algorithm — it also introduced a suite of clever engineering tricks: dynamic context windows, subsampling of frequent words, negative sampling with a smoothed unigram distribution. These tricks were baked into the implementation and never tested independently.
Consider the timeline. In 2003, Bengio published neural language models. In 2008, Collobert and Weston showed learned word vectors could help NLP tasks. In 2013, Mikolov et al. released Word2Vec with Skip-gram and CBOW. The prevailing story was a clean arc: count → predict, and the predict paradigm won.
But the comparisons were never fair. Word2Vec's implementation included: (a) subsampling frequent words with probability √(t/f(w)), (b) using a dynamic context window sampled from 1 to L rather than a fixed L, (c) raising the noise distribution to the 0.75 power, (d) using a specific number of negative samples as an implicit threshold. Traditional count-based papers used none of these tricks. They were comparing an optimized race car to a stock sedan.
The paper systematically transfers each trick from Word2Vec to count-based methods and measures the impact. The conclusion is stark: there is no consistent winner among the algorithms. The hyperparameters matter far more than whether you use neural prediction or matrix factorization.
What makes this paper rigorous is its systematic factorial experimental design. Instead of testing each method with its "default" settings (which would conflate algorithm choice with configuration choice), the authors test every combination of:
| Factor | Values tested | Combinations |
|---|---|---|
| Algorithm | PPMI, SVD, SGNS, GloVe | 4 |
| Subsampling | None, t=10−5 | 2 |
| Context window | Fixed-2, Fixed-5, Dynamic-2, Dynamic-5 | 4 |
| CDS (α) | 1.0 (none), 0.75 | 2 |
| Dimensions | 100, 200, 300, 500 | 4 |
| Vector type | W only, W + C | 2 |
That's 4 × 2 × 4 × 2 × 4 × 2 = 512 configurations per benchmark. This comprehensive grid allows the paper to isolate the effect of each factor while averaging over all others. When they say "subsampling helps more than algorithm choice," this is backed by 512 data points, not a cherry-picked comparison of 2 configurations.
The analysis uses ANOVA-style decomposition of the total variance. For each benchmark, the paper computes: how much variance is explained by the algorithm factor alone? How much by subsampling? By window type? By CDS? The answer: algorithm choice explains about 10–15% of variance, while subsampling explains 25–35%. Combined, the hyperparameter factors explain 4–5x more variance than the algorithm factor.
The scale of this experiment was unprecedented in the word embedding literature. Most papers compare 2–3 configurations. Levy et al. compare 512. This thoroughness is what makes the conclusions so convincing.
Before this paper, several studies compared count-based and prediction-based methods and reached different conclusions. The reason: they used different defaults. Here's a reconstruction of how this happened:
| Study | Count-based config | Neural config | Winner | Why |
|---|---|---|---|---|
| Baroni et al., 2014 | Standard PPMI+SVD, no subsampling | Word2Vec with all tricks | Neural | Unfair: tricks not transferred |
| Pennington et al., 2014 | W2V without W+C | GloVe with W+C | GloVe | Unfair: W+C not transferred |
| Levy et al., 2015 | PPMI+SVD with ALL tricks | W2V with ALL tricks | Tie | Fair: all confounders controlled |
Each prior study reached a different conclusion because it compared different things. Levy et al. resolved the contradiction by controlling everything. The lesson: when two papers disagree about which method is better, the first question to ask is "are they comparing the same configurations?"
Toggle to see the prevailing narrative vs. what Levy et al. actually found. Click a method to see its components.
Before we can compare methods fairly, we need to understand the count-based pipeline in detail. It starts with the simplest possible idea: you shall know a word by the company it keeps (Firth, 1957).
Given a corpus, we build a matrix M where Mij = how many times word i appears near word j within a context window of size L. "Near" means within L words to the left or right.
For a vocabulary of V words, this gives us a V × V matrix. Each row is a V-dimensional vector representing one word. Words that appear in similar contexts will have similar rows.
Raw co-occurrence counts have a fatal flaw: frequent words dominate everything. The word "the" co-occurs with practically every word in the vocabulary. If you compute cosine similarity between raw count vectors, every word looks similar because they all co-occur with "the," "a," "is," etc.
The solution: Pointwise Mutual Information (PMI). Instead of asking "how often do i and j co-occur?", we ask "how much more do i and j co-occur than we'd expect if they were independent?"
Where P(i, j) is the probability of seeing i and j together, and P(i) · P(j) is what we'd expect by chance. If PMI is positive, the words co-occur more than chance. If negative, they avoid each other.
The PMI matrix is V × V, which for a 100K vocabulary means 10 billion entries. Most are zero (sparse), but it's still unwieldy. We use SVD to compress it to V × d where d might be 300. This captures the most important structure in a dense, low-dimensional representation.
This pipeline was the state of the art for over a decade. Latent Semantic Analysis (Deerwester et al., 1990) used a term-document matrix variant. Hyperspace Analogue to Language (Lund & Burgess, 1996) used word-word co-occurrence. The specific combination of PPMI + truncated SVD was explored by Bullinaria and Levy (2007, 2012) and Turney and Pantel (2010).
The pipeline has two major advantages: it's interpretable (you know exactly what the matrix entries mean) and it's reproducible (SVD is deterministic given the same input). Its disadvantage, pre-Levy-et-al., was supposed to be performance. But that disadvantage turns out to be a myth.
Click each stage to highlight it and see its input/output shapes for a real corpus (V = 100K vocabulary, d = 300 embedding dimensions).
Watch how a context window slides over a sentence to build co-occurrence counts. Adjust the window size to see the effect.
python import numpy as np from collections import Counter, defaultdict def build_cooccurrence(corpus, vocab, window=5, weighted=False): """Build V x V co-occurrence matrix from tokenized corpus.""" word2idx = {w: i for i, w in enumerate(vocab)} V = len(vocab) M = np.zeros((V, V)) for i, word in enumerate(corpus): if word not in word2idx: continue wi = word2idx[word] for d in range(1, window + 1): if i + d < len(corpus) and corpus[i + d] in word2idx: ci = word2idx[corpus[i + d]] # Dynamic window: weight = (L - d + 1) / L weight = (window - d + 1) / window if weighted else 1.0 M[wi, ci] += weight M[ci, wi] += weight return M def subsample(corpus, freqs, threshold=1e-5): """Subsample frequent words (Word2Vec's preprocessing trick).""" total = sum(freqs.values()) return [ w for w in corpus if np.random.random() < min(1.0, np.sqrt(threshold / (freqs[w] / total))) ]
PMI has a problem: when two words never co-occur, PMI = log(0) = −∞. And even for rare but genuine co-occurrences, PMI gives unreliably high values (two rare words that co-occur once get a very high PMI just because their marginal probabilities are tiny).
The standard fix: clamp negative values to zero.
This is called Positive PMI or PPMI. It says: "only keep pairs that co-occur more than expected. If they co-occur less than expected or not at all, just record zero." This turns out to work very well in practice.
Negative PMI values tell you "these words co-occur less than expected." In principle, that's information. But in practice, it's noise. The co-occurrence matrix is extremely sparse — most word pairs never co-occur at all, not because they avoid each other, but because we haven't seen enough data. Setting negative PMI to zero acknowledges this sparsity problem and focuses on the positive signal.
There's a deeper reason too. When we use these vectors for cosine similarity, negative entries create a problem: they can make unrelated words appear similar by giving them both negative values in the same dimensions. Clamping to zero avoids this artifact.
Let's be precise. Given a corpus with co-occurrence counts #(w, c) for word w and context c within a window of size L:
Where D = Σw,c #(w, c) is the total number of (word, context) pairs, and #(w) = Σc #(w, c) is the marginal count of word w. In words: we compare the observed joint frequency to what we'd expect under independence.
Let's work through a concrete example. Suppose our corpus has D = 1,000,000 word-context pairs. The word "cat" appears in 500 pairs total, "mat" appears in 200 pairs total, and "cat" and "mat" co-occur 10 times.
That's a strong positive association. Now consider "cat" and "the" (a function word). If #(the) = 50,000 and #(cat, the) = 25:
"Cat" and "the" co-occur exactly as much as expected by chance — PMI is zero. If we used raw counts, "the" would dominate the representation of "cat." PMI correctly identifies this as uninformative.
Word2Vec with k negative samples introduces an implicit threshold. Words whose PMI is below log(k) get treated as noise. The count-based equivalent is Shifted PPMI:
With k = 5 negative samples (a common default), the shift is log(5) ≈ 1.6. Only associations with PMI above 1.6 survive. This is more aggressive than plain PPMI (which only removes negative PMI) and helps eliminate weak, noisy associations.
One of the key insights from Levy et al. is that Word2Vec's negative sampling implicitly modifies the PMI computation. Specifically, negative sampling with the smoothed unigram distribution P(c)0.75 is equivalent to computing a modified PMI:
where α = 0.75 is the smoothing exponent. This raises rare context words' probabilities, making them more likely to be sampled as negatives. The effect: rare-word associations get penalized less, which helps significantly.
See how PMI and PPMI scores change for different word pairs. Adjust the smoothing exponent α to see how context distribution smoothing (CDS) affects scores.
Let's compute PPMI for the sentence "the cat sat on the mat the dog sat on the rug" with a window of 2. First, the counts:
| the | cat | sat | on | mat | dog | rug | |
|---|---|---|---|---|---|---|---|
| the | 0 | 2 | 1 | 2 | 2 | 2 | 2 |
| cat | 2 | 0 | 1 | 0 | 0 | 0 | 0 |
| sat | 1 | 1 | 0 | 2 | 0 | 1 | 0 |
Total co-occurrences D = sum of all entries. For "cat" and "sat": #(cat,sat) = 1, #(cat) = 3 (total count for cat row), #(sat) = 5 (total for sat column). PMI(cat, sat) = log(1 · D / (3 · 5)). With D ≈ 30: PMI ≈ log(30/(15)) = log(2) = 0.69. Positive — they co-occur more than chance.
For "the" and "cat": #(the,cat) = 2, #(the) = 11, #(cat) = 3. PMI = log(2 · 30 / (11 · 3)) = log(60/33) = log(1.82) = 0.60. Also positive, but this high value for "the" is misleading — it's driven by "the" being very frequent. With CDS (α=0.75), the effective count for "the" would be raised less, reducing this spurious association.
We have a sparse PPMI matrix of size V × V. We need to compress it to V × d where d ≈ 300. Truncated Singular Value Decomposition is the classic tool for this.
SVD decomposes any matrix M into three matrices:
Where U is V × V (left singular vectors), Σ is a diagonal matrix of singular values (sorted in decreasing order), and VT is V × V (right singular vectors).
Truncated SVD keeps only the top d singular values and their associated vectors:
This is the best rank-d approximation of M in the Frobenius norm sense. No other rank-d matrix gets closer to M.
Here's a subtle but important choice. After SVD, we have Ud and Σd. The word vectors could be:
| Choice | Formula | Effect |
|---|---|---|
| W = Ud | Just the left singular vectors | All dimensions weighted equally |
| W = Ud · Σd | Singular vectors scaled by singular values | Top dimensions dominate |
| W = Ud · Σdp | Power-weighted singular values | Tunable: p=0 is U, p=1 is full weighting |
Levy et al. found that p = 0.5 works best — take the square root of the singular values. This is a well-known trick in information retrieval (Latent Semantic Analysis), but it had never been systematically tested against Word2Vec.
The intuition is balance. With p = 0, all dimensions contribute equally — this ignores the fact that top singular values capture more important structure. With p = 1, the top few dimensions dominate so strongly that the remaining dimensions are nearly useless. p = 0.5 provides the Goldilocks balance: respect the importance ordering but don't let any dimension overwhelm the rest.
Adjust the power parameter p to see how singular value weighting changes the effective contribution of each dimension. The bars show the relative weight of the top 20 singular values.
python import numpy as np from scipy.sparse.linalg import svds # Build PPMI matrix (sparse V x V) ppmi = build_ppmi_matrix(corpus, window=5) # Truncated SVD to d dimensions d = 300 U, S, Vt = svds(ppmi, k=d) # Eigenvalue weighting with p=0.5 p = 0.5 word_vectors = U @ np.diag(S ** p) # Now word_vectors is (V, d) — each row is a word embedding # Normalize for cosine similarity norms = np.linalg.norm(word_vectors, axis=1, keepdims=True) word_vectors = word_vectors / (norms + 1e-8)
You might think: if 300 is good, 3000 should be better. More dimensions = more information preserved. But there's a tradeoff. Low singular values capture noise, not signal. Including them hurts similarity computations by adding noise dimensions that dilute the meaningful ones. The optimal d is usually between 200 and 500 — a point at which the singular values are still meaningfully above the noise floor.
Levy et al. tested d ∈ {100, 200, 300, 500, 1000} and found 300–500 to be the sweet spot across most benchmarks. Going beyond 500 rarely helps and sometimes hurts. This matches the Word2Vec default of 300 — another "hyperparameter" that was well-chosen but rarely questioned.
After SVD, we have word vectors W = UdΣdp and, if we want them, context vectors C = VdΣdp. Should we use just W, or W + C?
Pennington et al. (GloVe, 2014) found that ŵ = w + c consistently outperforms using w alone. The intuition: in a symmetric matrix like PPMI, words and their contexts play symmetric roles. The word "cat" is both a target and a context. Using only the word vector discards half the information. Adding the context vector folds it back in.
Levy et al. confirmed this for SVD too: W + C vectors consistently improve similarity scores by 2–5% across benchmarks. This was another trick hiding in plain sight — applicable to all methods but only used by GloVe.
Now comes the key theoretical connection. Levy and Goldberg (2014) proved that Skip-gram with Negative Sampling (SGNS) implicitly factorizes a shifted PMI matrix. This is the foundation of the entire paper's argument.
Skip-gram trains two sets of vectors: word vectors W and context vectors C. For each observed word-context pair (w, c), it maximizes:
Where σ is the sigmoid, k is the number of negative samples, and Pn is the noise distribution (typically P(c)0.75).
The first term pushes observed (w, c) pairs to have high dot products. The second term pushes random (w, c') pairs to have low dot products. Together: real co-occurrences get positive scores, random pairs get negative scores.
Levy and Goldberg showed that when the embedding dimension is large enough, the optimal solution satisfies:
The word-context dot product converges to the PMI minus log(k), where k is the number of negative samples. In matrix form:
This is a shifted PPMI matrix (since negative values get clamped in practice by the limited-capacity embedding). The shift −log(k) acts like a threshold: only associations stronger than log(k) survive.
The derivation is elegant. Fix w and c. The objective for this specific pair is:
Take the derivative with respect to x = w · c and set to zero:
Solving for x:
(where we used Pn(c) = #(c)/D for the unsmoothed case).
Once we know both methods factorize PMI-like matrices, we can compare the factorization procedures directly:
| Property | SVD | Skip-gram (SGD) |
|---|---|---|
| Optimality | Optimal rank-d approximation (Frobenius norm) | Approximate (stochastic, non-convex) |
| Determinism | Deterministic | Stochastic (varies by run) |
| Scalability | O(V2d) for sparse SVD | O(Cd) where C = corpus size |
| Regularization | None (exact factorization) | Implicit (SGD noise, early stopping) |
| Online updates | No (batch only) | Yes (process corpus in one pass) |
These differences explain some of the performance variations. SVD is optimal but can overfit to noise in the PMI estimates. SGD's noise acts as implicit regularization, sometimes helping. SVD is batch (you need the full matrix), while Skip-gram is online (you process the corpus in one pass). For very large corpora, Skip-gram's online nature is a major practical advantage.
GloVe (Global Vectors for Word Representation) by Pennington et al. (2014) was another contender. GloVe explicitly factorizes a log co-occurrence matrix using weighted least squares:
where f(Xij) is a weighting function that downweights rare co-occurrences and caps very frequent ones. GloVe was claimed to outperform Word2Vec, but Levy et al. showed this was largely because GloVe used W + C vectors (word + context) while Word2Vec used only W. When both use W + C, the gap nearly vanishes.
In the paper's comprehensive comparison, GloVe performed similarly to SGNS and SVD when given the same hyperparameters. It was not significantly better or worse. This further supports the thesis: the hyperparameters matter, the algorithm doesn't.
One hyperparameter the paper doesn't explore deeply is the choice of context representation. All methods in the paper use bag-of-words context: the context of a word is the set of words within a window, ignoring order. But you could also use:
Levy et al. focus on the simplest bag-of-words context because it's the standard in both count-based and prediction-based methods. But the context representation choice is another axis of variation that can affect results — and it's orthogonal to the algorithm choice.
Word2Vec's negative sampling draws "noise" words from Pn(w) = P(w)0.75 / Z, where P(w) is the unigram frequency and Z normalizes. Why 0.75 specifically?
The exponent α = 0.75 smooths the distribution. With α = 1 (no smoothing), very frequent words dominate the noise distribution — "the" appears as a negative sample far too often, while rare words like "scepter" almost never appear. With α = 0 (uniform), every word is equally likely as a negative — but this wastes many samples on extremely rare words that are uninformative negatives.
α = 0.75 is a compromise. Mikolov et al. reported tuning this value on a validation set. Levy et al. show that applying the same α = 0.75 to count-based CDS gives nearly identical improvements. The "magic number" is not about the neural training procedure — it's about the statistical properties of natural language frequency distributions.
python # Computing smoothed PPMI with context distribution smoothing def compute_smoothed_ppmi(cooccurrence, alpha=0.75): """PPMI with context distribution smoothing (CDS).""" V = cooccurrence.shape[0] row_sums = cooccurrence.sum(axis=1) # #(w) col_sums = cooccurrence.sum(axis=0) # #(c) total = cooccurrence.sum() # D # Smooth context distribution: #(c)^alpha smoothed_cols = col_sums ** alpha smoothed_total = smoothed_cols.sum() # PMI_alpha(w, c) = log(#(w,c) * D_alpha / (#(w) * #(c)^alpha)) ppmi = np.zeros((V, V)) for i in range(V): for j in range(V): if cooccurrence[i, j] > 0: pmi = np.log( cooccurrence[i, j] * smoothed_total / (row_sums[i] * smoothed_cols[j]) ) ppmi[i, j] = max(0, pmi) return ppmi
The paper also distinguishes between "dirty" and "clean" subsampling. In Word2Vec's implementation, subsampling is applied during corpus processing. This means the context window can span across removed words. If "the" is removed from "the cat sat," then "cat" and "sat" are now at distance 1 instead of distance 2. This dirty subsampling effectively enlarges the context window for content words.
Clean subsampling removes words first, then builds the co-occurrence matrix on the remaining corpus. The original positional distances are preserved. Surprisingly, both approaches work similarly well. The paper tests both and finds dirty subsampling is slightly better for analogy tasks and clean subsampling is slightly better for similarity — but the difference is small.
This is another instance of the paper's central theme: a design choice that seems fundamental (dirty vs. clean) turns out to matter much less than the bigger decisions (subsampling vs. no subsampling).
Window size L controls what kind of similarity the embeddings capture. This is one of the paper's most practically useful findings:
Small window (L = 2):
Large window (L = 5–10):
The linguistic distinction between paradigmatic and syntagmatic relations dates back to Saussure (1916). Paradigmatic relations hold between words that can substitute for each other ("I like cats/dogs/birds"). Syntagmatic relations hold between words that co-occur ("cat chased mouse"). These are fundamentally different types of semantic relationship, and the window size determines which type dominates your embeddings.
This finding has direct practical implications: if you need embeddings for semantic similarity (paraphrase detection, synonym expansion), use L = 2. If you need embeddings for topic modeling or information retrieval, use L = 5+. There is no universally optimal window size; it depends on your application.
The paper also reports important interaction effects between hyperparameters. Some combinations work better than expected, and others worse:
These interactions mean you cannot simply pick the "best" value for each hyperparameter independently. You need to consider the full configuration. This is why the 512-configuration factorial design was necessary.
But Levy et al.'s key finding is that these differences are small compared to the hyperparameter effects. Whether you factorize optimally (SVD) or approximately (SGD) matters less than what you factorize (which hyperparameters you use to construct the matrix).
The left shows the PMI matrix. Adjust k (negative samples) to see how the shifted PMI matrix changes — this is what Skip-gram implicitly factorizes.
This is where the paper makes its most impactful contribution. Levy et al. identified the specific hyperparameters that drive performance, independent of whether you use a neural or count-based method. Each hyperparameter, when transferred from Word2Vec to count-based methods, closes the performance gap.
Word2Vec aggressively subsamples frequent words before building any co-occurrence statistics. Words appearing with probability p(w) are kept with probability:
where t is a threshold (typically 10−5). This removes a huge fraction of "the," "a," "is" tokens. The effect: the effective context window grows (because function words that separated content words are removed), and content words get more direct neighbors.
Let's work through the math. Consider the word "the" which appears with probability p("the") = 0.05 in English. With threshold t = 10−5:
Only 1.4% of "the" tokens survive! For a content word like "algorithm" with p("algorithm") = 10−6:
All instances of "algorithm" survive. The effect on the corpus is dramatic: 98.6% of "the" tokens are removed, leaving content words with direct contact to each other. The effective context window for content words nearly doubles, because the function words that used to separate them are gone.
This single preprocessing step typically improves word similarity scores by 5–10% across all methods. It's the single most impactful hyperparameter in the entire study, and it was never applied to count-based methods before this paper.
Adjust the threshold t to see which words survive subsampling. Function words (the, a, is) are aggressively removed; content words survive intact.
Standard count-based methods use a fixed context window of size L: all context words within L positions get equal weight. Word2Vec uses a dynamic window: for each word, it samples a window size uniformly from 1 to L. Closer words are always counted; distant words are counted only sometimes.
This is equivalent to distance-weighted co-occurrence counting: a word at distance d gets weight (L − d + 1) / L. Nearby words matter more than far-away words — which makes linguistic sense.
To see why, consider the phrase "the big red ball bounced." With a fixed window of 5, "the" and "bounced" get the same co-occurrence weight as "big" and "red." But "big red" is a genuine collocation; "the...bounced" is coincidental proximity. The dynamic window, by sometimes using a size of 1 or 2, gives more weight to immediate neighbors.
The math: if the window is sampled uniformly from {1, 2, ..., L}, then a word at distance d is included whenever the sampled window ≥ d. The probability is (L − d + 1)/L. So for L = 5: distance 1 gets weight 1.0, distance 2 gets 0.8, distance 3 gets 0.6, distance 4 gets 0.4, distance 5 gets 0.2. This harmonic weighting emerges naturally from the uniform sampling.
We saw this in Chapter 2. Word2Vec's negative sampling uses P(c)0.75 instead of P(c). This is equivalent to modifying the PMI computation. Applying the same smoothing to count-based PMI gives a substantial boost.
Skip-gram with k negative samples implicitly shifts PMI by −log(k). This acts as a threshold: weak associations (PMI < log(k)) become negative and get clamped to zero by the limited-capacity embedding. For count-based methods, you can achieve the same effect explicitly:
Two more tricks that help: (a) using d = 300–500 dimensions (not too few, not too many), and (b) after factorization, adding the word vector and the context vector: ŵ = w + c. Word2Vec typically discards context vectors, but Pennington et al. (GloVe, 2014) showed that adding them helps. This works for SVD too.
Toggle each hyperparameter on/off to see its marginal impact on word similarity performance (SimLex-999). Start with all off (baseline count-based) and add them one by one.
Now for the big comparison. Levy et al. tested all methods on multiple word similarity and analogy benchmarks, with all hyperparameters systematically controlled. Here are the methods:
| Method | Type | What it does |
|---|---|---|
| PPMI | Count | PPMI matrix, full-dimensional (no SVD) |
| SVD | Count | PPMI + truncated SVD to d dimensions |
| SGNS | Predict | Skip-gram with negative sampling (Word2Vec) |
| GloVe | Hybrid | Weighted least-squares on log co-occurrence |
Each method was tested with every combination of the hyperparameters from Chapter 5: subsampling (yes/no), dynamic window (yes/no), CDS (yes/no), dirty/clean subsampling, and various dimensionalities.
Specific findings across benchmarks:
The results are nuanced — different tasks favor different configurations:
The paper provides a detailed ranking of hyperparameter impact:
Compare methods with matching hyperparameters. Toggle between benchmarks. The key insight: with the same tuning, methods converge.
python # The full pipeline that matches Word2Vec, from Levy et al. def levy_pipeline(corpus, d=300, window=5, alpha=0.75, p=0.5, sub_t=1e-5): # Step 1: Subsample frequent words (like Word2Vec) corpus = subsample(corpus, threshold=sub_t) # Step 2: Count co-occurrences with dynamic window counts = count_cooccurrences(corpus, window=window, weighted=True) # Step 3: Compute PPMI with context distribution smoothing ppmi = compute_ppmi(counts, alpha=alpha) # Step 4: Truncated SVD U, S, Vt = svds(ppmi, k=d) # Step 5: Eigenvalue weighting p=0.5 W = U @ np.diag(S ** p) C = Vt.T @ np.diag(S ** p) # Step 6: Add word + context vectors vectors = W + C return vectors
Distributional Semantics (Firth, 1957; Harris, 1954): The foundational idea that meaning is determined by context. "You shall know a word by the company it keeps."
Latent Semantic Analysis (Deerwester et al., 1990): Applied SVD to term-document matrices for information retrieval. Levy et al. extended this to word-context matrices with modern tricks.
Word2Vec (Mikolov et al., 2013): The Skip-gram and CBOW models that triggered the neural embeddings revolution. Levy et al. showed its innovations were portable to count-based methods.
Neural Word Embedding as Implicit Matrix Factorization (Levy & Goldberg, 2014): The authors' own prior paper proving W · CT = PMI − log(k). This theoretical result motivated the current paper's experiments.
GloVe re-evaluation: Pennington et al. (2014) claimed GloVe outperformed Word2Vec. Levy et al. showed this was largely due to GloVe using W+C vectors while Word2Vec used only W. When controlled, the gap shrinks to near zero. This is a case study in the paper's meta-lesson: the "improvement" was an implementation detail, not an algorithmic advance.
fastText (Bojanowski et al., 2017): Incorporated subword information into Word2Vec. The hyperparameter awareness from Levy et al. influenced the careful ablation studies in fastText. Bojanowski et al. systematically varied window size, dimensionality, and subsampling — a direct inheritance from the Levy et al. methodology.
Evaluation methodology: This paper set the standard for fair comparison in NLP: control all confounders, test all combinations, report variance. It changed how the community evaluates embedding methods.
Contextual word representations: The paper's findings indirectly contributed to the shift toward contextual representations (ELMo, BERT). If static word vectors' quality depends mostly on hyperparameters rather than the method, and if no method consistently wins, perhaps the entire paradigm of static word vectors has reached a ceiling. The natural next step: make representations context-dependent. Within three years of this paper, ELMo (2018) and BERT (2018) made static word embeddings largely obsolete for downstream tasks.
As of 2025, this paper has been cited over 3,000 times. It fundamentally changed the NLP community's approach to:
The paper's influence extends beyond NLP. In computer vision, the same lesson applies: is ResNet better than VGG because of skip connections, or because of batch normalization, learning rate scheduling, and data augmentation? In RL, is PPO better than TRPO because of clipping, or because of larger batch sizes? Levy et al. taught the ML community to ask these questions systematically.
As of 2025, the paper's central insight — that engineering choices often matter more than algorithms — has been vindicated repeatedly across the ML landscape:
Levy et al.'s 2015 paper was ahead of its time. The word embedding community learned the lesson in 2015. The broader ML community is still learning it in 2025.
Omer Levy and Yoav Goldberg wrote in their earlier paper: "Word2Vec is not magic." This follow-up paper proved it. The magic was never in the neural network. It was in the data processing pipeline — in decisions about which words to count, how to weight co-occurrences, and how to handle the overwhelming dominance of frequent words. These decisions were made by talented engineers at Google, tested on internal benchmarks, and bundled into a software release. The NLP community then attributed the improvements to the neural architecture, because that was the novel and exciting part.
The paper's enduring contribution is a methodology and a mindset: when comparing methods, compare fairly. When claiming improvements, isolate the source. When building systems, tune the hyperparameters at least as carefully as you design the algorithm. In machine learning, as in life, the boring details often matter more than the flashy architecture.
The paper concludes with concrete guidelines for practitioners:
| Decision | Recommendation | Why |
|---|---|---|
| Algorithm | Either SGNS or SVD | No consistent winner — pick based on infrastructure |
| Preprocessing | Always subsample frequent words | Biggest single improvement for all methods |
| Window | L=2 for similarity, L=5 for analogies | Narrow captures substitutability; wide captures association |
| Dimensions | d=300–500 | Sweet spot; more adds noise, fewer loses signal |
| Context smoothing | α = 0.75 | Matches Word2Vec's implicit choice; robust across tasks |
| SVD weighting | p = 0.5 | Balance between equal and full eigenvalue weighting |
| Vector type | Use W + C (word + context vectors) | Consistent 2–5% improvement on all tasks |
This insight has aged remarkably well. In the deep learning era, we see the same pattern repeat: BERT vs. GPT-2 vs. XLNet differences shrink when you control for compute and data. Scaling laws show that architecture matters less than scale. The lesson from 2015 word embeddings generalizes to 2020s language models: it's almost never the architecture.
| Year | Paper | Claim |
|---|---|---|
| 2003 | Bengio et al. | Neural language models learn useful word representations |
| 2010 | Turney & Pantel | Survey of count-based distributional methods (VSMs) |
| 2013 | Mikolov et al. | Word2Vec: prediction-based methods dominate count-based |
| 2014 | Pennington et al. | GloVe: hybrid method outperforms both |
| 2014 | Levy & Goldberg | Skip-gram implicitly factorizes PMI — theoretical link |
| 2015 | Levy et al. | It's the hyperparameters, not the algorithm |
| 2016 | Arora et al. | Random walk theory explains why PMI-based embeddings work |
| 2017 | Bojanowski et al. | fastText: subword embeddings, careful ablation following Levy |
The paper sits at a pivotal point in this timeline. Before it, the field was caught in a "neural vs. traditional" narrative. After it, the field understood that the distinction is largely meaningless — both approaches manipulate the same underlying co-occurrence statistics.
As hyperparameters are equalized between methods, performance converges. Watch how adding each trick to count-based methods closes the gap with SGNS.