Three extensions that made Word2Vec practical at scale: negative sampling replaces the expensive softmax, subsampling discards frequent words, and phrase detection treats "New York" as a single token.
The original Word2Vec paper showed that word vectors learned by Skip-gram produce stunning results — king − man + woman ≈ queen. But there was a computational elephant in the room: the softmax over the entire vocabulary.
Recall the Skip-gram probability:
The numerator is cheap: one dot product. The denominator is catastrophic: V dot products, V exponentials, and a sum. For V = 1 million words, every single training example requires 1 million dot products. With a corpus of 6 billion words and window size c = 5, that is ~60 billion training pairs, each requiring ~1 million operations. This is 6 × 1016 floating-point operations — utterly infeasible.
The first paper introduced hierarchical softmax (O(log V) per step) as a solution. But this follow-up paper presents something even better for Skip-gram: negative sampling. Instead of computing a probability over all V words, you only update a small number of "negative" (incorrect) words alongside the correct one.
Each training step requires scoring ALL V words. With negative sampling, you only score k+1 words (1 positive + k negatives). Drag k to see the cost reduction.
Let's put the cost in concrete terms. Training Skip-gram with full softmax on 6 billion words:
| Step | Operations | Time estimate |
|---|---|---|
| Per training pair | V × d = 106 × 300 = 3 × 108 FLOPs | ~1 ms |
| Per word position | 2c pairs × 3 × 108 = 3 × 109 FLOPs | ~10 ms |
| Full corpus | 6 × 109 positions × 3 × 109 = 1.8 × 1019 FLOPs | ~570 years on 1 CPU core |
With negative sampling (k=5): each training pair costs (k+1) × d = 1800 FLOPs. The full corpus processes in 6 × 109 × 10 × 1800 = 1.08 × 1014 FLOPs ≈ ~1 day on a multi-core machine. That is a factor of 166,000x reduction in compute.
The field explored three solutions, each with different tradeoffs:
| Method | Cost per step | Exact? | Best for |
|---|---|---|---|
| Full softmax | O(V · d) | Yes | Small vocabularies only |
| Hierarchical softmax | O(log V · d) | Yes | CBOW, moderate vocab |
| NCE | O(k · d) | Approx (converges) | Density estimation |
| Negative sampling | O(k · d) | No (but better vectors) | Skip-gram, large data |
The paper showed that NEG — the least theoretically rigorous method — produces the best word vectors. Sometimes practical performance matters more than theoretical guarantees.
Before we get to negative sampling, we need to understand the idea it simplifies: Noise Contrastive Estimation (NCE), introduced by Gutmann and Hyvarinen (2010).
The core idea of NCE is to convert a density estimation problem (estimating P(w|context)) into a binary classification problem: given a word-context pair, is this a real pair from the data, or a fake pair generated by a noise distribution?
We have two sources of word-context pairs:
For each real pair, we generate k noise samples. So the ratio of real to noise examples is 1 : k.
We train a binary classifier to distinguish real from noise. Using Bayes' rule, the posterior probability that a pair is real (given the pair and the mixture ratio) is:
NCE parameterizes log P(w|c) with a model (our dot product: v'w · vc) and maximizes the log-likelihood of correctly classifying real vs. noise pairs.
NCE is statistically consistent: it approximates the true softmax distribution. As k increases, the approximation improves. Gutmann and Hyvarinen showed that for k = V (one noise sample per vocabulary word), NCE recovers the exact maximum likelihood estimate. But the whole point is to use k ≪ V.
Let's write the full NCE loss for a single (center, context) pair. Denote the model's log-probability as s(w, c) = v'w · vc (the unnormalized score). NCE treats the normalization constant Z as a learnable parameter (or fixes it to 1). The posterior is:
The loss for one positive pair and k noise samples is:
Notice that NCE needs to compute log Pn(w) for each noise sample inside the posterior. This is an extra lookup step. Negative sampling drops this entirely.
For each real (center, context) pair, generate k noise pairs by sampling random words. Train a classifier to distinguish them. More noise samples = better approximation.
NCE is theoretically beautiful but carries overhead: it needs to compute the noise distribution probability Pn(w) and the posterior ratio at each step. Mikolov et al. asked: for learning word vectors, do we even need the full NCE guarantee?
The answer: no. Since we only care about vector quality (not recovering the exact probability distribution), we can simplify NCE further. The result is Negative Sampling (NEG) — a stripped-down variant that works spectacularly well in practice.
NCE's classifier computes:
Negative sampling simplifies this by dropping the log(k · Pn(w)) correction term:
This is a raw sigmoid on the dot product. No noise distribution correction. No normalization constant. Just: "are these two vectors compatible?"
The correction term log(k · Pn(w)) compensates for the fact that frequent words appear more often as noise samples. Dropping it means the model won't recover the exact softmax probabilities. But for word vector quality, this doesn't matter. The embeddings are defined by the relative geometry of vectors, not by the exact probabilities. Negative sampling learns the same geometry as full softmax — it just doesn't guarantee the same probabilities.
Both NCE and NEG need a noise distribution Pn(w) to sample negative words from. The paper found that the unigram distribution raised to the 3/4 power works best:
where f(w) is the frequency of word w in the corpus.
Why 3/4? Raw unigram sampling (exponent 1) samples too many frequent words like "the," "a," "is." Uniform sampling (exponent 0) samples too many rare words. The 3/4 power is a sweet spot that slightly over-represents rare words relative to their frequency, giving the model more informative negatives.
Compare sampling probabilities under different exponents α. At α=3/4, rare words get boosted relative to raw frequency, giving more informative negatives.
python import numpy as np def build_noise_distribution(word_freqs, alpha=0.75): """ word_freqs: array of word frequencies, shape (V,) alpha: exponent (0.75 = paper's recommended value) Returns: sampling probabilities, shape (V,) """ # Raise frequencies to the 3/4 power powered = word_freqs ** alpha # shape: (V,) # Normalize to get a probability distribution noise_dist = powered / powered.sum() return noise_dist # Example: "the" is 100x more frequent than "queen" freqs = np.array([1000, 500, 200, 50, 10]) # the, of, cat, queen, zinc words = ["the", "of", "cat", "queen", "zinc"] raw = freqs / freqs.sum() # [0.568, 0.284, 0.114, 0.028, 0.006] neg = build_noise_distribution(freqs) # [0.425, 0.259, 0.140, 0.058, 0.018] — rare words boosted for w, r, n in zip(words, raw, neg): print(f"{w:8s} raw: {r:.3f} neg: {n:.3f} boost: {n/r:.2f}x")
Now we can write the full negative sampling objective. This is the equation that replaced the softmax and made Word2Vec trainable on billions of words.
For a (center word wI, context word wO) pair, the NEG objective is:
Let's unpack each term:
Term 1: log σ(v'wO · vwI) — the positive term. The sigmoid σ(x) = 1/(1 + exp(−x)). This is high when v'wO and vwI point in similar directions (large positive dot product). The loss encourages the output vector of the true context word to be close to the input vector of the center word.
Term 2: ∑ log σ(−v'wi · vwI) — the negative terms. Note the minus sign inside the sigmoid. σ(−x) = 1 − σ(x). This is high when v'wi and vwI point in different directions (small or negative dot product). The loss encourages the output vectors of noise words to be far from the input vector of the center word.
The gradient with respect to the output vector v'w is remarkably simple:
For the positive word: the gradient is proportional to (prediction − 1) · input vector. It pushes v'wO toward vwI. For each negative word: the gradient is proportional to (prediction − 0) · input vector. It pushes v'wi away from vwI.
For V = 1,000,000 and k = 5: that is a 200,000x speedup. The paper recommends k = 5-20 for large datasets and k = 5-50 for small datasets.
Let d = 3, center = "sat" with vsat = [0.5, −0.2, 0.3]. True context = "cat" with v'cat = [0.1, 0.4, −0.1]. Negative sample = "tree" with v'tree = [−0.3, 0.1, 0.6].
Positive term:
Gradient: (σ(−0.06) − 1) · vsat = −0.515 · [0.5, −0.2, 0.3] = [−0.258, 0.103, −0.155]. This pushes v'cat toward vsat.
Negative term:
Gradient: σ(0.01) · vsat = 0.5025 · [0.5, −0.2, 0.3] = [0.251, −0.101, 0.151]. This pushes v'tree away from vsat.
Total loss: L = 0.724 + 0.698 = 1.422. After updating with learning rate 0.1, v'cat moves 0.026 closer to vsat and v'tree moves 0.025 away.
For center word "sat", the true context "cat" is pushed closer (green arrows), while k random negative words are pushed away (red arrows). Click "Train Step" to see an update.
python import numpy as np def neg_sampling_loss(center_id, context_id, neg_ids, W_in, W_out): """ center_id: int — center word context_id: int — true context word (positive) neg_ids: list of int — k negative samples W_in: (V, d) — input embeddings W_out: (V, d) — output embeddings Returns: loss, gradient for W_in[center_id], gradients for W_out rows """ v_c = W_in[center_id] # (d,) — center embedding # Positive term: push context word toward center v_pos = W_out[context_id] # (d,) dot_pos = v_pos @ v_c # scalar sig_pos = 1 / (1 + np.exp(-dot_pos)) loss = -np.log(sig_pos + 1e-10) # Gradient for positive output vector grad_pos = (sig_pos - 1) * v_c # (d,) — points toward center grad_center = (sig_pos - 1) * v_pos # Negative terms: push noise words away from center for neg_id in neg_ids: v_neg = W_out[neg_id] # (d,) dot_neg = v_neg @ v_c # scalar sig_neg = 1 / (1 + np.exp(-dot_neg)) loss -= np.log(1 - sig_neg + 1e-10) # Gradient: push this negative word AWAY from center grad_neg = sig_neg * v_c # (d,) W_out[neg_id] -= lr * grad_neg grad_center += sig_neg * v_neg # Update W_out[context_id] -= lr * grad_pos W_in[center_id] -= lr * grad_center return loss # Total cost per step: (k+1) dot products + (k+1) sigmoid operations # For k=5, d=300: 6 * 300 = 1,800 FLOPs vs. 1,000,000 * 300 for softmax
python import torch import torch.nn as nn import numpy as np from collections import Counter class Word2VecDataset(torch.utils.data.Dataset): def __init__(self, corpus, vocab, window=5, k=5, t=1e-5): """ corpus: list of list of int (tokenized sentences) vocab: dict mapping word -> index """ self.k = k self.pairs = [] # Build frequency table for subsampling + noise dist freq = Counter() total = 0 for sent in corpus: for w in sent: freq[w] += 1 total += 1 # Subsampling probabilities self.keep_prob = {} for w, c in freq.items(): f = c / total self.keep_prob[w] = np.sqrt(t / f) if f > t else 1.0 # Noise distribution: f(w)^(3/4) freqs = np.array([freq.get(i, 0) for i in range(len(vocab))]) powered = freqs ** 0.75 self.noise_dist = powered / powered.sum() # Generate training pairs with subsampling for sent in corpus: # Apply subsampling filtered = [w for w in sent if np.random.random() < self.keep_prob.get(w, 1)] for i, center in enumerate(filtered): for j in range(max(0, i-window), min(len(filtered), i+window+1)): if i != j: self.pairs.append((center, filtered[j])) def __len__(self): return len(self.pairs) def __getitem__(self, idx): center, context = self.pairs[idx] # Sample k negatives from noise distribution negatives = np.random.choice(len(self.noise_dist), size=self.k, p=self.noise_dist) return (torch.tensor(center), torch.tensor(context), torch.tensor(negatives, dtype=torch.long)) class SkipGramNEG(nn.Module): def __init__(self, V, d=300): super().__init__() self.W_in = nn.Embedding(V, d) self.W_out = nn.Embedding(V, d) nn.init.uniform_(self.W_in.weight, -0.5/d, 0.5/d) nn.init.zeros_(self.W_out.weight) def forward(self, center, context, negatives): v_c = self.W_in(center) # (B, d) v_pos = self.W_out(context) # (B, d) v_neg = self.W_out(negatives) # (B, k, d) pos_score = (v_c * v_pos).sum(1) # (B,) pos_loss = -torch.nn.functional.logsigmoid(pos_score) neg_score = torch.bmm(v_neg, v_c.unsqueeze(2)).squeeze(2) # (B, k) neg_loss = -torch.nn.functional.logsigmoid(-neg_score).sum(1) return (pos_loss + neg_loss).mean() # Training loop model = SkipGramNEG(V=10000, d=300) optimizer = torch.optim.SparseAdam(model.parameters(), lr=0.025) for epoch in range(5): for center, context, negatives in dataloader: loss = model(center, context, negatives) optimizer.zero_grad() loss.backward() optimizer.step() # Extract word vectors word_vectors = model.W_in.weight.detach().numpy() # (V, 300)
SparseAdam exploits this sparsity for faster training. Standard Adam would update all V rows due to how momentum works, wasting computation on untouched embeddings.The original Word2Vec implementation uses a linear learning rate decay: the learning rate starts at 0.025 and decreases linearly to near zero over the course of training. This is unusual — most modern systems use constant LR with warm-up, or cosine decay. But linear decay works well for Word2Vec because:
python # Linear learning rate decay in Word2Vec def get_lr(current_word, total_words, initial_lr=0.025, min_lr=0.0001): """Linear decay from initial_lr to min_lr over the corpus.""" progress = current_word / total_words return max(min_lr, initial_lr * (1 - progress)) # At the start: lr = 0.025 (big steps, fast learning) # At 50% done: lr = 0.0125 (moderate steps) # At 99% done: lr = 0.00025 (tiny adjustments)
The original C implementation uses Hogwild-style asynchronous SGD: multiple threads read from the same corpus file at different positions and update the shared embedding matrices without locks. This sounds chaotic — won't threads overwrite each other's updates? In practice, it works because:
The original C implementation stores word vectors as a contiguous float array of size V × d. Each vector lookup is pointer arithmetic: syn0 + word_id * d. No hash table, no indirection. For V = 1M and d = 300 in float32, the two embedding matrices total 2 × 1M × 300 × 4 = 2.4 GB — comfortably fitting in RAM on 2013 hardware.
Modern PyTorch implementations use nn.Embedding which has the same memory layout but adds autograd bookkeeping. For maximum training speed, the original C code outperforms PyTorch by ~3-5x due to avoiding Python overhead and using raw BLAS operations.
The original implementation by Mikolov is a single C file of ~650 lines. Key design decisions:
exp() at runtime. Lookups are faster than computing exp().c // Simplified core of word2vec.c (Skip-gram with NEG) // The actual training loop for one (center, context) pair // Positive example: push context toward center for (c = 0; c < layer1_size; c++) neu1e[c] = 0; for (d = 0; d < negative + 1; d++) { if (d == 0) { target = word; // positive: true context word label = 1; } else { target = table[rand() % table_size]; // O(1) negative sample label = 0; } l2 = target * layer1_size; // pointer to output vector f = 0; for (c = 0; c < layer1_size; c++) f += syn0[l1 + c] * syn1neg[l2 + c]; // dot product g = (label - expTable[sigmoid_index(f)]) * alpha; // gradient for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[l2 + c]; // accumulate input gradient for (c = 0; c < layer1_size; c++) syn1neg[l2 + c] += g * syn0[l1 + c]; // update output vector } // Update input vector for (c = 0; c < layer1_size; c++) syn0[l1 + c] += neu1e[c];
In a typical English corpus, words like "the," "a," "in," and "is" appear millions of times. The word "the" might account for 7% of all tokens. This creates two problems for Word2Vec:
The paper's solution: randomly discard frequent words during training, with a probability that depends on word frequency.
Each word w in the training corpus is discarded with probability:
where f(w) is the frequency of word w (fraction of total tokens) and t is a threshold, typically t = 10−5.
Consider a word with frequency f(w) = 0.01 (1% of corpus, like "of") and threshold t = 10−5:
So 96.8% of occurrences of "of" are discarded! Only 3.2% survive.
For a word with frequency f(w) = 10−5 (rare word, same as threshold):
Rare words are never discarded.
The bar chart shows how many occurrences survive subsampling. Frequent words (left) are aggressively pruned; rare words (right) are untouched. Drag the threshold t.
python import numpy as np def subsample(word_freqs, threshold=1e-5): """ word_freqs: dict mapping word -> frequency (fraction of total tokens) Returns: dict mapping word -> keep probability """ keep_probs = {} for word, freq in word_freqs.items(): if freq > threshold: keep_probs[word] = np.sqrt(threshold / freq) else: keep_probs[word] = 1.0 # rare words: always keep return keep_probs # Example freqs = {"the": 0.07, "of": 0.03, "cat": 0.0001, "queen": 0.00001} keep = subsample(freqs) # the: 1.2% kept, of: 1.8% kept, cat: 31.6% kept, queen: 100% kept
Subsampling helps in two ways simultaneously:
Original sentence: "the cat sat on the fluffy mat"
With t = 10−5, most occurrences of "the" and "on" are discarded. A likely surviving version:
After subsampling: "cat sat fluffy mat"
Now with window c = 2, the context of "sat" is {cat, fluffy, mat} instead of {the, cat, on, the}. Every context word is meaningful. "The" and "on" added only noise — they co-occur with everything and carry no discriminative information.
Frequent words like "the" and "on" are removed, leaving only informative context words. The effective window now connects content words that were previously separated by function words.
Subsampling has an additional hidden benefit: it effectively increases the context window. When frequent words are removed, distant words that were outside the original window now become visible. If "the" between "cat" and "fluffy" is removed, those words become adjacent, effectively making c larger for rare, informative words.
The subsampling formula P(discard) = 1 − √(t/f(w)) was chosen empirically, but it has an intuitive derivation. We want a keep probability that:
The √(t/f) formula approximately satisfies the last condition. If word w appears N · f(w) times in a corpus of N tokens, after subsampling it appears approximately:
For very frequent words (f ≫ t), the surviving count grows as √f — much slower than linear f. For a word appearing 107 times, only about 103.5 ≈ 3,162 occurrences survive. This is still enough to learn a good vector, but doesn't waste compute on redundant examples.
python # Visualizing the effect of subsampling on a Zipfian distribution import numpy as np # Zipf's law: frequency of rank-r word ≈ C/r V = 100000 ranks = np.arange(1, V+1) freqs = 1.0 / ranks freqs = freqs / freqs.sum() # normalize to probabilities t = 1e-5 keep_probs = np.sqrt(t / freqs) keep_probs = np.clip(keep_probs, 0, 1) # Top 10 words: original vs. effective frequency for i in range(10): print(f"Rank {i+1}: freq={freqs[i]:.6f} keep={keep_probs[i]:.4f} effective={freqs[i]*keep_probs[i]:.6f}") # Output shows that rank-1 word goes from 8.3% to 0.03% of effective corpus # while rank-1000 word goes from 0.008% to 0.003% — much less dramatic
Word2Vec treats every token as atomic, but many natural concepts span multiple words. "New York" is not the sum of "New" and "York." "Ice cream" has nothing to do with "ice" or "cream" individually. "Machine learning" is a concept, not two separate words.
The paper introduces a simple, elegant algorithm to automatically detect phrases in the corpus and treat them as single tokens.
For each pair of adjacent words (wi, wj), compute a score based on pointwise mutual information (PMI):
Where:
If "New" appears 500,000 times and "York" appears 200,000 times in a corpus of 1 billion words, random chance would predict them appearing together roughly (500,000/109) × (200,000/109) × 109 = 100 times. If "New York" actually appears 150,000 times, it's occurring 1,500x more often than chance predicts. That is a phrase.
Meanwhile, "this is" might appear 500,000 times, but "this" appears 10 million times and "is" appears 8 million times. By chance alone, we'd expect (10M/109) × (8M/109) × 109 = 80,000. So "this is" is only ~6x above chance. Not a phrase.
The algorithm runs multiple passes. In the first pass, it detects bigrams ("New_York", "ice_cream"). In the second pass, it can detect trigrams by combining bigrams with adjacent words ("New_York_City"). In practice, 2-4 passes capture most useful phrases.
Corpus size: 1 billion words. Consider two candidate phrases:
"New York":
Score of 1,500 is far above any reasonable threshold. "New York" is definitely a phrase.
"this is":
Score of 6.25 is well below typical thresholds (δ ≈ 100). "This is" is just two common words that happen to co-occur.
The δ parameter in the numerator (count(wi, wj) − δ) serves as a minimum count threshold. If a bigram appears fewer than δ times, the numerator becomes negative and the pair is automatically rejected. This prevents statistical flukes from creating spurious phrases. The paper uses δ = 5 as a default.
Bigram scores based on pointwise mutual information. High scores (teal) indicate genuine phrases; low scores (dim) are coincidental word pairs. Drag the threshold to see which bigrams qualify.
python def detect_phrases(corpus, min_count=5, threshold=100, delta=5): """ corpus: list of list of str (tokenized sentences) Returns: corpus with phrases joined by '_' """ from collections import Counter # Count unigrams and bigrams unigrams = Counter() bigrams = Counter() total = 0 for sentence in corpus: for i, word in enumerate(sentence): unigrams[word] += 1 total += 1 if i < len(sentence) - 1: bigrams[(word, sentence[i+1])] += 1 # Score each bigram phrases = set() for (w1, w2), count in bigrams.items(): if count < min_count: continue score = (count - delta) / (unigrams[w1] * unigrams[w2]) * total if score > threshold: phrases.add((w1, w2)) # Replace bigrams with phrases new_corpus = [] for sentence in corpus: new_sent = [] i = 0 while i < len(sentence): if i < len(sentence) - 1 and (sentence[i], sentence[i+1]) in phrases: new_sent.append(sentence[i] + "_" + sentence[i+1]) i += 2 else: new_sent.append(sentence[i]) i += 1 new_corpus.append(new_sent) return new_corpus
Once phrases are detected and the corpus is preprocessed (replacing "New" "York" with "New_York"), we simply train Word2Vec as before. The phrase tokens are treated exactly like any other word. They get their own embedding vectors, learned from context just like single words.
The beautiful thing is that phrase vectors capture meaning that is not the sum of their parts. The vector for "New_York" is not simply vec("New") + vec("York"). It's a new vector that sits in its own location in the embedding space, near other city names and geographic entities.
Examples from the paper:
| Phrase | Nearest neighbors |
|---|---|
| New_York | New_York_City, Los_Angeles, Chicago, Manhattan |
| ice_cream | frozen_yogurt, gelato, dessert, sorbet |
| machine_learning | deep_learning, NLP, computer_vision, AI |
| Barack_Obama | Obama, president, White_House, Biden |
| hot_dog | hamburger, sandwich, sausage, pizza |
The analogy property extends to phrases:
This works because phrases, like words, live in the same vector space and are learned from the same distributional signal. A phrase's context defines its meaning, just like a word's context defines its meaning.
Running phrase detection on the Google News corpus (33B tokens):
| Pass | New phrases found | Example |
|---|---|---|
| 1st pass (bigrams) | ~500,000 | New_York, ice_cream, machine_learning |
| 2nd pass (trigrams) | ~200,000 | New_York_City, ice_cream_cone, New_York_Times |
| 3rd pass (4-grams) | ~50,000 | New_York_Stock_Exchange |
| 4th pass | ~10,000 | Diminishing returns |
The original vocabulary of ~1M words expanded to ~3M tokens (words + phrases). But only ~500K of these phrases appeared frequently enough to learn reliable vectors.
The paper discovered that phrase vectors support the same analogy arithmetic as word vectors:
These analogies work because phrases live in the same vector space as words and are trained with the same objective. The distributional hypothesis applies equally: "Montreal_Canadiens" appears in similar contexts to other hockey teams.
Phrases cluster near semantically related words and other phrases. "New_York" is near other cities, not near "new" or "york" individually. Click a cluster to highlight.
One of the most fascinating findings of the paper is about additive compositionality: the meaning of short phrases can sometimes be approximated by vector addition of their component words.
This seems to contradict what we just said about "New_York" ≠ vec("New") + vec("York"). So what's going on?
The paper distinguishes between non-compositional and compositional phrases:
| Type | Example | vec(phrase) ≈ sum of parts? |
|---|---|---|
| Non-compositional | "hot dog", "New York", "kick the bucket" | No — meaning unrelated to parts |
| Compositional | "strong coffee", "big city", "fast car" | Yes — meaning is roughly additive |
For compositional phrases, the paper shows that:
The paper offers an intuitive explanation rooted in the training objective. In Skip-gram with negative sampling, the training signal for a word is determined by its context. If two words w1 and w2 each independently predict a set of context words, then the vector v1 + v2 will have high dot product with context words of both w1 and w2.
The sum vec("Czech") + vec("currency") activates context words related to both Czech things and currency things. The word that best matches this intersection is "koruna" — the Czech currency. The addition acts as an AND operation in the semantic space.
Understanding when addition fails is as instructive as understanding when it works:
| Query | Expected | Actual nearest | Why it fails |
|---|---|---|---|
| not + good | bad | great | Negation is not additive; "not" doesn't negate in vector space |
| hot + dog | hot_dog | puppy | Idiomatic meaning unrelated to component words |
| red + herring | distraction | salmon | Metaphorical usage; literal meanings dominate |
| kick + bucket | die | pail | Idiom; literal combination wins |
The pattern: addition works for compositional meanings (Czech + currency = koruna) but fails for idiomatic, metaphorical, or negated meanings. This is a fundamental limitation of static word vectors — they encode distributional similarity, not compositional semantics.
In Skip-gram, the training objective maximizes the dot product between co-occurring words. If words A and B independently co-occur with a set of words S, then:
The sum vector vA + vB has high dot product with any word s that is in both S(A) and S(B):
This sum is large only when both terms are large — i.e., when s co-occurs with both A and B. This is an AND operation in the embedding space. The word whose context best matches the intersection of A's and B's contexts will have the highest similarity to vA + vB.
Why does vec("Czech") + vec("currency") ≈ vec("koruna")?
This only works because "koruna" is genuinely at the intersection of Czech-related and currency-related contexts. For non-compositional phrases like "hot dog," the intersection of hot-related and dog-related contexts gives "puppy" or "warm," not "hamburger."
The paper created a test set of ~3,000 compositional queries to evaluate phrase vectors:
| Category | Example query | Expected answer | Accuracy |
|---|---|---|---|
| Country + currency | Czech + currency | koruna | 62% |
| Country + capital | Vietnam + capital | Hanoi | 58% |
| Country + language | Japan + language | Japanese | 71% |
| City + state | Dallas + state | Texas | 49% |
| Person + field | Einstein + field | physics | 45% |
Country-language pairs work best (71% accuracy) because the relationship is consistent and well-represented in news text. City-state and person-field are harder because the associations are more ambiguous.
python # Testing compositionality def test_composition(word_a, word_b, expected, word_vectors, top_k=10): """Test if vec(a) + vec(b) ≈ vec(expected)""" target = word_vectors[word_a] + word_vectors[word_b] target = target / (np.linalg.norm(target) + 1e-10) # Find nearest neighbors to the sum sims = {} for word, vec in word_vectors.items(): if word in {word_a, word_b}: continue vec_norm = vec / (np.linalg.norm(vec) + 1e-10) sims[word] = np.dot(target, vec_norm) ranked = sorted(sims.items(), key=lambda x: -x[1])[:top_k] rank = None for i, (w, s) in enumerate(ranked): if w == expected: rank = i + 1 break print(f"{word_a} + {word_b} → {ranked[0][0]} (expected: {expected}, rank: {rank})") return rank == 1 # Correct if expected word is #1 # Example test_composition("Czech", "currency", "koruna", vectors) # Czech + currency → koruna (expected: koruna, rank: 1)
vec(A) + vec(B) ≈ vec(C). The sum of two word vectors lands near the word that captures their intersection. Try different pairs to see which compositions work.
The paper evaluates the three extensions (negative sampling, subsampling, phrases) on the word analogy task from the original Word2Vec paper, extended with a phrase analogy section.
| Method | Semantic Acc % | Syntactic Acc % | Total Acc % | Training Speed |
|---|---|---|---|---|
| Hierarchical Softmax | 53 | 55 | 54 | 1x |
| NCE (k=5) | 52 | 54 | 53 | ~1.3x |
| NEG (k=5) | 61 | 61 | 61 | ~1.5x |
| NEG (k=15) | 63 | 58 | 61 | ~1x |
Negative sampling with k=5 outperforms both hierarchical softmax and NCE on analogy accuracy, while being the fastest to train. The simpler objective (no correction terms, no tree) learns better vectors.
| Subsampling | Semantic Acc % | Syntactic Acc % | Speed |
|---|---|---|---|
| No subsampling | 56 | 54 | 1x |
| t = 10-5 | 61 | 61 | ~2.5x |
Subsampling improves both speed (2.5x) and accuracy (5+ points). This is the rare optimization that makes everything better.
The paper extends the analogy test to include phrases. Example phrase analogies:
The best model (Skip-gram, NEG k=5, subsampling t=10−5, phrase detection, d=300, trained on 33 billion words) achieved 72% accuracy on word analogies and 66% on phrase analogies.
Negative sampling achieves the best accuracy-speed tradeoff. Bubble size represents training speed (larger = faster).
The paper's findings on the optimal number of negative samples k:
| Dataset size | Recommended k | Reasoning |
|---|---|---|
| Small (<100M words) | 15-50 | Fewer data points; need more negatives per step to get sufficient gradient signal |
| Medium (100M-1B) | 5-15 | Moderate data; k=5-10 gives a good speed-quality tradeoff |
| Large (>1B words) | 2-5 | Abundance of data means each step can be cheaper; the sheer volume of steps compensates |
The intuition: with more data, you see more unique (center, context) pairs. Each pair teaches the model something. With less data, you see the same pairs repeatedly, so you need more negatives per pair to extract sufficient information from each step.
The paper tested different noise distribution exponents:
| Exponent α | Analogy accuracy % |
|---|---|
| 0 (uniform) | 52 |
| 0.25 | 56 |
| 0.5 | 58 |
| 0.75 | 61 |
| 1.0 (unigram) | 57 |
The 3/4 power consistently outperformed both uniform and pure unigram sampling. This value has since been adopted as a standard across many contrastive learning methods.
The paper's results allow us to estimate the contribution of each innovation:
| Configuration | Accuracy % | Contribution |
|---|---|---|
| Hier. softmax only | 54 | Baseline |
| + Negative sampling (k=5) | 61 | +7 points (better training signal) |
| + Subsampling (t=10−5) | 66 | +5 points (cleaner context) |
| + Phrase detection | 68 | +2 points (multi-word concepts) |
| + More data (33B vs 6B) | 72 | +4 points (scale always helps) |
Negative sampling and subsampling together contribute +12 points — more than all other improvements combined. This reinforces the paper's central message: getting the training dynamics right (what to predict, what to ignore) matters more than model architecture.
The paper found the method to be remarkably robust to hyperparameter choices:
The most sensitive hyperparameter is the learning rate and its decay schedule. Setting the initial LR too high causes divergence; too low causes slow convergence and poor vectors.
An interesting asymmetry: hierarchical softmax works best with CBOW, while negative sampling works best with Skip-gram. Why?
The two Word2Vec papers (Mikolov et al., 2013a, 2013b) are among the most cited papers in AI/NLP. Together they accumulated over 50,000 citations by 2023. Their impact went far beyond word embeddings:
The two Word2Vec papers introduced three training methods. Here is a detailed comparison of their mathematical properties:
| Property | Full Softmax | Hier. Softmax | Neg. Sampling |
|---|---|---|---|
| Objective | Cross-entropy over V | Product of log V sigmoids | 1 + k binary sigmoids |
| Exact probabilities? | Yes | Yes | No |
| Cost per step | O(V · d) | O(log V · d) | O(k · d) |
| Parameters updated | All V output vectors | log V node vectors | k + 1 output vectors |
| Good for rare words? | Yes (all words scored) | Depends on tree (longer paths for rare) | Yes (rare words appear as negatives) |
| Implicit factorization | Softmax matrix | Binary tree structure | Shifted PMI matrix |
| Best paired with | Small vocab only | CBOW | Skip-gram |
The paper's final recommendations for practitioners:
Beyond the analogy task, the paper evaluated phrase vectors on several other benchmarks:
SemEval 2012 Task 2: Measuring relational similarity (e.g., "mason is to stone as carpenter is to wood"). The phrase-enhanced model achieved higher Spearman correlation with human judgments than word-only models.
Qualitative analysis: The paper showed that phrase vectors cluster sensibly:
These clusters emerge entirely from the text — no knowledge base, no labeled data, no human annotation. The distributional hypothesis works for phrases just as it works for individual words.
Phrase detection and phrase vectors influenced subsequent work in several ways:
Looking back from 2024, negative sampling's most important contribution wasn't making Word2Vec fast — it was establishing contrastive learning as a paradigm. The insight that "learning by distinguishing positive from negative examples" is sufficient to learn useful representations has been verified in dozens of subsequent papers across NLP, vision, and multimodal AI.
Key lessons from negative sampling that generalize to modern contrastive learning:
The elegance of negative sampling is that it solves two problems at once: it makes training tractable (by avoiding the full softmax) and it provides a natural way to define what "similar" and "dissimilar" mean (through the noise distribution). This dual purpose is why the contrastive learning paradigm has proven so durable.
Word2Vec (Mikolov et al., 2013a): The original paper introducing CBOW and Skip-gram with hierarchical softmax. This follow-up provides the practical extensions that made Word2Vec the dominant word embedding method. See our companion veanor.
Noise Contrastive Estimation (Gutmann & Hyvarinen, 2010): The theoretical foundation for negative sampling. NCE provides the statistical guarantee; NEG simplifies it for the specific case of learning embeddings.
Pointwise Mutual Information: The phrase detection algorithm is based on PMI, a fundamental concept from information theory. Later work by Levy and Goldberg (2014) showed that Skip-gram with negative sampling implicitly factorizes a word-context PMI matrix.
GloVe (Pennington et al., 2014): Showed the connection between predictive methods (Word2Vec) and count-based methods (LSA). See our GloVe veanor.
FastText (Bojanowski et al., 2017): Extended Skip-gram with negative sampling to use character n-grams, enabling out-of-vocabulary word handling and better morphological representations.
Contrastive learning in deep learning: The negative sampling idea — learn by distinguishing positives from negatives — became foundational. SimCLR, MoCo, CLIP, and other contrastive methods all use the same principle: push positive pairs together and negative pairs apart, without computing a full normalization.
Word embeddings as features: Pre-trained Word2Vec/GloVe embeddings became the standard initialization for NLP models from 2013-2018, preceding the contextual embedding revolution (ELMo, BERT).
Levy and Goldberg (2014) proved a remarkable theorem about Skip-gram NEG: when trained to convergence, the word and context vectors satisfy:
where PMI(i, j) = log(P(i,j) / (P(i)P(j))) is the pointwise mutual information. This means Skip-gram NEG is implicitly factorizing a shifted PMI matrix. The "neural network" is doing matrix factorization in disguise.
This insight was hugely influential. It showed that the count-vs-predict debate was a false dichotomy — both approaches were optimizing closely related objectives. It also explained why word vectors have linear structure: PMI captures the log-odds of co-occurrence, and log-odds are additive by nature.
The negative sampling paradigm can be summarized as: "learn representations by distinguishing real data from noise." This pattern appears throughout modern deep learning:
| Method | Year | Positive pair | Negative pair |
|---|---|---|---|
| Word2Vec NEG | 2013 | (center word, context word) | (center word, random word) |
| SimCLR | 2020 | (image, augmented image) | (image, different image) |
| MoCo | 2020 | (query, matching key) | (query, queue of keys) |
| CLIP | 2021 | (image, matching text) | (image, wrong text) |
| DPO | 2023 | (prompt, preferred response) | (prompt, dispreferred response) |
Every row follows the same template: push positive pairs together, push negative pairs apart. Word2Vec NEG was the prototype.
Combining everything from both papers, the full Word2Vec training recipe is:
python # Complete Word2Vec pipeline using gensim from gensim.models import Word2Vec from gensim.models.phrases import Phrases, Phraser # Step 1: Load tokenized corpus corpus = [["new", "york", "is", "a", "big", "city"], ["ice", "cream", "is", "cold"], # ... millions more sentences] # Step 2: Phrase detection phrases = Phrases(corpus, min_count=5, threshold=100) phraser = Phraser(phrases) corpus_with_phrases = [phraser[sent] for sent in corpus] # ["new_york", "is", "a", "big", "city"] # Step 3-4: Train model = Word2Vec( corpus_with_phrases, vector_size=300, window=5, min_count=5, sg=1, # Skip-gram negative=5, # k = 5 negative samples sample=1e-5, # subsampling threshold workers=8, epochs=5, ) # Step 5: Save and use model.save("word2vec.model") model.wv.save_word2vec_format("vectors.bin", binary=True) # Use phrase vectors print(model.wv.most_similar("new_york")) # [('los_angeles', 0.82), ('chicago', 0.79), ('manhattan', 0.77), ...]