Word2Vec Negative Sampling

Chapter 0: The Bottleneck

The original Word2Vec paper showed that word vectors learned by Skip-gram produce stunning results — king − man + woman ≈ queen. But there was a computational elephant in the room: the softmax over the entire vocabulary.

Recall the Skip-gram probability:

P(w_O | w_I) = exp(v'_{w_O} · v_{w_I}) / ∑_j=1^V exp(v'_j · v_{w_I})

The numerator is cheap: one dot product. The denominator is catastrophic: V dot products, V exponentials, and a sum. For V = 1 million words, every single training example requires 1 million dot products. With a corpus of 6 billion words and window size c = 5, that is ~60 billion training pairs, each requiring ~1 million operations. This is 6 × 10¹⁶ floating-point operations — utterly infeasible.

The first paper introduced hierarchical softmax (O(log V) per step) as a solution. But this follow-up paper presents something even better for Skip-gram: negative sampling. Instead of computing a probability over all V words, you only update a small number of "negative" (incorrect) words alongside the correct one.

The Softmax Bottleneck

Each training step requires scoring ALL V words. With negative sampling, you only score k+1 words (1 positive + k negatives). Drag k to see the cost reduction.

Neg. samples k 5

The question this paper answers: Can we replace the softmax (which normalizes over all V words) with something that only touches a small constant number of words per step — without sacrificing vector quality? The answer is yes, via negative sampling. And while we're at it, can we also handle phrases ("New York", "ice cream") as single tokens? Also yes, via a clever statistical detection algorithm.

Timeline of the problem

Let's put the cost in concrete terms. Training Skip-gram with full softmax on 6 billion words:

Step	Operations	Time estimate
Per training pair	V × d = 10⁶ × 300 = 3 × 10⁸ FLOPs	~1 ms
Per word position	2c pairs × 3 × 10⁸ = 3 × 10⁹ FLOPs	~10 ms
Full corpus	6 × 10⁹ positions × 3 × 10⁹ = 1.8 × 10¹⁹ FLOPs	~570 years on 1 CPU core

With negative sampling (k=5): each training pair costs (k+1) × d = 1800 FLOPs. The full corpus processes in 6 × 10⁹ × 10 × 1800 = 1.08 × 10¹⁴ FLOPs ≈ ~1 day on a multi-core machine. That is a factor of 166,000x reduction in compute.

Three approaches to the bottleneck

The field explored three solutions, each with different tradeoffs:

Method	Cost per step	Exact?	Best for
Full softmax	O(V · d)	Yes	Small vocabularies only
Hierarchical softmax	O(log V · d)	Yes	CBOW, moderate vocab
NCE	O(k · d)	Approx (converges)	Density estimation
Negative sampling	O(k · d)	No (but better vectors)	Skip-gram, large data

The paper showed that NEG — the least theoretically rigorous method — produces the best word vectors. Sometimes practical performance matters more than theoretical guarantees.

Why is the full softmax denominator the bottleneck in Skip-gram training?

It requires computing dot products with ALL V output vectors (e.g., 1 million), at every training step, for every (center, context) pair — making the total cost proportional to V × training pairs The exponential function overflows The embedding lookup is slow

Chapter 1: Noise Contrastive Estimation

Before we get to negative sampling, we need to understand the idea it simplifies: Noise Contrastive Estimation (NCE), introduced by Gutmann and Hyvarinen (2010).

The core idea of NCE is to convert a density estimation problem (estimating P(w|context)) into a binary classification problem: given a word-context pair, is this a real pair from the data, or a fake pair generated by a noise distribution?

The setup

We have two sources of word-context pairs:

Data distribution: Real (center, context) pairs from the corpus. Labeled as positive (D = 1).
Noise distribution: Fake pairs where the context word is drawn randomly from a noise distribution P_n(w). Labeled as negative (D = 0).

For each real pair, we generate k noise samples. So the ratio of real to noise examples is 1 : k.

The NCE objective

We train a binary classifier to distinguish real from noise. Using Bayes' rule, the posterior probability that a pair is real (given the pair and the mixture ratio) is:

P(D=1 | w, c) = P(w|c) / (P(w|c) + k · P_n(w))

NCE parameterizes log P(w|c) with a model (our dot product: v'_w · v_c) and maximizes the log-likelihood of correctly classifying real vs. noise pairs.

L_NCE = log P(D=1 | w_O, w_I) + ∑_i=1^k E_{w_i ~ P_n}[log P(D=0 | w_i, w_I)]

Why NCE works: Gutmann and Hyvarinen proved that as k → ∞, the NCE estimator converges to the true data distribution. With finite k, it's an approximation — but a very good one. The key insight is that you don't need to normalize over all V words. You just need to distinguish real data from noise. The normalizing constant is implicitly learned as part of the model parameters.

NCE's guarantee

NCE is statistically consistent: it approximates the true softmax distribution. As k increases, the approximation improves. Gutmann and Hyvarinen showed that for k = V (one noise sample per vocabulary word), NCE recovers the exact maximum likelihood estimate. But the whole point is to use k ≪ V.

The NCE loss in detail

Let's write the full NCE loss for a single (center, context) pair. Denote the model's log-probability as s(w, c) = v'_w · v_c (the unnormalized score). NCE treats the normalization constant Z as a learnable parameter (or fixes it to 1). The posterior is:

P(D=1 | w, c) = exp(s(w,c)) / (exp(s(w,c)) + k · P_n(w))

The loss for one positive pair and k noise samples is:

L = −log P(D=1 | w₊, c) − ∑_i=1^k log P(D=0 | w_i, c)

= −log P(D=1 | w₊, c) − ∑_i=1^k log(1 − P(D=1 | w_i, c))

Notice that NCE needs to compute log P_n(w) for each noise sample inside the posterior. This is an extra lookup step. Negative sampling drops this entirely.

NCE: Real vs. Noise Classification

For each real (center, context) pair, generate k noise pairs by sampling random words. Train a classifier to distinguish them. More noise samples = better approximation.

Noise samples k 5

What is the core idea of Noise Contrastive Estimation?

Convert the density estimation problem into binary classification — distinguish real data pairs from noise pairs drawn from a known noise distribution, avoiding the need to normalize over all V words Add random noise to the gradients for regularization Use a neural network to estimate the noise level in the data

Chapter 2: Negative Sampling

NCE is theoretically beautiful but carries overhead: it needs to compute the noise distribution probability P_n(w) and the posterior ratio at each step. Mikolov et al. asked: for learning word vectors, do we even need the full NCE guarantee?

The answer: no. Since we only care about vector quality (not recovering the exact probability distribution), we can simplify NCE further. The result is Negative Sampling (NEG) — a stripped-down variant that works spectacularly well in practice.

From NCE to NEG

NCE's classifier computes:

P(D=1 | w, c) = σ(v'_w · v_c − log(k · P_n(w)))

Negative sampling simplifies this by dropping the log(k · P_n(w)) correction term:

P(D=1 | w, c) ≈ σ(v'_w · v_c)

This is a raw sigmoid on the dot product. No noise distribution correction. No normalization constant. Just: "are these two vectors compatible?"

Why the simplification works

The correction term log(k · P_n(w)) compensates for the fact that frequent words appear more often as noise samples. Dropping it means the model won't recover the exact softmax probabilities. But for word vector quality, this doesn't matter. The embeddings are defined by the relative geometry of vectors, not by the exact probabilities. Negative sampling learns the same geometry as full softmax — it just doesn't guarantee the same probabilities.

The trade-off: NCE is an approximation to the softmax that converges to the exact answer as k → ∞. Negative sampling is an approximation to NCE that never converges to the exact softmax, even with infinite k. But it produces vectors of equal or better quality with less computation. Sometimes the right answer is to optimize for what you actually want (good vectors) rather than a proxy (exact probabilities).

The noise distribution

Both NCE and NEG need a noise distribution P_n(w) to sample negative words from. The paper found that the unigram distribution raised to the 3/4 power works best:

P_n(w) = f(w)^3/4 / ∑_j f(j)^3/4

where f(w) is the frequency of word w in the corpus.

Why 3/4? Raw unigram sampling (exponent 1) samples too many frequent words like "the," "a," "is." Uniform sampling (exponent 0) samples too many rare words. The 3/4 power is a sweet spot that slightly over-represents rare words relative to their frequency, giving the model more informative negatives.

Noise Distribution: Unigram^α

Compare sampling probabilities under different exponents α. At α=3/4, rare words get boosted relative to raw frequency, giving more informative negatives.

Exponent α 0.75

python
import numpy as np

def build_noise_distribution(word_freqs, alpha=0.75):
    """
    word_freqs: array of word frequencies, shape (V,)
    alpha: exponent (0.75 = paper's recommended value)
    Returns: sampling probabilities, shape (V,)
    """
    # Raise frequencies to the 3/4 power
    powered = word_freqs ** alpha     # shape: (V,)

    # Normalize to get a probability distribution
    noise_dist = powered / powered.sum()

    return noise_dist

# Example: "the" is 100x more frequent than "queen"
freqs = np.array([1000, 500, 200, 50, 10])  # the, of, cat, queen, zinc
words = ["the", "of", "cat", "queen", "zinc"]

raw = freqs / freqs.sum()             # [0.568, 0.284, 0.114, 0.028, 0.006]
neg = build_noise_distribution(freqs)  # [0.425, 0.259, 0.140, 0.058, 0.018] — rare words boosted

for w, r, n in zip(words, raw, neg):
    print(f"{w:8s}  raw: {r:.3f}  neg: {n:.3f}  boost: {n/r:.2f}x")

Why does negative sampling use f(w)^3/4 instead of raw word frequency for the noise distribution?

The 3/4 exponent slightly boosts rare words relative to their frequency, providing more informative negative samples. Raw frequency over-samples common words like "the"; uniform under-samples them The 3/4 power makes the distribution more uniform It was chosen arbitrarily

Chapter 3: The NEG Objective

Now we can write the full negative sampling objective. This is the equation that replaced the softmax and made Word2Vec trainable on billions of words.

The objective

For a (center word w_I, context word w_O) pair, the NEG objective is:

L_NEG = log σ(v'_{w_O} · v_{w_I}) + ∑_i=1^k E_{w_i ~ P_n}[log σ(−v'_{w_i} · v_{w_I})]

Let's unpack each term:

Term 1: log σ(v'_{w_O} · v_{w_I}) — the positive term. The sigmoid σ(x) = 1/(1 + exp(−x)). This is high when v'_{w_O} and v_{w_I} point in similar directions (large positive dot product). The loss encourages the output vector of the true context word to be close to the input vector of the center word.

Term 2: ∑ log σ(−v'_{w_i} · v_{w_I}) — the negative terms. Note the minus sign inside the sigmoid. σ(−x) = 1 − σ(x). This is high when v'_{w_i} and v_{w_I} point in different directions (small or negative dot product). The loss encourages the output vectors of noise words to be far from the input vector of the center word.

In plain English: "Push the real context word's vector toward the center word's vector. Push k randomly chosen words' vectors away from the center word's vector." That's it. No normalization. No partition function. No sum over V. Just k+1 dot products and k+1 sigmoid operations per training step.

The gradient

The gradient with respect to the output vector v'_w is remarkably simple:

∂L/∂v'_{w_O} = (σ(v'_{w_O} · v_{w_I}) − 1) · v_{w_I}

∂L/∂v'_{w_i} = σ(v'_{w_i} · v_{w_I}) · v_{w_I}

For the positive word: the gradient is proportional to (prediction − 1) · input vector. It pushes v'_{w_O} toward v_{w_I}. For each negative word: the gradient is proportional to (prediction − 0) · input vector. It pushes v'_{w_i} away from v_{w_I}.

Computational cost

Full softmax: O(V · d) per step

Negative sampling: O(k · d) per step

For V = 1,000,000 and k = 5: that is a 200,000x speedup. The paper recommends k = 5-20 for large datasets and k = 5-50 for small datasets.

Worked example: one NEG step

Let d = 3, center = "sat" with v_sat = [0.5, −0.2, 0.3]. True context = "cat" with v'_cat = [0.1, 0.4, −0.1]. Negative sample = "tree" with v'_tree = [−0.3, 0.1, 0.6].

Positive term:

dot₊ = v'_cat · v_sat = 0.05 − 0.08 − 0.03 = −0.06

σ(−0.06) = 0.485

L₊ = −log(0.485) = 0.724

Gradient: (σ(−0.06) − 1) · v_sat = −0.515 · [0.5, −0.2, 0.3] = [−0.258, 0.103, −0.155]. This pushes v'_cat toward v_sat.

Negative term:

dot₋ = v'_tree · v_sat = −0.15 − 0.02 + 0.18 = 0.01

σ(0.01) = 0.5025

L₋ = −log(1 − 0.5025) = −log(0.4975) = 0.698

Gradient: σ(0.01) · v_sat = 0.5025 · [0.5, −0.2, 0.3] = [0.251, −0.101, 0.151]. This pushes v'_tree away from v_sat.

Total loss: L = 0.724 + 0.698 = 1.422. After updating with learning rate 0.1, v'_cat moves 0.026 closer to v_sat and v'_tree moves 0.025 away.

Negative Sampling Training Step

For center word "sat", the true context "cat" is pushed closer (green arrows), while k random negative words are pushed away (red arrows). Click "Train Step" to see an update.

k 5

Step 0

python
import numpy as np

def neg_sampling_loss(center_id, context_id, neg_ids, W_in, W_out):
    """
    center_id:  int — center word
    context_id: int — true context word (positive)
    neg_ids:    list of int — k negative samples
    W_in:  (V, d) — input embeddings
    W_out: (V, d) — output embeddings
    Returns: loss, gradient for W_in[center_id], gradients for W_out rows
    """
    v_c = W_in[center_id]            # (d,) — center embedding

    # Positive term: push context word toward center
    v_pos = W_out[context_id]        # (d,)
    dot_pos = v_pos @ v_c            # scalar
    sig_pos = 1 / (1 + np.exp(-dot_pos))
    loss = -np.log(sig_pos + 1e-10)

    # Gradient for positive output vector
    grad_pos = (sig_pos - 1) * v_c   # (d,) — points toward center
    grad_center = (sig_pos - 1) * v_pos

    # Negative terms: push noise words away from center
    for neg_id in neg_ids:
        v_neg = W_out[neg_id]         # (d,)
        dot_neg = v_neg @ v_c         # scalar
        sig_neg = 1 / (1 + np.exp(-dot_neg))
        loss -= np.log(1 - sig_neg + 1e-10)

        # Gradient: push this negative word AWAY from center
        grad_neg = sig_neg * v_c      # (d,)
        W_out[neg_id] -= lr * grad_neg
        grad_center += sig_neg * v_neg

    # Update
    W_out[context_id] -= lr * grad_pos
    W_in[center_id] -= lr * grad_center

    return loss

# Total cost per step: (k+1) dot products + (k+1) sigmoid operations
# For k=5, d=300: 6 * 300 = 1,800 FLOPs vs. 1,000,000 * 300 for softmax

Complete PyTorch implementation

python
import torch
import torch.nn as nn
import numpy as np
from collections import Counter

class Word2VecDataset(torch.utils.data.Dataset):
    def __init__(self, corpus, vocab, window=5, k=5, t=1e-5):
        """
        corpus: list of list of int (tokenized sentences)
        vocab: dict mapping word -> index
        """
        self.k = k
        self.pairs = []

        # Build frequency table for subsampling + noise dist
        freq = Counter()
        total = 0
        for sent in corpus:
            for w in sent:
                freq[w] += 1
                total += 1

        # Subsampling probabilities
        self.keep_prob = {}
        for w, c in freq.items():
            f = c / total
            self.keep_prob[w] = np.sqrt(t / f) if f > t else 1.0

        # Noise distribution: f(w)^(3/4)
        freqs = np.array([freq.get(i, 0) for i in range(len(vocab))])
        powered = freqs ** 0.75
        self.noise_dist = powered / powered.sum()

        # Generate training pairs with subsampling
        for sent in corpus:
            # Apply subsampling
            filtered = [w for w in sent if np.random.random() < self.keep_prob.get(w, 1)]
            for i, center in enumerate(filtered):
                for j in range(max(0, i-window), min(len(filtered), i+window+1)):
                    if i != j:
                        self.pairs.append((center, filtered[j]))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        center, context = self.pairs[idx]
        # Sample k negatives from noise distribution
        negatives = np.random.choice(len(self.noise_dist), size=self.k, p=self.noise_dist)
        return (torch.tensor(center), torch.tensor(context),
                torch.tensor(negatives, dtype=torch.long))


class SkipGramNEG(nn.Module):
    def __init__(self, V, d=300):
        super().__init__()
        self.W_in = nn.Embedding(V, d)
        self.W_out = nn.Embedding(V, d)
        nn.init.uniform_(self.W_in.weight, -0.5/d, 0.5/d)
        nn.init.zeros_(self.W_out.weight)

    def forward(self, center, context, negatives):
        v_c = self.W_in(center)                    # (B, d)
        v_pos = self.W_out(context)                # (B, d)
        v_neg = self.W_out(negatives)              # (B, k, d)

        pos_score = (v_c * v_pos).sum(1)           # (B,)
        pos_loss = -torch.nn.functional.logsigmoid(pos_score)

        neg_score = torch.bmm(v_neg, v_c.unsqueeze(2)).squeeze(2)  # (B, k)
        neg_loss = -torch.nn.functional.logsigmoid(-neg_score).sum(1)

        return (pos_loss + neg_loss).mean()

# Training loop
model = SkipGramNEG(V=10000, d=300)
optimizer = torch.optim.SparseAdam(model.parameters(), lr=0.025)

for epoch in range(5):
    for center, context, negatives in dataloader:
        loss = model(center, context, negatives)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Extract word vectors
word_vectors = model.W_in.weight.detach().numpy()  # (V, 300)

Sparse updates matter. In Skip-gram NEG, each training step only modifies k+2 embedding rows (1 center + 1 positive + k negatives). PyTorch's SparseAdam exploits this sparsity for faster training. Standard Adam would update all V rows due to how momentum works, wasting computation on untouched embeddings.

The learning rate schedule

The original Word2Vec implementation uses a linear learning rate decay: the learning rate starts at 0.025 and decreases linearly to near zero over the course of training. This is unusual — most modern systems use constant LR with warm-up, or cosine decay. But linear decay works well for Word2Vec because:

The model is shallow (no layers to coordinate), so aggressive early learning is fine
Later in training, the vectors are already well-positioned, and we want fine adjustments
The single-pass-through-corpus training makes linear decay a natural choice — the LR tracks progress through the data

python
# Linear learning rate decay in Word2Vec
def get_lr(current_word, total_words, initial_lr=0.025, min_lr=0.0001):
    """Linear decay from initial_lr to min_lr over the corpus."""
    progress = current_word / total_words
    return max(min_lr, initial_lr * (1 - progress))

# At the start:  lr = 0.025 (big steps, fast learning)
# At 50% done:   lr = 0.0125 (moderate steps)
# At 99% done:   lr = 0.00025 (tiny adjustments)

Multi-threaded training

The original C implementation uses Hogwild-style asynchronous SGD: multiple threads read from the same corpus file at different positions and update the shared embedding matrices without locks. This sounds chaotic — won't threads overwrite each other's updates? In practice, it works because:

Each step only touches k+2 out of V rows. With V = 1M and k = 5, the collision probability is ~7/1M ≈ 0.0007% per step.
Even when collisions occur, they just add noise to the gradient, which SGD handles gracefully.
The speedup from N threads is nearly linear: 8 threads ≈ 8x faster.

Memory layout for performance

The original C implementation stores word vectors as a contiguous float array of size V × d. Each vector lookup is pointer arithmetic: syn0 + word_id * d. No hash table, no indirection. For V = 1M and d = 300 in float32, the two embedding matrices total 2 × 1M × 300 × 4 = 2.4 GB — comfortably fitting in RAM on 2013 hardware.

Modern PyTorch implementations use nn.Embedding which has the same memory layout but adds autograd bookkeeping. For maximum training speed, the original C code outperforms PyTorch by ~3-5x due to avoiding Python overhead and using raw BLAS operations.

The word2vec.c implementation

The original implementation by Mikolov is a single C file of ~650 lines. Key design decisions:

File-based parallel processing: Each thread reads from a different offset in the same file. No message passing needed.
Linear learning rate decay: LR decreases linearly from 0.025 to near-zero over the corpus. Global counter tracks progress.
Precomputed sigmoid table: A lookup table of 1000 precomputed sigmoid values avoids calling exp() at runtime. Lookups are faster than computing exp().
Unigram table: 100M-entry array for O(1) negative sampling (no cumulative distribution needed).

c
// Simplified core of word2vec.c (Skip-gram with NEG)
// The actual training loop for one (center, context) pair

// Positive example: push context toward center
for (c = 0; c < layer1_size; c++)
  neu1e[c] = 0;

for (d = 0; d < negative + 1; d++) {
  if (d == 0) {
    target = word;    // positive: true context word
    label = 1;
  } else {
    target = table[rand() % table_size];  // O(1) negative sample
    label = 0;
  }
  l2 = target * layer1_size;  // pointer to output vector
  f = 0;
  for (c = 0; c < layer1_size; c++)
    f += syn0[l1 + c] * syn1neg[l2 + c];  // dot product
  g = (label - expTable[sigmoid_index(f)]) * alpha;  // gradient
  for (c = 0; c < layer1_size; c++)
    neu1e[c] += g * syn1neg[l2 + c];  // accumulate input gradient
  for (c = 0; c < layer1_size; c++)
    syn1neg[l2 + c] += g * syn0[l1 + c];  // update output vector
}
// Update input vector
for (c = 0; c < layer1_size; c++)
  syn0[l1 + c] += neu1e[c];

What does the negative sampling objective do in plain English?

For each real (center, context) pair, push the context word's vector toward the center word's vector, while pushing k randomly sampled negative words' vectors away — using only k+1 dot products instead of V Sample k positive words and average their gradients Compute a weighted sum over all vocabulary words

Chapter 4: Subsampling Frequent Words

In a typical English corpus, words like "the," "a," "in," and "is" appear millions of times. The word "the" might account for 7% of all tokens. This creates two problems for Word2Vec:

Wasted computation. The pair ("the", "cat") appears thousands of times. After the first few hundred, the model has learned everything it can from this pair. The remaining thousands of examples contribute almost zero gradient but still cost compute.
Semantic dilution. "The" appears in the context of every word. Its context is so broad that it carries almost no information about any specific word. Including "the" in every window adds noise rather than signal.

The paper's solution: randomly discard frequent words during training, with a probability that depends on word frequency.

The subsampling formula

Each word w in the training corpus is discarded with probability:

P(discard w) = 1 − √(t / f(w))

where f(w) is the frequency of word w (fraction of total tokens) and t is a threshold, typically t = 10⁻⁵.

How this works in practice

Consider a word with frequency f(w) = 0.01 (1% of corpus, like "of") and threshold t = 10⁻⁵:

P(discard "of") = 1 − √(10⁻⁵ / 0.01) = 1 − √0.001 = 1 − 0.032 = 0.968

So 96.8% of occurrences of "of" are discarded! Only 3.2% survive.

For a word with frequency f(w) = 10⁻⁵ (rare word, same as threshold):

P(discard rare) = 1 − √(10⁻⁵ / 10⁻⁵) = 1 − 1 = 0

Rare words are never discarded.

Why √ and not linear? The square root creates a smooth transition. A linear discard (proportional to frequency) would be too aggressive — it would discard most occurrences of moderately common words. The square root is gentler: even very common words retain a meaningful fraction of their occurrences, but they no longer dominate the training data.

Subsampling by Frequency

The bar chart shows how many occurrences survive subsampling. Frequent words (left) are aggressively pruned; rare words (right) are untouched. Drag the threshold t.

Threshold t 10^-5

python
import numpy as np

def subsample(word_freqs, threshold=1e-5):
    """
    word_freqs: dict mapping word -> frequency (fraction of total tokens)
    Returns: dict mapping word -> keep probability
    """
    keep_probs = {}
    for word, freq in word_freqs.items():
        if freq > threshold:
            keep_probs[word] = np.sqrt(threshold / freq)
        else:
            keep_probs[word] = 1.0  # rare words: always keep
    return keep_probs

# Example
freqs = {"the": 0.07, "of": 0.03, "cat": 0.0001, "queen": 0.00001}
keep = subsample(freqs)
# the: 1.2% kept, of: 1.8% kept, cat: 31.6% kept, queen: 100% kept

The double benefit

Subsampling helps in two ways simultaneously:

Speed: The effective corpus size shrinks dramatically. With aggressive subsampling, training is 2-10x faster.
Quality: By removing "the" from most context windows, the remaining context words are more informative. The paper reports that subsampling improved analogy accuracy by several percentage points — it's not just faster, it's better.

Worked example: effect on context windows

Original sentence: "the cat sat on the fluffy mat"

With t = 10⁻⁵, most occurrences of "the" and "on" are discarded. A likely surviving version:

After subsampling: "cat sat fluffy mat"

Now with window c = 2, the context of "sat" is {cat, fluffy, mat} instead of {the, cat, on, the}. Every context word is meaningful. "The" and "on" added only noise — they co-occur with everything and carry no discriminative information.

Before and After Subsampling

Frequent words like "the" and "on" are removed, leaving only informative context words. The effective window now connects content words that were previously separated by function words.

Effective window expansion

Subsampling has an additional hidden benefit: it effectively increases the context window. When frequent words are removed, distant words that were outside the original window now become visible. If "the" between "cat" and "fluffy" is removed, those words become adjacent, effectively making c larger for rare, informative words.

Numbers from the paper: On the Google News corpus (6B tokens), subsampling with t = 10⁻⁵ discarded 78% of "the" occurrences but only 0.3% of "queen." The effective corpus shrank from 6B to ~2.5B tokens, while analogy accuracy improved from 56% to 61%. Training time dropped proportionally — a rare "free lunch" in ML.

Subsampling formula derivation

The subsampling formula P(discard) = 1 − √(t/f(w)) was chosen empirically, but it has an intuitive derivation. We want a keep probability that:

Is 1 (always keep) when f(w) ≤ t
Decreases as f(w) increases above t
Never reaches 0 (even "the" is kept sometimes)
Keeps a roughly constant number of each word's occurrences in the effective corpus

The √(t/f) formula approximately satisfies the last condition. If word w appears N · f(w) times in a corpus of N tokens, after subsampling it appears approximately:

N · f(w) · √(t / f(w)) = N · √(t · f(w))

For very frequent words (f ≫ t), the surviving count grows as √f — much slower than linear f. For a word appearing 10⁷ times, only about 10^3.5 ≈ 3,162 occurrences survive. This is still enough to learn a good vector, but doesn't waste compute on redundant examples.

python
# Visualizing the effect of subsampling on a Zipfian distribution
import numpy as np

# Zipf's law: frequency of rank-r word ≈ C/r
V = 100000
ranks = np.arange(1, V+1)
freqs = 1.0 / ranks
freqs = freqs / freqs.sum()  # normalize to probabilities

t = 1e-5
keep_probs = np.sqrt(t / freqs)
keep_probs = np.clip(keep_probs, 0, 1)

# Top 10 words: original vs. effective frequency
for i in range(10):
    print(f"Rank {i+1}: freq={freqs[i]:.6f} keep={keep_probs[i]:.4f} effective={freqs[i]*keep_probs[i]:.6f}")

# Output shows that rank-1 word goes from 8.3% to 0.03% of effective corpus
# while rank-1000 word goes from 0.008% to 0.003% — much less dramatic

Why does subsampling frequent words actually improve word vector quality, not just speed up training?

Removing ubiquitous words like "the" from context windows makes the remaining context words more informative, reducing noise in the training signal It increases the embedding dimension It allows using a larger vocabulary

Chapter 5: Phrase Detection

Word2Vec treats every token as atomic, but many natural concepts span multiple words. "New York" is not the sum of "New" and "York." "Ice cream" has nothing to do with "ice" or "cream" individually. "Machine learning" is a concept, not two separate words.

The paper introduces a simple, elegant algorithm to automatically detect phrases in the corpus and treat them as single tokens.

The phrase scoring function

For each pair of adjacent words (w_i, w_j), compute a score based on pointwise mutual information (PMI):

score(w_i, w_j) = (count(w_i, w_j) − δ) / (count(w_i) · count(w_j))

Where:

count(w_i, w_j) is how often the bigram appears in the corpus
count(w_i), count(w_j) are individual word counts
δ is a discounting threshold that prevents very infrequent combinations from forming phrases

The intuition

If "New" appears 500,000 times and "York" appears 200,000 times in a corpus of 1 billion words, random chance would predict them appearing together roughly (500,000/10⁹) × (200,000/10⁹) × 10⁹ = 100 times. If "New York" actually appears 150,000 times, it's occurring 1,500x more often than chance predicts. That is a phrase.

Meanwhile, "this is" might appear 500,000 times, but "this" appears 10 million times and "is" appears 8 million times. By chance alone, we'd expect (10M/10⁹) × (8M/10⁹) × 10⁹ = 80,000. So "this is" is only ~6x above chance. Not a phrase.

Iterative detection

The algorithm runs multiple passes. In the first pass, it detects bigrams ("New_York", "ice_cream"). In the second pass, it can detect trigrams by combining bigrams with adjacent words ("New_York_City"). In practice, 2-4 passes capture most useful phrases.

Worked example: scoring bigrams

Corpus size: 1 billion words. Consider two candidate phrases:

"New York":

count("New") = 500,000, count("York") = 200,000, count("New York") = 150,000

score = (150,000 − 5) / (500,000 × 200,000) × 10⁹ = 150,000 / 10¹¹ × 10⁹ = 1,500

Score of 1,500 is far above any reasonable threshold. "New York" is definitely a phrase.

"this is":

count("this") = 10,000,000, count("is") = 8,000,000, count("this is") = 500,000

score = (500,000 − 5) / (10,000,000 × 8,000,000) × 10⁹ = 500,000 / 8 × 10¹³ × 10⁹ = 6.25

Score of 6.25 is well below typical thresholds (δ ≈ 100). "This is" is just two common words that happen to co-occur.

The δ discounting parameter

The δ parameter in the numerator (count(w_i, w_j) − δ) serves as a minimum count threshold. If a bigram appears fewer than δ times, the numerator becomes negative and the pair is automatically rejected. This prevents statistical flukes from creating spurious phrases. The paper uses δ = 5 as a default.

The δ parameter: The discounting threshold δ prevents rare word pairs from being treated as phrases. If two rare words happen to appear together once or twice, their PMI will be very high (since their expected co-occurrence is near zero) but the evidence is too weak to justify calling it a phrase. δ effectively requires a minimum count before a bigram can be considered.

Phrase Detection via PMI

Bigram scores based on pointwise mutual information. High scores (teal) indicate genuine phrases; low scores (dim) are coincidental word pairs. Drag the threshold to see which bigrams qualify.

Threshold 50

python
def detect_phrases(corpus, min_count=5, threshold=100, delta=5):
    """
    corpus: list of list of str (tokenized sentences)
    Returns: corpus with phrases joined by '_'
    """
    from collections import Counter

    # Count unigrams and bigrams
    unigrams = Counter()
    bigrams = Counter()
    total = 0

    for sentence in corpus:
        for i, word in enumerate(sentence):
            unigrams[word] += 1
            total += 1
            if i < len(sentence) - 1:
                bigrams[(word, sentence[i+1])] += 1

    # Score each bigram
    phrases = set()
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        score = (count - delta) / (unigrams[w1] * unigrams[w2]) * total
        if score > threshold:
            phrases.add((w1, w2))

    # Replace bigrams with phrases
    new_corpus = []
    for sentence in corpus:
        new_sent = []
        i = 0
        while i < len(sentence):
            if i < len(sentence) - 1 and (sentence[i], sentence[i+1]) in phrases:
                new_sent.append(sentence[i] + "_" + sentence[i+1])
                i += 2
            else:
                new_sent.append(sentence[i])
                i += 1
        new_corpus.append(new_sent)

    return new_corpus

How does the phrase detection algorithm decide that "New York" is a phrase but "this is" is not?

It uses pointwise mutual information: "New York" appears far more often than predicted by the independent frequencies of "New" and "York," while "this is" appears roughly as often as expected by chance It uses a dictionary of known phrases It uses the cosine similarity of the word vectors

Chapter 6: Phrase Vectors

Once phrases are detected and the corpus is preprocessed (replacing "New" "York" with "New_York"), we simply train Word2Vec as before. The phrase tokens are treated exactly like any other word. They get their own embedding vectors, learned from context just like single words.

What phrase vectors learn

The beautiful thing is that phrase vectors capture meaning that is not the sum of their parts. The vector for "New_York" is not simply vec("New") + vec("York"). It's a new vector that sits in its own location in the embedding space, near other city names and geographic entities.

Examples from the paper:

Phrase	Nearest neighbors
New_York	New_York_City, Los_Angeles, Chicago, Manhattan
ice_cream	frozen_yogurt, gelato, dessert, sorbet
machine_learning	deep_learning, NLP, computer_vision, AI
Barack_Obama	Obama, president, White_House, Biden
hot_dog	hamburger, sandwich, sausage, pizza

Phrase analogies

The analogy property extends to phrases:

vec("Montreal_Canadiens") − vec("Montreal") + vec("Toronto") ≈ vec("Toronto_Maple_Leafs")

vec("German") − vec("Germany") + vec("France") ≈ vec("French")

This works because phrases, like words, live in the same vector space and are learned from the same distributional signal. A phrase's context defines its meaning, just like a word's context defines its meaning.

Vocabulary expansion: Phrase detection dramatically increases the effective vocabulary without increasing model complexity. A corpus with 1M unique words might yield 2M unique tokens (words + phrases) after phrase detection. Each token gets its own d-dimensional vector. The embedding matrix grows from (1M, d) to (2M, d), but the per-step cost stays O(k · d) with negative sampling — independent of vocabulary size.

Phrase detection statistics

Running phrase detection on the Google News corpus (33B tokens):

Pass	New phrases found	Example
1st pass (bigrams)	~500,000	New_York, ice_cream, machine_learning
2nd pass (trigrams)	~200,000	New_York_City, ice_cream_cone, New_York_Times
3rd pass (4-grams)	~50,000	New_York_Stock_Exchange
4th pass	~10,000	Diminishing returns

The original vocabulary of ~1M words expanded to ~3M tokens (words + phrases). But only ~500K of these phrases appeared frequently enough to learn reliable vectors.

Phrase analogies work too

The paper discovered that phrase vectors support the same analogy arithmetic as word vectors:

vec("Montreal_Canadiens") − vec("Montreal") + vec("Toronto") ≈ vec("Toronto_Maple_Leafs")

vec("Steve_Ballmer") − vec("Microsoft") + vec("Apple") ≈ vec("Steve_Jobs")

vec("New_Zealand") − vec("Wellington") + vec("Canberra") ≈ vec("Australia")

These analogies work because phrases live in the same vector space as words and are trained with the same objective. The distributional hypothesis applies equally: "Montreal_Canadiens" appears in similar contexts to other hockey teams.

Phrase Embedding Space

Phrases cluster near semantically related words and other phrases. "New_York" is near other cities, not near "new" or "york" individually. Click a cluster to highlight.

Why is the phrase vector for "New_York" different from vec("New") + vec("York")?

"New_York" is treated as a single token with its own embedding, learned from its own context (other city names, geographic entities), which captures the meaning of the whole phrase — not the compositional meaning of its parts The model subtracts the common component Phrase vectors are computed using a different model

Chapter 7: Compositionality

One of the most fascinating findings of the paper is about additive compositionality: the meaning of short phrases can sometimes be approximated by vector addition of their component words.

This seems to contradict what we just said about "New_York" ≠ vec("New") + vec("York"). So what's going on?

Two types of phrases

The paper distinguishes between non-compositional and compositional phrases:

Type	Example	vec(phrase) ≈ sum of parts?
Non-compositional	"hot dog", "New York", "kick the bucket"	No — meaning unrelated to parts
Compositional	"strong coffee", "big city", "fast car"	Yes — meaning is roughly additive

For compositional phrases, the paper shows that:

vec("Czech") + vec("currency") ≈ vec("koruna")

vec("Vietnam") + vec("capital") ≈ vec("Hanoi")

Why does addition work for compositional meaning?

The paper offers an intuitive explanation rooted in the training objective. In Skip-gram with negative sampling, the training signal for a word is determined by its context. If two words w₁ and w₂ each independently predict a set of context words, then the vector v₁ + v₂ will have high dot product with context words of both w₁ and w₂.

The sum vec("Czech") + vec("currency") activates context words related to both Czech things and currency things. The word that best matches this intersection is "koruna" — the Czech currency. The addition acts as an AND operation in the semantic space.

Limitation: Additive compositionality is approximate and breaks down for long phrases, idiomatic expressions, and negation. vec("not good") ≠ vec("bad"). vec("kicked the bucket") ≠ vec("died"). These require more sophisticated composition functions — which later models like LSTMs and Transformers would provide. Word2Vec's additive compositionality is a beautiful special case, not a general theory of meaning composition.

Compositionality failure modes

Understanding when addition fails is as instructive as understanding when it works:

Query	Expected	Actual nearest	Why it fails
not + good	bad	great	Negation is not additive; "not" doesn't negate in vector space
hot + dog	hot_dog	puppy	Idiomatic meaning unrelated to component words
red + herring	distraction	salmon	Metaphorical usage; literal meanings dominate
kick + bucket	die	pail	Idiom; literal combination wins

The pattern: addition works for compositional meanings (Czech + currency = koruna) but fails for idiomatic, metaphorical, or negated meanings. This is a fundamental limitation of static word vectors — they encode distributional similarity, not compositional semantics.

Why addition works: a formal argument

In Skip-gram, the training objective maximizes the dot product between co-occurring words. If words A and B independently co-occur with a set of words S, then:

v_A · v'_s ≈ log P(s | A) for s ∈ S(A)

v_B · v'_s ≈ log P(s | B) for s ∈ S(B)

The sum vector v_A + v_B has high dot product with any word s that is in both S(A) and S(B):

(v_A + v_B) · v'_s = v_A · v'_s + v_B · v'_s

This sum is large only when both terms are large — i.e., when s co-occurs with both A and B. This is an AND operation in the embedding space. The word whose context best matches the intersection of A's and B's contexts will have the highest similarity to v_A + v_B.

Worked example

Why does vec("Czech") + vec("currency") ≈ vec("koruna")?

vec("Czech") has high dot products with: Prague, Bohemia, Slovakia, Europe, koruna, Czech_Republic
vec("currency") has high dot products with: dollar, euro, yen, pound, koruna, exchange
The intersection — words that score high with both — includes "koruna" most prominently
So the word nearest to vec("Czech") + vec("currency") is "koruna"

This only works because "koruna" is genuinely at the intersection of Czech-related and currency-related contexts. For non-compositional phrases like "hot dog," the intersection of hot-related and dog-related contexts gives "puppy" or "warm," not "hamburger."

Compositionality test set

The paper created a test set of ~3,000 compositional queries to evaluate phrase vectors:

Category	Example query	Expected answer	Accuracy
Country + currency	Czech + currency	koruna	62%
Country + capital	Vietnam + capital	Hanoi	58%
Country + language	Japan + language	Japanese	71%
City + state	Dallas + state	Texas	49%
Person + field	Einstein + field	physics	45%

Country-language pairs work best (71% accuracy) because the relationship is consistent and well-represented in news text. City-state and person-field are harder because the associations are more ambiguous.

python
# Testing compositionality
def test_composition(word_a, word_b, expected, word_vectors, top_k=10):
    """Test if vec(a) + vec(b) ≈ vec(expected)"""
    target = word_vectors[word_a] + word_vectors[word_b]
    target = target / (np.linalg.norm(target) + 1e-10)

    # Find nearest neighbors to the sum
    sims = {}
    for word, vec in word_vectors.items():
        if word in {word_a, word_b}:
            continue
        vec_norm = vec / (np.linalg.norm(vec) + 1e-10)
        sims[word] = np.dot(target, vec_norm)

    ranked = sorted(sims.items(), key=lambda x: -x[1])[:top_k]
    rank = None
    for i, (w, s) in enumerate(ranked):
        if w == expected:
            rank = i + 1
            break

    print(f"{word_a} + {word_b} → {ranked[0][0]} (expected: {expected}, rank: {rank})")
    return rank == 1  # Correct if expected word is #1

# Example
test_composition("Czech", "currency", "koruna", vectors)
# Czech + currency → koruna (expected: koruna, rank: 1)

Additive Composition Explorer

vec(A) + vec(B) ≈ vec(C). The sum of two word vectors lands near the word that captures their intersection. Try different pairs to see which compositions work.

Why does vec("Czech") + vec("currency") approximate vec("koruna")?

The sum activates context words related to both Czech things and currency things. "Koruna" (the Czech currency) has the highest dot product with this sum because it appears in contexts of both "Czech" and "currency" The model was trained with "koruna" as a target for the pair "Czech currency" Vector addition always produces the semantically correct result

Chapter 8: Results

The paper evaluates the three extensions (negative sampling, subsampling, phrases) on the word analogy task from the original Word2Vec paper, extended with a phrase analogy section.

Negative sampling vs. alternatives

Method	Semantic Acc %	Syntactic Acc %	Total Acc %	Training Speed
Hierarchical Softmax	53	55	54	1x
NCE (k=5)	52	54	53	~1.3x
NEG (k=5)	61	61	61	~1.5x
NEG (k=15)	63	58	61	~1x

Negative sampling with k=5 outperforms both hierarchical softmax and NCE on analogy accuracy, while being the fastest to train. The simpler objective (no correction terms, no tree) learns better vectors.

Effect of subsampling

Subsampling	Semantic Acc %	Syntactic Acc %	Speed
No subsampling	56	54	1x
t = 10^-5	61	61	~2.5x

Subsampling improves both speed (2.5x) and accuracy (5+ points). This is the rare optimization that makes everything better.

Phrase analogy results

The paper extends the analogy test to include phrases. Example phrase analogies:

City analogy

Montreal_Canadiens : Montreal :: Toronto : Toronto_Maple_Leafs

CEO analogy

Steve_Ballmer : Microsoft :: Tim_Cook : Apple

Country analogy

New_Zealand : Wellington :: Australia : Canberra

The best model (Skip-gram, NEG k=5, subsampling t=10⁻⁵, phrase detection, d=300, trained on 33 billion words) achieved 72% accuracy on word analogies and 66% on phrase analogies.

Method Comparison: Accuracy vs. Training Speed

Negative sampling achieves the best accuracy-speed tradeoff. Bubble size represents training speed (larger = faster).

Training scale: The final model was trained on an internal Google News dataset of 33 billion words — 5x larger than the original paper's 6B. With negative sampling (k=5) and subsampling (t=10^-5), this massive dataset could be processed in a single day on a multi-core machine. The combination of all three techniques (NEG + subsampling + phrases) increased analogy accuracy from 54% to 72% — a relative improvement of 33%.

Choosing k: how many negative samples?

The paper's findings on the optimal number of negative samples k:

Dataset size	Recommended k	Reasoning
Small (<100M words)	15-50	Fewer data points; need more negatives per step to get sufficient gradient signal
Medium (100M-1B)	5-15	Moderate data; k=5-10 gives a good speed-quality tradeoff
Large (>1B words)	2-5	Abundance of data means each step can be cheaper; the sheer volume of steps compensates

The intuition: with more data, you see more unique (center, context) pairs. Each pair teaches the model something. With less data, you see the same pairs repeatedly, so you need more negatives per pair to extract sufficient information from each step.

The 3/4 exponent experiment

The paper tested different noise distribution exponents:

Exponent α	Analogy accuracy %
0 (uniform)	52
0.25	56
0.5	58
0.75	61
1.0 (unigram)	57

The 3/4 power consistently outperformed both uniform and pure unigram sampling. This value has since been adopted as a standard across many contrastive learning methods.

Ablation study: contribution of each trick

The paper's results allow us to estimate the contribution of each innovation:

Configuration	Accuracy %	Contribution
Hier. softmax only	54	Baseline
+ Negative sampling (k=5)	61	+7 points (better training signal)
+ Subsampling (t=10⁻⁵)	66	+5 points (cleaner context)
+ Phrase detection	68	+2 points (multi-word concepts)
+ More data (33B vs 6B)	72	+4 points (scale always helps)

Negative sampling and subsampling together contribute +12 points — more than all other improvements combined. This reinforces the paper's central message: getting the training dynamics right (what to predict, what to ignore) matters more than model architecture.

Sensitivity to hyperparameters

The paper found the method to be remarkably robust to hyperparameter choices:

k (negative samples): k=5 and k=15 give nearly identical accuracy (61% vs. 61%). Only k=1 degrades noticeably.
t (subsampling): t = 10⁻⁵ is optimal, but t = 10⁻⁴ and t = 10⁻⁶ are within 1-2 points.
Window c: c=5 and c=10 differ by ~2 points, with the optimal depending on whether the task is syntactic or semantic.
Dimension d: d=300 and d=1000 differ by ~4 points, with diminishing returns above 300.

The most sensitive hyperparameter is the learning rate and its decay schedule. Setting the initial LR too high causes divergence; too low causes slow convergence and poor vectors.

Why NEG beats hierarchical softmax on Skip-gram

An interesting asymmetry: hierarchical softmax works best with CBOW, while negative sampling works best with Skip-gram. Why?

CBOW + HS: CBOW averages context vectors into a single hidden vector, which is then used to traverse the tree. The averaging creates a smooth, well-defined input for the tree's binary classifiers.
Skip-gram + HS: Skip-gram uses a single center word to predict each context word independently. The tree must encode fine-grained word distinctions from a single input vector, which is harder.
Skip-gram + NEG: Negative sampling's binary classification task (is this a real pair?) is well-matched to Skip-gram's one-to-one pair structure. Each (center, context) pair gets its own positive/negative discrimination, which provides a cleaner gradient signal than navigating a shared tree.

The impact on NLP research 2013-2018

The two Word2Vec papers (Mikolov et al., 2013a, 2013b) are among the most cited papers in AI/NLP. Together they accumulated over 50,000 citations by 2023. Their impact went far beyond word embeddings:

Pre-training paradigm: The idea of unsupervised pre-training followed by task-specific fine-tuning became the dominant ML paradigm. ELMo, BERT, GPT all follow this template.
Transfer learning for NLP: Before Word2Vec, NLP models were trained from scratch for each task. After Word2Vec, starting with pre-trained representations became standard — cutting training time and improving performance on small datasets.
Embedding as a primitive: The concept of "embedding" — mapping discrete objects to continuous vectors — became a universal building block. User embeddings, product embeddings, graph embeddings, protein sequence embeddings — all trace back to Word2Vec.
Simple > complex: Word2Vec showed that shallow models trained on massive data can outperform deep models on small data. This influenced later scale-focused approaches like GPT-3.

The timeline of influence: Word2Vec (2013) → GloVe (2014) → FastText (2017) → ELMo (2018) → BERT (2018) → GPT-2 (2019) → GPT-3 (2020) → ChatGPT (2022). The thread connecting them all: prediction tasks on raw text yield useful representations. Word2Vec proved the concept; each successor scaled it up.

Comparison: the three Word2Vec training objectives

The two Word2Vec papers introduced three training methods. Here is a detailed comparison of their mathematical properties:

Property	Full Softmax	Hier. Softmax	Neg. Sampling
Objective	Cross-entropy over V	Product of log V sigmoids	1 + k binary sigmoids
Exact probabilities?	Yes	Yes	No
Cost per step	O(V · d)	O(log V · d)	O(k · d)
Parameters updated	All V output vectors	log V node vectors	k + 1 output vectors
Good for rare words?	Yes (all words scored)	Depends on tree (longer paths for rare)	Yes (rare words appear as negatives)
Implicit factorization	Softmax matrix	Binary tree structure	Shifted PMI matrix
Best paired with	Small vocab only	CBOW	Skip-gram

Practical recommendations from the paper

The paper's final recommendations for practitioners:

Use Skip-gram + negative sampling (k=5) for best quality on most tasks
Use subsampling with t = 10⁻⁵ — it's free performance and speed
Use phrase detection (2 passes) if your task involves multi-word entities
Use d = 300 as the default dimension — higher gives diminishing returns
Use window c = 5-10, with c = 10 for semantic tasks and c = 5 for syntactic
More data is always better — prioritize data quantity over model tuning
One epoch is usually sufficient for large corpora (>1B words)

Evaluating phrase vectors

Beyond the analogy task, the paper evaluated phrase vectors on several other benchmarks:

SemEval 2012 Task 2: Measuring relational similarity (e.g., "mason is to stone as carpenter is to wood"). The phrase-enhanced model achieved higher Spearman correlation with human judgments than word-only models.

Qualitative analysis: The paper showed that phrase vectors cluster sensibly:

Sports teams cluster: "Chicago_Bulls," "Los_Angeles_Lakers," "Boston_Celtics" are nearest neighbors
Technology companies cluster: "Google," "Microsoft," "Apple," "Facebook" are mutually close
Political figures cluster: "Barack_Obama," "Bill_Clinton," "George_Bush" form a tight group

These clusters emerge entirely from the text — no knowledge base, no labeled data, no human annotation. The distributional hypothesis works for phrases just as it works for individual words.

Impact on phrase-level NLP

Phrase detection and phrase vectors influenced subsequent work in several ways:

Named entity handling: Treating "New York" as a single token improved NER systems by giving entities consistent representations.
Information retrieval: Searching with phrase-aware embeddings (where "New_York" has its own vector) improved document retrieval for multi-word queries.
Subword tokenization: The idea of treating common subword sequences as single tokens was later scaled up by BPE (Byte Pair Encoding) and SentencePiece, which are used in all modern transformers. BPE can be seen as a generalization of phrase detection to arbitrary subword units.

From phrases to subwords: Word2Vec's phrase detection used PMI to find multi-word expressions like "New_York." BPE (used by GPT, LLaMA, etc.) uses a similar idea at the character level — repeatedly merge the most frequent adjacent character pair. Both algorithms build a vocabulary of common sequences from corpus statistics. The intellectual lineage is direct: phrases → subwords → modern tokenization.

The negative sampling objective: a modern perspective

Looking back from 2024, negative sampling's most important contribution wasn't making Word2Vec fast — it was establishing contrastive learning as a paradigm. The insight that "learning by distinguishing positive from negative examples" is sufficient to learn useful representations has been verified in dozens of subsequent papers across NLP, vision, and multimodal AI.

Key lessons from negative sampling that generalize to modern contrastive learning:

Hard negatives matter: The 3/4 exponent boosts rare words as negatives, making them "harder" to distinguish from positives. Modern contrastive methods (HardCL, MoCo v2) similarly benefit from mining hard negatives.
k doesn't need to be large: k = 5 works well for Word2Vec. CLIP uses batch size as implicit k. SimCLR uses large batches (4096) as negatives. The trend: more negatives help, but diminishing returns set in quickly.
The noise distribution matters: Uniform sampling underperforms frequency-based sampling. In vision, using augmented versions of other images in the batch works better than random noise.
Exact probabilities aren't needed: NEG sacrifices exact softmax for better vectors. Modern methods (InfoNCE, NT-Xent) also use unnormalized objectives and achieve state-of-the-art results.

The elegance of negative sampling is that it solves two problems at once: it makes training tractable (by avoiding the full softmax) and it provides a natural way to define what "similar" and "dissimilar" mean (through the noise distribution). This dual purpose is why the contrastive learning paradigm has proven so durable.

What was the combined effect of negative sampling, subsampling, and phrase detection on word analogy accuracy?

Accuracy improved from 54% (hierarchical softmax, no subsampling) to 72% (NEG k=5, subsampling, phrases) — a 33% relative improvement — while also being faster to train Accuracy stayed the same but training was faster Accuracy improved from 90% to 99%

Chapter 9: Connections

What this paper built on

Word2Vec (Mikolov et al., 2013a): The original paper introducing CBOW and Skip-gram with hierarchical softmax. This follow-up provides the practical extensions that made Word2Vec the dominant word embedding method. See our companion veanor.

Noise Contrastive Estimation (Gutmann & Hyvarinen, 2010): The theoretical foundation for negative sampling. NCE provides the statistical guarantee; NEG simplifies it for the specific case of learning embeddings.

Pointwise Mutual Information: The phrase detection algorithm is based on PMI, a fundamental concept from information theory. Later work by Levy and Goldberg (2014) showed that Skip-gram with negative sampling implicitly factorizes a word-context PMI matrix.

What this paper enabled

GloVe (Pennington et al., 2014): Showed the connection between predictive methods (Word2Vec) and count-based methods (LSA). See our GloVe veanor.

FastText (Bojanowski et al., 2017): Extended Skip-gram with negative sampling to use character n-grams, enabling out-of-vocabulary word handling and better morphological representations.

Contrastive learning in deep learning: The negative sampling idea — learn by distinguishing positives from negatives — became foundational. SimCLR, MoCo, CLIP, and other contrastive methods all use the same principle: push positive pairs together and negative pairs apart, without computing a full normalization.

Word embeddings as features: Pre-trained Word2Vec/GloVe embeddings became the standard initialization for NLP models from 2013-2018, preceding the contextual embedding revolution (ELMo, BERT).

The implicit matrix factorization

Levy and Goldberg (2014) proved a remarkable theorem about Skip-gram NEG: when trained to convergence, the word and context vectors satisfy:

w_i · w'_j = PMI(i, j) − log k

where PMI(i, j) = log(P(i,j) / (P(i)P(j))) is the pointwise mutual information. This means Skip-gram NEG is implicitly factorizing a shifted PMI matrix. The "neural network" is doing matrix factorization in disguise.

This insight was hugely influential. It showed that the count-vs-predict debate was a false dichotomy — both approaches were optimizing closely related objectives. It also explained why word vectors have linear structure: PMI captures the log-odds of co-occurrence, and log-odds are additive by nature.

From Word2Vec to modern contrastive learning

The negative sampling paradigm can be summarized as: "learn representations by distinguishing real data from noise." This pattern appears throughout modern deep learning:

Method	Year	Positive pair	Negative pair
Word2Vec NEG	2013	(center word, context word)	(center word, random word)
SimCLR	2020	(image, augmented image)	(image, different image)
MoCo	2020	(query, matching key)	(query, queue of keys)
CLIP	2021	(image, matching text)	(image, wrong text)
DPO	2023	(prompt, preferred response)	(prompt, dispreferred response)

Every row follows the same template: push positive pairs together, push negative pairs apart. Word2Vec NEG was the prototype.

The complete Word2Vec recipe (2013)

Combining everything from both papers, the full Word2Vec training recipe is:

1. Preprocess

Tokenize corpus. Build vocabulary (min count = 5). Compute word frequencies.

↓

2. Phrase detection

2-4 passes of PMI-based bigram scoring. "New York" → "New_York".

↓

3. Build tables

Compute subsampling probabilities. Build 100M-entry unigram table for negative sampling.

↓

4. Train

Skip-gram with NEG (k=5). Linear LR decay 0.025 → 0.0001. Hogwild multi-threading.

↓

5. Output

Save W_in as word vectors. Binary format for gensim, text format for general use.

The complete training pipeline in Python

python
# Complete Word2Vec pipeline using gensim
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Step 1: Load tokenized corpus
corpus = [["new", "york", "is", "a", "big", "city"],
          ["ice", "cream", "is", "cold"],
          # ... millions more sentences]

# Step 2: Phrase detection
phrases = Phrases(corpus, min_count=5, threshold=100)
phraser = Phraser(phrases)
corpus_with_phrases = [phraser[sent] for sent in corpus]
# ["new_york", "is", "a", "big", "city"]

# Step 3-4: Train
model = Word2Vec(
    corpus_with_phrases,
    vector_size=300,
    window=5,
    min_count=5,
    sg=1,           # Skip-gram
    negative=5,     # k = 5 negative samples
    sample=1e-5,    # subsampling threshold
    workers=8,
    epochs=5,
)

# Step 5: Save and use
model.save("word2vec.model")
model.wv.save_word2vec_format("vectors.bin", binary=True)

# Use phrase vectors
print(model.wv.most_similar("new_york"))
# [('los_angeles', 0.82), ('chicago', 0.79), ('manhattan', 0.77), ...]

Cheat sheet

Negative Sampling

Replace O(V) softmax with O(k) binary classification. k=5 for large data. Noise dist: f(w)^3/4

Subsampling

Discard frequent words with P = 1 − √(t/f). Faster AND better. t = 10⁻⁵

Phrase Detection

PMI-based bigram scoring. score = (count_ij − δ) / (count_i · count_j). Iterative for n-grams

Compositionality

vec(A) + vec(B) ≈ vec(concept combining A and B). Works for compositional, not idiomatic phrases

Impact

54% → 72% analogy accuracy. Launched contrastive learning paradigm. Standard NLP features for 5+ years

How did the negative sampling idea influence modern deep learning beyond NLP?

The principle of learning by contrasting positives against negatives (without full normalization) became foundational for contrastive learning methods like SimCLR, MoCo, and CLIP in computer vision and multimodal AI It was only used for word embeddings and had no broader impact It influenced the design of attention mechanisms in transformers

Negative Sampling & Phrase Vectors