Mikolov, Sutskever, Chen, Corrado, Dean (Google) — NIPS 2013

Negative Sampling & Phrase Vectors

Three extensions that made Word2Vec practical at scale: negative sampling replaces the expensive softmax, subsampling discards frequent words, and phrase detection treats "New York" as a single token.

Prerequisites: Word2Vec basics + Softmax + Probability
10
Chapters
7+
Simulations

Chapter 0: The Bottleneck

The original Word2Vec paper showed that word vectors learned by Skip-gram produce stunning results — king − man + woman ≈ queen. But there was a computational elephant in the room: the softmax over the entire vocabulary.

Recall the Skip-gram probability:

P(wO | wI) = exp(v'wO · vwI) / ∑j=1V exp(v'j · vwI)

The numerator is cheap: one dot product. The denominator is catastrophic: V dot products, V exponentials, and a sum. For V = 1 million words, every single training example requires 1 million dot products. With a corpus of 6 billion words and window size c = 5, that is ~60 billion training pairs, each requiring ~1 million operations. This is 6 × 1016 floating-point operations — utterly infeasible.

The first paper introduced hierarchical softmax (O(log V) per step) as a solution. But this follow-up paper presents something even better for Skip-gram: negative sampling. Instead of computing a probability over all V words, you only update a small number of "negative" (incorrect) words alongside the correct one.

The Softmax Bottleneck

Each training step requires scoring ALL V words. With negative sampling, you only score k+1 words (1 positive + k negatives). Drag k to see the cost reduction.

Neg. samples k 5
The question this paper answers: Can we replace the softmax (which normalizes over all V words) with something that only touches a small constant number of words per step — without sacrificing vector quality? The answer is yes, via negative sampling. And while we're at it, can we also handle phrases ("New York", "ice cream") as single tokens? Also yes, via a clever statistical detection algorithm.

Timeline of the problem

Let's put the cost in concrete terms. Training Skip-gram with full softmax on 6 billion words:

StepOperationsTime estimate
Per training pairV × d = 106 × 300 = 3 × 108 FLOPs~1 ms
Per word position2c pairs × 3 × 108 = 3 × 109 FLOPs~10 ms
Full corpus6 × 109 positions × 3 × 109 = 1.8 × 1019 FLOPs~570 years on 1 CPU core

With negative sampling (k=5): each training pair costs (k+1) × d = 1800 FLOPs. The full corpus processes in 6 × 109 × 10 × 1800 = 1.08 × 1014 FLOPs ≈ ~1 day on a multi-core machine. That is a factor of 166,000x reduction in compute.

Three approaches to the bottleneck

The field explored three solutions, each with different tradeoffs:

MethodCost per stepExact?Best for
Full softmaxO(V · d)YesSmall vocabularies only
Hierarchical softmaxO(log V · d)YesCBOW, moderate vocab
NCEO(k · d)Approx (converges)Density estimation
Negative samplingO(k · d)No (but better vectors)Skip-gram, large data

The paper showed that NEG — the least theoretically rigorous method — produces the best word vectors. Sometimes practical performance matters more than theoretical guarantees.

Why is the full softmax denominator the bottleneck in Skip-gram training?

Chapter 1: Noise Contrastive Estimation

Before we get to negative sampling, we need to understand the idea it simplifies: Noise Contrastive Estimation (NCE), introduced by Gutmann and Hyvarinen (2010).

The core idea of NCE is to convert a density estimation problem (estimating P(w|context)) into a binary classification problem: given a word-context pair, is this a real pair from the data, or a fake pair generated by a noise distribution?

The setup

We have two sources of word-context pairs:

  1. Data distribution: Real (center, context) pairs from the corpus. Labeled as positive (D = 1).
  2. Noise distribution: Fake pairs where the context word is drawn randomly from a noise distribution Pn(w). Labeled as negative (D = 0).

For each real pair, we generate k noise samples. So the ratio of real to noise examples is 1 : k.

The NCE objective

We train a binary classifier to distinguish real from noise. Using Bayes' rule, the posterior probability that a pair is real (given the pair and the mixture ratio) is:

P(D=1 | w, c) = P(w|c) / (P(w|c) + k · Pn(w))

NCE parameterizes log P(w|c) with a model (our dot product: v'w · vc) and maximizes the log-likelihood of correctly classifying real vs. noise pairs.

LNCE = log P(D=1 | wO, wI) + ∑i=1k Ewi ~ Pn[log P(D=0 | wi, wI)]
Why NCE works: Gutmann and Hyvarinen proved that as k → ∞, the NCE estimator converges to the true data distribution. With finite k, it's an approximation — but a very good one. The key insight is that you don't need to normalize over all V words. You just need to distinguish real data from noise. The normalizing constant is implicitly learned as part of the model parameters.

NCE's guarantee

NCE is statistically consistent: it approximates the true softmax distribution. As k increases, the approximation improves. Gutmann and Hyvarinen showed that for k = V (one noise sample per vocabulary word), NCE recovers the exact maximum likelihood estimate. But the whole point is to use k ≪ V.

The NCE loss in detail

Let's write the full NCE loss for a single (center, context) pair. Denote the model's log-probability as s(w, c) = v'w · vc (the unnormalized score). NCE treats the normalization constant Z as a learnable parameter (or fixes it to 1). The posterior is:

P(D=1 | w, c) = exp(s(w,c)) / (exp(s(w,c)) + k · Pn(w))

The loss for one positive pair and k noise samples is:

L = −log P(D=1 | w+, c) − ∑i=1k log P(D=0 | wi, c)
= −log P(D=1 | w+, c) − ∑i=1k log(1 − P(D=1 | wi, c))

Notice that NCE needs to compute log Pn(w) for each noise sample inside the posterior. This is an extra lookup step. Negative sampling drops this entirely.

NCE: Real vs. Noise Classification

For each real (center, context) pair, generate k noise pairs by sampling random words. Train a classifier to distinguish them. More noise samples = better approximation.

Noise samples k 5
What is the core idea of Noise Contrastive Estimation?

Chapter 2: Negative Sampling

NCE is theoretically beautiful but carries overhead: it needs to compute the noise distribution probability Pn(w) and the posterior ratio at each step. Mikolov et al. asked: for learning word vectors, do we even need the full NCE guarantee?

The answer: no. Since we only care about vector quality (not recovering the exact probability distribution), we can simplify NCE further. The result is Negative Sampling (NEG) — a stripped-down variant that works spectacularly well in practice.

From NCE to NEG

NCE's classifier computes:

P(D=1 | w, c) = σ(v'w · vc − log(k · Pn(w)))

Negative sampling simplifies this by dropping the log(k · Pn(w)) correction term:

P(D=1 | w, c) ≈ σ(v'w · vc)

This is a raw sigmoid on the dot product. No noise distribution correction. No normalization constant. Just: "are these two vectors compatible?"

Why the simplification works

The correction term log(k · Pn(w)) compensates for the fact that frequent words appear more often as noise samples. Dropping it means the model won't recover the exact softmax probabilities. But for word vector quality, this doesn't matter. The embeddings are defined by the relative geometry of vectors, not by the exact probabilities. Negative sampling learns the same geometry as full softmax — it just doesn't guarantee the same probabilities.

The trade-off: NCE is an approximation to the softmax that converges to the exact answer as k → ∞. Negative sampling is an approximation to NCE that never converges to the exact softmax, even with infinite k. But it produces vectors of equal or better quality with less computation. Sometimes the right answer is to optimize for what you actually want (good vectors) rather than a proxy (exact probabilities).

The noise distribution

Both NCE and NEG need a noise distribution Pn(w) to sample negative words from. The paper found that the unigram distribution raised to the 3/4 power works best:

Pn(w) = f(w)3/4 / ∑j f(j)3/4

where f(w) is the frequency of word w in the corpus.

Why 3/4? Raw unigram sampling (exponent 1) samples too many frequent words like "the," "a," "is." Uniform sampling (exponent 0) samples too many rare words. The 3/4 power is a sweet spot that slightly over-represents rare words relative to their frequency, giving the model more informative negatives.

Noise Distribution: Unigramα

Compare sampling probabilities under different exponents α. At α=3/4, rare words get boosted relative to raw frequency, giving more informative negatives.

Exponent α 0.75
python
import numpy as np

def build_noise_distribution(word_freqs, alpha=0.75):
    """
    word_freqs: array of word frequencies, shape (V,)
    alpha: exponent (0.75 = paper's recommended value)
    Returns: sampling probabilities, shape (V,)
    """
    # Raise frequencies to the 3/4 power
    powered = word_freqs ** alpha     # shape: (V,)

    # Normalize to get a probability distribution
    noise_dist = powered / powered.sum()

    return noise_dist

# Example: "the" is 100x more frequent than "queen"
freqs = np.array([1000, 500, 200, 50, 10])  # the, of, cat, queen, zinc
words = ["the", "of", "cat", "queen", "zinc"]

raw = freqs / freqs.sum()             # [0.568, 0.284, 0.114, 0.028, 0.006]
neg = build_noise_distribution(freqs)  # [0.425, 0.259, 0.140, 0.058, 0.018] — rare words boosted

for w, r, n in zip(words, raw, neg):
    print(f"{w:8s}  raw: {r:.3f}  neg: {n:.3f}  boost: {n/r:.2f}x")
Why does negative sampling use f(w)3/4 instead of raw word frequency for the noise distribution?

Chapter 3: The NEG Objective

Now we can write the full negative sampling objective. This is the equation that replaced the softmax and made Word2Vec trainable on billions of words.

The objective

For a (center word wI, context word wO) pair, the NEG objective is:

LNEG = log σ(v'wO · vwI) + ∑i=1k Ewi ~ Pn[log σ(−v'wi · vwI)]

Let's unpack each term:

Term 1: log σ(v'wO · vwI) — the positive term. The sigmoid σ(x) = 1/(1 + exp(−x)). This is high when v'wO and vwI point in similar directions (large positive dot product). The loss encourages the output vector of the true context word to be close to the input vector of the center word.

Term 2: ∑ log σ(−v'wi · vwI) — the negative terms. Note the minus sign inside the sigmoid. σ(−x) = 1 − σ(x). This is high when v'wi and vwI point in different directions (small or negative dot product). The loss encourages the output vectors of noise words to be far from the input vector of the center word.

In plain English: "Push the real context word's vector toward the center word's vector. Push k randomly chosen words' vectors away from the center word's vector." That's it. No normalization. No partition function. No sum over V. Just k+1 dot products and k+1 sigmoid operations per training step.

The gradient

The gradient with respect to the output vector v'w is remarkably simple:

∂L/∂v'wO = (σ(v'wO · vwI) − 1) · vwI
∂L/∂v'wi = σ(v'wi · vwI) · vwI

For the positive word: the gradient is proportional to (prediction − 1) · input vector. It pushes v'wO toward vwI. For each negative word: the gradient is proportional to (prediction − 0) · input vector. It pushes v'wi away from vwI.

Computational cost

Full softmax: O(V · d) per step
Negative sampling: O(k · d) per step

For V = 1,000,000 and k = 5: that is a 200,000x speedup. The paper recommends k = 5-20 for large datasets and k = 5-50 for small datasets.

Worked example: one NEG step

Let d = 3, center = "sat" with vsat = [0.5, −0.2, 0.3]. True context = "cat" with v'cat = [0.1, 0.4, −0.1]. Negative sample = "tree" with v'tree = [−0.3, 0.1, 0.6].

Positive term:

dot+ = v'cat · vsat = 0.05 − 0.08 − 0.03 = −0.06
σ(−0.06) = 0.485
L+ = −log(0.485) = 0.724

Gradient: (σ(−0.06) − 1) · vsat = −0.515 · [0.5, −0.2, 0.3] = [−0.258, 0.103, −0.155]. This pushes v'cat toward vsat.

Negative term:

dot = v'tree · vsat = −0.15 − 0.02 + 0.18 = 0.01
σ(0.01) = 0.5025
L = −log(1 − 0.5025) = −log(0.4975) = 0.698

Gradient: σ(0.01) · vsat = 0.5025 · [0.5, −0.2, 0.3] = [0.251, −0.101, 0.151]. This pushes v'tree away from vsat.

Total loss: L = 0.724 + 0.698 = 1.422. After updating with learning rate 0.1, v'cat moves 0.026 closer to vsat and v'tree moves 0.025 away.

Negative Sampling Training Step

For center word "sat", the true context "cat" is pushed closer (green arrows), while k random negative words are pushed away (red arrows). Click "Train Step" to see an update.

k 5
Step 0
python
import numpy as np

def neg_sampling_loss(center_id, context_id, neg_ids, W_in, W_out):
    """
    center_id:  int — center word
    context_id: int — true context word (positive)
    neg_ids:    list of int — k negative samples
    W_in:  (V, d) — input embeddings
    W_out: (V, d) — output embeddings
    Returns: loss, gradient for W_in[center_id], gradients for W_out rows
    """
    v_c = W_in[center_id]            # (d,) — center embedding

    # Positive term: push context word toward center
    v_pos = W_out[context_id]        # (d,)
    dot_pos = v_pos @ v_c            # scalar
    sig_pos = 1 / (1 + np.exp(-dot_pos))
    loss = -np.log(sig_pos + 1e-10)

    # Gradient for positive output vector
    grad_pos = (sig_pos - 1) * v_c   # (d,) — points toward center
    grad_center = (sig_pos - 1) * v_pos

    # Negative terms: push noise words away from center
    for neg_id in neg_ids:
        v_neg = W_out[neg_id]         # (d,)
        dot_neg = v_neg @ v_c         # scalar
        sig_neg = 1 / (1 + np.exp(-dot_neg))
        loss -= np.log(1 - sig_neg + 1e-10)

        # Gradient: push this negative word AWAY from center
        grad_neg = sig_neg * v_c      # (d,)
        W_out[neg_id] -= lr * grad_neg
        grad_center += sig_neg * v_neg

    # Update
    W_out[context_id] -= lr * grad_pos
    W_in[center_id] -= lr * grad_center

    return loss

# Total cost per step: (k+1) dot products + (k+1) sigmoid operations
# For k=5, d=300: 6 * 300 = 1,800 FLOPs vs. 1,000,000 * 300 for softmax

Complete PyTorch implementation

python
import torch
import torch.nn as nn
import numpy as np
from collections import Counter

class Word2VecDataset(torch.utils.data.Dataset):
    def __init__(self, corpus, vocab, window=5, k=5, t=1e-5):
        """
        corpus: list of list of int (tokenized sentences)
        vocab: dict mapping word -> index
        """
        self.k = k
        self.pairs = []

        # Build frequency table for subsampling + noise dist
        freq = Counter()
        total = 0
        for sent in corpus:
            for w in sent:
                freq[w] += 1
                total += 1

        # Subsampling probabilities
        self.keep_prob = {}
        for w, c in freq.items():
            f = c / total
            self.keep_prob[w] = np.sqrt(t / f) if f > t else 1.0

        # Noise distribution: f(w)^(3/4)
        freqs = np.array([freq.get(i, 0) for i in range(len(vocab))])
        powered = freqs ** 0.75
        self.noise_dist = powered / powered.sum()

        # Generate training pairs with subsampling
        for sent in corpus:
            # Apply subsampling
            filtered = [w for w in sent if np.random.random() < self.keep_prob.get(w, 1)]
            for i, center in enumerate(filtered):
                for j in range(max(0, i-window), min(len(filtered), i+window+1)):
                    if i != j:
                        self.pairs.append((center, filtered[j]))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        center, context = self.pairs[idx]
        # Sample k negatives from noise distribution
        negatives = np.random.choice(len(self.noise_dist), size=self.k, p=self.noise_dist)
        return (torch.tensor(center), torch.tensor(context),
                torch.tensor(negatives, dtype=torch.long))


class SkipGramNEG(nn.Module):
    def __init__(self, V, d=300):
        super().__init__()
        self.W_in = nn.Embedding(V, d)
        self.W_out = nn.Embedding(V, d)
        nn.init.uniform_(self.W_in.weight, -0.5/d, 0.5/d)
        nn.init.zeros_(self.W_out.weight)

    def forward(self, center, context, negatives):
        v_c = self.W_in(center)                    # (B, d)
        v_pos = self.W_out(context)                # (B, d)
        v_neg = self.W_out(negatives)              # (B, k, d)

        pos_score = (v_c * v_pos).sum(1)           # (B,)
        pos_loss = -torch.nn.functional.logsigmoid(pos_score)

        neg_score = torch.bmm(v_neg, v_c.unsqueeze(2)).squeeze(2)  # (B, k)
        neg_loss = -torch.nn.functional.logsigmoid(-neg_score).sum(1)

        return (pos_loss + neg_loss).mean()

# Training loop
model = SkipGramNEG(V=10000, d=300)
optimizer = torch.optim.SparseAdam(model.parameters(), lr=0.025)

for epoch in range(5):
    for center, context, negatives in dataloader:
        loss = model(center, context, negatives)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Extract word vectors
word_vectors = model.W_in.weight.detach().numpy()  # (V, 300)
Sparse updates matter. In Skip-gram NEG, each training step only modifies k+2 embedding rows (1 center + 1 positive + k negatives). PyTorch's SparseAdam exploits this sparsity for faster training. Standard Adam would update all V rows due to how momentum works, wasting computation on untouched embeddings.

The learning rate schedule

The original Word2Vec implementation uses a linear learning rate decay: the learning rate starts at 0.025 and decreases linearly to near zero over the course of training. This is unusual — most modern systems use constant LR with warm-up, or cosine decay. But linear decay works well for Word2Vec because:

python
# Linear learning rate decay in Word2Vec
def get_lr(current_word, total_words, initial_lr=0.025, min_lr=0.0001):
    """Linear decay from initial_lr to min_lr over the corpus."""
    progress = current_word / total_words
    return max(min_lr, initial_lr * (1 - progress))

# At the start:  lr = 0.025 (big steps, fast learning)
# At 50% done:   lr = 0.0125 (moderate steps)
# At 99% done:   lr = 0.00025 (tiny adjustments)

Multi-threaded training

The original C implementation uses Hogwild-style asynchronous SGD: multiple threads read from the same corpus file at different positions and update the shared embedding matrices without locks. This sounds chaotic — won't threads overwrite each other's updates? In practice, it works because:

Memory layout for performance

The original C implementation stores word vectors as a contiguous float array of size V × d. Each vector lookup is pointer arithmetic: syn0 + word_id * d. No hash table, no indirection. For V = 1M and d = 300 in float32, the two embedding matrices total 2 × 1M × 300 × 4 = 2.4 GB — comfortably fitting in RAM on 2013 hardware.

Modern PyTorch implementations use nn.Embedding which has the same memory layout but adds autograd bookkeeping. For maximum training speed, the original C code outperforms PyTorch by ~3-5x due to avoiding Python overhead and using raw BLAS operations.

The word2vec.c implementation

The original implementation by Mikolov is a single C file of ~650 lines. Key design decisions:

c
// Simplified core of word2vec.c (Skip-gram with NEG)
// The actual training loop for one (center, context) pair

// Positive example: push context toward center
for (c = 0; c < layer1_size; c++)
  neu1e[c] = 0;

for (d = 0; d < negative + 1; d++) {
  if (d == 0) {
    target = word;    // positive: true context word
    label = 1;
  } else {
    target = table[rand() % table_size];  // O(1) negative sample
    label = 0;
  }
  l2 = target * layer1_size;  // pointer to output vector
  f = 0;
  for (c = 0; c < layer1_size; c++)
    f += syn0[l1 + c] * syn1neg[l2 + c];  // dot product
  g = (label - expTable[sigmoid_index(f)]) * alpha;  // gradient
  for (c = 0; c < layer1_size; c++)
    neu1e[c] += g * syn1neg[l2 + c];  // accumulate input gradient
  for (c = 0; c < layer1_size; c++)
    syn1neg[l2 + c] += g * syn0[l1 + c];  // update output vector
}
// Update input vector
for (c = 0; c < layer1_size; c++)
  syn0[l1 + c] += neu1e[c];
What does the negative sampling objective do in plain English?

Chapter 4: Subsampling Frequent Words

In a typical English corpus, words like "the," "a," "in," and "is" appear millions of times. The word "the" might account for 7% of all tokens. This creates two problems for Word2Vec:

  1. Wasted computation. The pair ("the", "cat") appears thousands of times. After the first few hundred, the model has learned everything it can from this pair. The remaining thousands of examples contribute almost zero gradient but still cost compute.
  2. Semantic dilution. "The" appears in the context of every word. Its context is so broad that it carries almost no information about any specific word. Including "the" in every window adds noise rather than signal.

The paper's solution: randomly discard frequent words during training, with a probability that depends on word frequency.

The subsampling formula

Each word w in the training corpus is discarded with probability:

P(discard w) = 1 − √(t / f(w))

where f(w) is the frequency of word w (fraction of total tokens) and t is a threshold, typically t = 10−5.

How this works in practice

Consider a word with frequency f(w) = 0.01 (1% of corpus, like "of") and threshold t = 10−5:

P(discard "of") = 1 − √(10−5 / 0.01) = 1 − √0.001 = 1 − 0.032 = 0.968

So 96.8% of occurrences of "of" are discarded! Only 3.2% survive.

For a word with frequency f(w) = 10−5 (rare word, same as threshold):

P(discard rare) = 1 − √(10−5 / 10−5) = 1 − 1 = 0

Rare words are never discarded.

Why √ and not linear? The square root creates a smooth transition. A linear discard (proportional to frequency) would be too aggressive — it would discard most occurrences of moderately common words. The square root is gentler: even very common words retain a meaningful fraction of their occurrences, but they no longer dominate the training data.
Subsampling by Frequency

The bar chart shows how many occurrences survive subsampling. Frequent words (left) are aggressively pruned; rare words (right) are untouched. Drag the threshold t.

Threshold t 10-5
python
import numpy as np

def subsample(word_freqs, threshold=1e-5):
    """
    word_freqs: dict mapping word -> frequency (fraction of total tokens)
    Returns: dict mapping word -> keep probability
    """
    keep_probs = {}
    for word, freq in word_freqs.items():
        if freq > threshold:
            keep_probs[word] = np.sqrt(threshold / freq)
        else:
            keep_probs[word] = 1.0  # rare words: always keep
    return keep_probs

# Example
freqs = {"the": 0.07, "of": 0.03, "cat": 0.0001, "queen": 0.00001}
keep = subsample(freqs)
# the: 1.2% kept, of: 1.8% kept, cat: 31.6% kept, queen: 100% kept

The double benefit

Subsampling helps in two ways simultaneously:

  1. Speed: The effective corpus size shrinks dramatically. With aggressive subsampling, training is 2-10x faster.
  2. Quality: By removing "the" from most context windows, the remaining context words are more informative. The paper reports that subsampling improved analogy accuracy by several percentage points — it's not just faster, it's better.

Worked example: effect on context windows

Original sentence: "the cat sat on the fluffy mat"

With t = 10−5, most occurrences of "the" and "on" are discarded. A likely surviving version:

After subsampling: "cat sat fluffy mat"

Now with window c = 2, the context of "sat" is {cat, fluffy, mat} instead of {the, cat, on, the}. Every context word is meaningful. "The" and "on" added only noise — they co-occur with everything and carry no discriminative information.

Before and After Subsampling

Frequent words like "the" and "on" are removed, leaving only informative context words. The effective window now connects content words that were previously separated by function words.

Effective window expansion

Subsampling has an additional hidden benefit: it effectively increases the context window. When frequent words are removed, distant words that were outside the original window now become visible. If "the" between "cat" and "fluffy" is removed, those words become adjacent, effectively making c larger for rare, informative words.

Numbers from the paper: On the Google News corpus (6B tokens), subsampling with t = 10−5 discarded 78% of "the" occurrences but only 0.3% of "queen." The effective corpus shrank from 6B to ~2.5B tokens, while analogy accuracy improved from 56% to 61%. Training time dropped proportionally — a rare "free lunch" in ML.

Subsampling formula derivation

The subsampling formula P(discard) = 1 − √(t/f(w)) was chosen empirically, but it has an intuitive derivation. We want a keep probability that:

The √(t/f) formula approximately satisfies the last condition. If word w appears N · f(w) times in a corpus of N tokens, after subsampling it appears approximately:

N · f(w) · √(t / f(w)) = N · √(t · f(w))

For very frequent words (f ≫ t), the surviving count grows as √f — much slower than linear f. For a word appearing 107 times, only about 103.5 ≈ 3,162 occurrences survive. This is still enough to learn a good vector, but doesn't waste compute on redundant examples.

python
# Visualizing the effect of subsampling on a Zipfian distribution
import numpy as np

# Zipf's law: frequency of rank-r word ≈ C/r
V = 100000
ranks = np.arange(1, V+1)
freqs = 1.0 / ranks
freqs = freqs / freqs.sum()  # normalize to probabilities

t = 1e-5
keep_probs = np.sqrt(t / freqs)
keep_probs = np.clip(keep_probs, 0, 1)

# Top 10 words: original vs. effective frequency
for i in range(10):
    print(f"Rank {i+1}: freq={freqs[i]:.6f} keep={keep_probs[i]:.4f} effective={freqs[i]*keep_probs[i]:.6f}")

# Output shows that rank-1 word goes from 8.3% to 0.03% of effective corpus
# while rank-1000 word goes from 0.008% to 0.003% — much less dramatic
Why does subsampling frequent words actually improve word vector quality, not just speed up training?

Chapter 5: Phrase Detection

Word2Vec treats every token as atomic, but many natural concepts span multiple words. "New York" is not the sum of "New" and "York." "Ice cream" has nothing to do with "ice" or "cream" individually. "Machine learning" is a concept, not two separate words.

The paper introduces a simple, elegant algorithm to automatically detect phrases in the corpus and treat them as single tokens.

The phrase scoring function

For each pair of adjacent words (wi, wj), compute a score based on pointwise mutual information (PMI):

score(wi, wj) = (count(wi, wj) − δ) / (count(wi) · count(wj))

Where:

The intuition

If "New" appears 500,000 times and "York" appears 200,000 times in a corpus of 1 billion words, random chance would predict them appearing together roughly (500,000/109) × (200,000/109) × 109 = 100 times. If "New York" actually appears 150,000 times, it's occurring 1,500x more often than chance predicts. That is a phrase.

Meanwhile, "this is" might appear 500,000 times, but "this" appears 10 million times and "is" appears 8 million times. By chance alone, we'd expect (10M/109) × (8M/109) × 109 = 80,000. So "this is" is only ~6x above chance. Not a phrase.

Iterative detection

The algorithm runs multiple passes. In the first pass, it detects bigrams ("New_York", "ice_cream"). In the second pass, it can detect trigrams by combining bigrams with adjacent words ("New_York_City"). In practice, 2-4 passes capture most useful phrases.

Worked example: scoring bigrams

Corpus size: 1 billion words. Consider two candidate phrases:

"New York":

count("New") = 500,000,   count("York") = 200,000,   count("New York") = 150,000
score = (150,000 − 5) / (500,000 × 200,000) × 109 = 150,000 / 1011 × 109 = 1,500

Score of 1,500 is far above any reasonable threshold. "New York" is definitely a phrase.

"this is":

count("this") = 10,000,000,   count("is") = 8,000,000,   count("this is") = 500,000
score = (500,000 − 5) / (10,000,000 × 8,000,000) × 109 = 500,000 / 8 × 1013 × 109 = 6.25

Score of 6.25 is well below typical thresholds (δ ≈ 100). "This is" is just two common words that happen to co-occur.

The δ discounting parameter

The δ parameter in the numerator (count(wi, wj) − δ) serves as a minimum count threshold. If a bigram appears fewer than δ times, the numerator becomes negative and the pair is automatically rejected. This prevents statistical flukes from creating spurious phrases. The paper uses δ = 5 as a default.

The δ parameter: The discounting threshold δ prevents rare word pairs from being treated as phrases. If two rare words happen to appear together once or twice, their PMI will be very high (since their expected co-occurrence is near zero) but the evidence is too weak to justify calling it a phrase. δ effectively requires a minimum count before a bigram can be considered.
Phrase Detection via PMI

Bigram scores based on pointwise mutual information. High scores (teal) indicate genuine phrases; low scores (dim) are coincidental word pairs. Drag the threshold to see which bigrams qualify.

Threshold 50
python
def detect_phrases(corpus, min_count=5, threshold=100, delta=5):
    """
    corpus: list of list of str (tokenized sentences)
    Returns: corpus with phrases joined by '_'
    """
    from collections import Counter

    # Count unigrams and bigrams
    unigrams = Counter()
    bigrams = Counter()
    total = 0

    for sentence in corpus:
        for i, word in enumerate(sentence):
            unigrams[word] += 1
            total += 1
            if i < len(sentence) - 1:
                bigrams[(word, sentence[i+1])] += 1

    # Score each bigram
    phrases = set()
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        score = (count - delta) / (unigrams[w1] * unigrams[w2]) * total
        if score > threshold:
            phrases.add((w1, w2))

    # Replace bigrams with phrases
    new_corpus = []
    for sentence in corpus:
        new_sent = []
        i = 0
        while i < len(sentence):
            if i < len(sentence) - 1 and (sentence[i], sentence[i+1]) in phrases:
                new_sent.append(sentence[i] + "_" + sentence[i+1])
                i += 2
            else:
                new_sent.append(sentence[i])
                i += 1
        new_corpus.append(new_sent)

    return new_corpus
How does the phrase detection algorithm decide that "New York" is a phrase but "this is" is not?

Chapter 6: Phrase Vectors

Once phrases are detected and the corpus is preprocessed (replacing "New" "York" with "New_York"), we simply train Word2Vec as before. The phrase tokens are treated exactly like any other word. They get their own embedding vectors, learned from context just like single words.

What phrase vectors learn

The beautiful thing is that phrase vectors capture meaning that is not the sum of their parts. The vector for "New_York" is not simply vec("New") + vec("York"). It's a new vector that sits in its own location in the embedding space, near other city names and geographic entities.

Examples from the paper:

PhraseNearest neighbors
New_YorkNew_York_City, Los_Angeles, Chicago, Manhattan
ice_creamfrozen_yogurt, gelato, dessert, sorbet
machine_learningdeep_learning, NLP, computer_vision, AI
Barack_ObamaObama, president, White_House, Biden
hot_doghamburger, sandwich, sausage, pizza

Phrase analogies

The analogy property extends to phrases:

vec("Montreal_Canadiens") − vec("Montreal") + vec("Toronto") ≈ vec("Toronto_Maple_Leafs")
vec("German") − vec("Germany") + vec("France") ≈ vec("French")

This works because phrases, like words, live in the same vector space and are learned from the same distributional signal. A phrase's context defines its meaning, just like a word's context defines its meaning.

Vocabulary expansion: Phrase detection dramatically increases the effective vocabulary without increasing model complexity. A corpus with 1M unique words might yield 2M unique tokens (words + phrases) after phrase detection. Each token gets its own d-dimensional vector. The embedding matrix grows from (1M, d) to (2M, d), but the per-step cost stays O(k · d) with negative sampling — independent of vocabulary size.

Phrase detection statistics

Running phrase detection on the Google News corpus (33B tokens):

PassNew phrases foundExample
1st pass (bigrams)~500,000New_York, ice_cream, machine_learning
2nd pass (trigrams)~200,000New_York_City, ice_cream_cone, New_York_Times
3rd pass (4-grams)~50,000New_York_Stock_Exchange
4th pass~10,000Diminishing returns

The original vocabulary of ~1M words expanded to ~3M tokens (words + phrases). But only ~500K of these phrases appeared frequently enough to learn reliable vectors.

Phrase analogies work too

The paper discovered that phrase vectors support the same analogy arithmetic as word vectors:

vec("Montreal_Canadiens") − vec("Montreal") + vec("Toronto") ≈ vec("Toronto_Maple_Leafs")
vec("Steve_Ballmer") − vec("Microsoft") + vec("Apple") ≈ vec("Steve_Jobs")
vec("New_Zealand") − vec("Wellington") + vec("Canberra") ≈ vec("Australia")

These analogies work because phrases live in the same vector space as words and are trained with the same objective. The distributional hypothesis applies equally: "Montreal_Canadiens" appears in similar contexts to other hockey teams.

Phrase Embedding Space

Phrases cluster near semantically related words and other phrases. "New_York" is near other cities, not near "new" or "york" individually. Click a cluster to highlight.

Why is the phrase vector for "New_York" different from vec("New") + vec("York")?

Chapter 7: Compositionality

One of the most fascinating findings of the paper is about additive compositionality: the meaning of short phrases can sometimes be approximated by vector addition of their component words.

This seems to contradict what we just said about "New_York" ≠ vec("New") + vec("York"). So what's going on?

Two types of phrases

The paper distinguishes between non-compositional and compositional phrases:

TypeExamplevec(phrase) ≈ sum of parts?
Non-compositional"hot dog", "New York", "kick the bucket"No — meaning unrelated to parts
Compositional"strong coffee", "big city", "fast car"Yes — meaning is roughly additive

For compositional phrases, the paper shows that:

vec("Czech") + vec("currency") ≈ vec("koruna")
vec("Vietnam") + vec("capital") ≈ vec("Hanoi")

Why does addition work for compositional meaning?

The paper offers an intuitive explanation rooted in the training objective. In Skip-gram with negative sampling, the training signal for a word is determined by its context. If two words w1 and w2 each independently predict a set of context words, then the vector v1 + v2 will have high dot product with context words of both w1 and w2.

The sum vec("Czech") + vec("currency") activates context words related to both Czech things and currency things. The word that best matches this intersection is "koruna" — the Czech currency. The addition acts as an AND operation in the semantic space.

Limitation: Additive compositionality is approximate and breaks down for long phrases, idiomatic expressions, and negation. vec("not good") ≠ vec("bad"). vec("kicked the bucket") ≠ vec("died"). These require more sophisticated composition functions — which later models like LSTMs and Transformers would provide. Word2Vec's additive compositionality is a beautiful special case, not a general theory of meaning composition.

Compositionality failure modes

Understanding when addition fails is as instructive as understanding when it works:

QueryExpectedActual nearestWhy it fails
not + goodbadgreatNegation is not additive; "not" doesn't negate in vector space
hot + doghot_dogpuppyIdiomatic meaning unrelated to component words
red + herringdistractionsalmonMetaphorical usage; literal meanings dominate
kick + bucketdiepailIdiom; literal combination wins

The pattern: addition works for compositional meanings (Czech + currency = koruna) but fails for idiomatic, metaphorical, or negated meanings. This is a fundamental limitation of static word vectors — they encode distributional similarity, not compositional semantics.

Why addition works: a formal argument

In Skip-gram, the training objective maximizes the dot product between co-occurring words. If words A and B independently co-occur with a set of words S, then:

vA · v's ≈ log P(s | A) for s ∈ S(A)
vB · v's ≈ log P(s | B) for s ∈ S(B)

The sum vector vA + vB has high dot product with any word s that is in both S(A) and S(B):

(vA + vB) · v's = vA · v's + vB · v's

This sum is large only when both terms are large — i.e., when s co-occurs with both A and B. This is an AND operation in the embedding space. The word whose context best matches the intersection of A's and B's contexts will have the highest similarity to vA + vB.

Worked example

Why does vec("Czech") + vec("currency") ≈ vec("koruna")?

This only works because "koruna" is genuinely at the intersection of Czech-related and currency-related contexts. For non-compositional phrases like "hot dog," the intersection of hot-related and dog-related contexts gives "puppy" or "warm," not "hamburger."

Compositionality test set

The paper created a test set of ~3,000 compositional queries to evaluate phrase vectors:

CategoryExample queryExpected answerAccuracy
Country + currencyCzech + currencykoruna62%
Country + capitalVietnam + capitalHanoi58%
Country + languageJapan + languageJapanese71%
City + stateDallas + stateTexas49%
Person + fieldEinstein + fieldphysics45%

Country-language pairs work best (71% accuracy) because the relationship is consistent and well-represented in news text. City-state and person-field are harder because the associations are more ambiguous.

python
# Testing compositionality
def test_composition(word_a, word_b, expected, word_vectors, top_k=10):
    """Test if vec(a) + vec(b) ≈ vec(expected)"""
    target = word_vectors[word_a] + word_vectors[word_b]
    target = target / (np.linalg.norm(target) + 1e-10)

    # Find nearest neighbors to the sum
    sims = {}
    for word, vec in word_vectors.items():
        if word in {word_a, word_b}:
            continue
        vec_norm = vec / (np.linalg.norm(vec) + 1e-10)
        sims[word] = np.dot(target, vec_norm)

    ranked = sorted(sims.items(), key=lambda x: -x[1])[:top_k]
    rank = None
    for i, (w, s) in enumerate(ranked):
        if w == expected:
            rank = i + 1
            break

    print(f"{word_a} + {word_b} → {ranked[0][0]} (expected: {expected}, rank: {rank})")
    return rank == 1  # Correct if expected word is #1

# Example
test_composition("Czech", "currency", "koruna", vectors)
# Czech + currency → koruna (expected: koruna, rank: 1)
Additive Composition Explorer

vec(A) + vec(B) ≈ vec(C). The sum of two word vectors lands near the word that captures their intersection. Try different pairs to see which compositions work.

Why does vec("Czech") + vec("currency") approximate vec("koruna")?

Chapter 8: Results

The paper evaluates the three extensions (negative sampling, subsampling, phrases) on the word analogy task from the original Word2Vec paper, extended with a phrase analogy section.

Negative sampling vs. alternatives

MethodSemantic Acc %Syntactic Acc %Total Acc %Training Speed
Hierarchical Softmax5355541x
NCE (k=5)525453~1.3x
NEG (k=5)616161~1.5x
NEG (k=15)635861~1x

Negative sampling with k=5 outperforms both hierarchical softmax and NCE on analogy accuracy, while being the fastest to train. The simpler objective (no correction terms, no tree) learns better vectors.

Effect of subsampling

SubsamplingSemantic Acc %Syntactic Acc %Speed
No subsampling56541x
t = 10-56161~2.5x

Subsampling improves both speed (2.5x) and accuracy (5+ points). This is the rare optimization that makes everything better.

Phrase analogy results

The paper extends the analogy test to include phrases. Example phrase analogies:

City analogy
Montreal_Canadiens : Montreal :: Toronto : Toronto_Maple_Leafs
CEO analogy
Steve_Ballmer : Microsoft :: Tim_Cook : Apple
Country analogy
New_Zealand : Wellington :: Australia : Canberra

The best model (Skip-gram, NEG k=5, subsampling t=10−5, phrase detection, d=300, trained on 33 billion words) achieved 72% accuracy on word analogies and 66% on phrase analogies.

Method Comparison: Accuracy vs. Training Speed

Negative sampling achieves the best accuracy-speed tradeoff. Bubble size represents training speed (larger = faster).

Training scale: The final model was trained on an internal Google News dataset of 33 billion words — 5x larger than the original paper's 6B. With negative sampling (k=5) and subsampling (t=10-5), this massive dataset could be processed in a single day on a multi-core machine. The combination of all three techniques (NEG + subsampling + phrases) increased analogy accuracy from 54% to 72% — a relative improvement of 33%.

Choosing k: how many negative samples?

The paper's findings on the optimal number of negative samples k:

Dataset sizeRecommended kReasoning
Small (<100M words)15-50Fewer data points; need more negatives per step to get sufficient gradient signal
Medium (100M-1B)5-15Moderate data; k=5-10 gives a good speed-quality tradeoff
Large (>1B words)2-5Abundance of data means each step can be cheaper; the sheer volume of steps compensates

The intuition: with more data, you see more unique (center, context) pairs. Each pair teaches the model something. With less data, you see the same pairs repeatedly, so you need more negatives per pair to extract sufficient information from each step.

The 3/4 exponent experiment

The paper tested different noise distribution exponents:

Exponent αAnalogy accuracy %
0 (uniform)52
0.2556
0.558
0.7561
1.0 (unigram)57

The 3/4 power consistently outperformed both uniform and pure unigram sampling. This value has since been adopted as a standard across many contrastive learning methods.

Ablation study: contribution of each trick

The paper's results allow us to estimate the contribution of each innovation:

ConfigurationAccuracy %Contribution
Hier. softmax only54Baseline
+ Negative sampling (k=5)61+7 points (better training signal)
+ Subsampling (t=10−5)66+5 points (cleaner context)
+ Phrase detection68+2 points (multi-word concepts)
+ More data (33B vs 6B)72+4 points (scale always helps)

Negative sampling and subsampling together contribute +12 points — more than all other improvements combined. This reinforces the paper's central message: getting the training dynamics right (what to predict, what to ignore) matters more than model architecture.

Sensitivity to hyperparameters

The paper found the method to be remarkably robust to hyperparameter choices:

The most sensitive hyperparameter is the learning rate and its decay schedule. Setting the initial LR too high causes divergence; too low causes slow convergence and poor vectors.

Why NEG beats hierarchical softmax on Skip-gram

An interesting asymmetry: hierarchical softmax works best with CBOW, while negative sampling works best with Skip-gram. Why?

The impact on NLP research 2013-2018

The two Word2Vec papers (Mikolov et al., 2013a, 2013b) are among the most cited papers in AI/NLP. Together they accumulated over 50,000 citations by 2023. Their impact went far beyond word embeddings:

The timeline of influence: Word2Vec (2013) → GloVe (2014) → FastText (2017) → ELMo (2018) → BERT (2018) → GPT-2 (2019) → GPT-3 (2020) → ChatGPT (2022). The thread connecting them all: prediction tasks on raw text yield useful representations. Word2Vec proved the concept; each successor scaled it up.

Comparison: the three Word2Vec training objectives

The two Word2Vec papers introduced three training methods. Here is a detailed comparison of their mathematical properties:

PropertyFull SoftmaxHier. SoftmaxNeg. Sampling
ObjectiveCross-entropy over VProduct of log V sigmoids1 + k binary sigmoids
Exact probabilities?YesYesNo
Cost per stepO(V · d)O(log V · d)O(k · d)
Parameters updatedAll V output vectorslog V node vectorsk + 1 output vectors
Good for rare words?Yes (all words scored)Depends on tree (longer paths for rare)Yes (rare words appear as negatives)
Implicit factorizationSoftmax matrixBinary tree structureShifted PMI matrix
Best paired withSmall vocab onlyCBOWSkip-gram

Practical recommendations from the paper

The paper's final recommendations for practitioners:

  1. Use Skip-gram + negative sampling (k=5) for best quality on most tasks
  2. Use subsampling with t = 10−5 — it's free performance and speed
  3. Use phrase detection (2 passes) if your task involves multi-word entities
  4. Use d = 300 as the default dimension — higher gives diminishing returns
  5. Use window c = 5-10, with c = 10 for semantic tasks and c = 5 for syntactic
  6. More data is always better — prioritize data quantity over model tuning
  7. One epoch is usually sufficient for large corpora (>1B words)

Evaluating phrase vectors

Beyond the analogy task, the paper evaluated phrase vectors on several other benchmarks:

SemEval 2012 Task 2: Measuring relational similarity (e.g., "mason is to stone as carpenter is to wood"). The phrase-enhanced model achieved higher Spearman correlation with human judgments than word-only models.

Qualitative analysis: The paper showed that phrase vectors cluster sensibly:

These clusters emerge entirely from the text — no knowledge base, no labeled data, no human annotation. The distributional hypothesis works for phrases just as it works for individual words.

Impact on phrase-level NLP

Phrase detection and phrase vectors influenced subsequent work in several ways:

From phrases to subwords: Word2Vec's phrase detection used PMI to find multi-word expressions like "New_York." BPE (used by GPT, LLaMA, etc.) uses a similar idea at the character level — repeatedly merge the most frequent adjacent character pair. Both algorithms build a vocabulary of common sequences from corpus statistics. The intellectual lineage is direct: phrases → subwords → modern tokenization.

The negative sampling objective: a modern perspective

Looking back from 2024, negative sampling's most important contribution wasn't making Word2Vec fast — it was establishing contrastive learning as a paradigm. The insight that "learning by distinguishing positive from negative examples" is sufficient to learn useful representations has been verified in dozens of subsequent papers across NLP, vision, and multimodal AI.

Key lessons from negative sampling that generalize to modern contrastive learning:

The elegance of negative sampling is that it solves two problems at once: it makes training tractable (by avoiding the full softmax) and it provides a natural way to define what "similar" and "dissimilar" mean (through the noise distribution). This dual purpose is why the contrastive learning paradigm has proven so durable.

What was the combined effect of negative sampling, subsampling, and phrase detection on word analogy accuracy?

Chapter 9: Connections

What this paper built on

Word2Vec (Mikolov et al., 2013a): The original paper introducing CBOW and Skip-gram with hierarchical softmax. This follow-up provides the practical extensions that made Word2Vec the dominant word embedding method. See our companion veanor.

Noise Contrastive Estimation (Gutmann & Hyvarinen, 2010): The theoretical foundation for negative sampling. NCE provides the statistical guarantee; NEG simplifies it for the specific case of learning embeddings.

Pointwise Mutual Information: The phrase detection algorithm is based on PMI, a fundamental concept from information theory. Later work by Levy and Goldberg (2014) showed that Skip-gram with negative sampling implicitly factorizes a word-context PMI matrix.

What this paper enabled

GloVe (Pennington et al., 2014): Showed the connection between predictive methods (Word2Vec) and count-based methods (LSA). See our GloVe veanor.

FastText (Bojanowski et al., 2017): Extended Skip-gram with negative sampling to use character n-grams, enabling out-of-vocabulary word handling and better morphological representations.

Contrastive learning in deep learning: The negative sampling idea — learn by distinguishing positives from negatives — became foundational. SimCLR, MoCo, CLIP, and other contrastive methods all use the same principle: push positive pairs together and negative pairs apart, without computing a full normalization.

Word embeddings as features: Pre-trained Word2Vec/GloVe embeddings became the standard initialization for NLP models from 2013-2018, preceding the contextual embedding revolution (ELMo, BERT).

The implicit matrix factorization

Levy and Goldberg (2014) proved a remarkable theorem about Skip-gram NEG: when trained to convergence, the word and context vectors satisfy:

wi · w'j = PMI(i, j) − log k

where PMI(i, j) = log(P(i,j) / (P(i)P(j))) is the pointwise mutual information. This means Skip-gram NEG is implicitly factorizing a shifted PMI matrix. The "neural network" is doing matrix factorization in disguise.

This insight was hugely influential. It showed that the count-vs-predict debate was a false dichotomy — both approaches were optimizing closely related objectives. It also explained why word vectors have linear structure: PMI captures the log-odds of co-occurrence, and log-odds are additive by nature.

From Word2Vec to modern contrastive learning

The negative sampling paradigm can be summarized as: "learn representations by distinguishing real data from noise." This pattern appears throughout modern deep learning:

MethodYearPositive pairNegative pair
Word2Vec NEG2013(center word, context word)(center word, random word)
SimCLR2020(image, augmented image)(image, different image)
MoCo2020(query, matching key)(query, queue of keys)
CLIP2021(image, matching text)(image, wrong text)
DPO2023(prompt, preferred response)(prompt, dispreferred response)

Every row follows the same template: push positive pairs together, push negative pairs apart. Word2Vec NEG was the prototype.

The complete Word2Vec recipe (2013)

Combining everything from both papers, the full Word2Vec training recipe is:

1. Preprocess
Tokenize corpus. Build vocabulary (min count = 5). Compute word frequencies.
2. Phrase detection
2-4 passes of PMI-based bigram scoring. "New York" → "New_York".
3. Build tables
Compute subsampling probabilities. Build 100M-entry unigram table for negative sampling.
4. Train
Skip-gram with NEG (k=5). Linear LR decay 0.025 → 0.0001. Hogwild multi-threading.
5. Output
Save W_in as word vectors. Binary format for gensim, text format for general use.

The complete training pipeline in Python

python
# Complete Word2Vec pipeline using gensim
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Step 1: Load tokenized corpus
corpus = [["new", "york", "is", "a", "big", "city"],
          ["ice", "cream", "is", "cold"],
          # ... millions more sentences]

# Step 2: Phrase detection
phrases = Phrases(corpus, min_count=5, threshold=100)
phraser = Phraser(phrases)
corpus_with_phrases = [phraser[sent] for sent in corpus]
# ["new_york", "is", "a", "big", "city"]

# Step 3-4: Train
model = Word2Vec(
    corpus_with_phrases,
    vector_size=300,
    window=5,
    min_count=5,
    sg=1,           # Skip-gram
    negative=5,     # k = 5 negative samples
    sample=1e-5,    # subsampling threshold
    workers=8,
    epochs=5,
)

# Step 5: Save and use
model.save("word2vec.model")
model.wv.save_word2vec_format("vectors.bin", binary=True)

# Use phrase vectors
print(model.wv.most_similar("new_york"))
# [('los_angeles', 0.82), ('chicago', 0.79), ('manhattan', 0.77), ...]

Cheat sheet

Negative Sampling
Replace O(V) softmax with O(k) binary classification. k=5 for large data. Noise dist: f(w)3/4
Subsampling
Discard frequent words with P = 1 − √(t/f). Faster AND better. t = 10−5
Phrase Detection
PMI-based bigram scoring. score = (count_ij − δ) / (count_i · count_j). Iterative for n-grams
Compositionality
vec(A) + vec(B) ≈ vec(concept combining A and B). Works for compositional, not idiomatic phrases
Impact
54% → 72% analogy accuracy. Launched contrastive learning paradigm. Standard NLP features for 5+ years
How did the negative sampling idea influence modern deep learning beyond NLP?