Mikolov, Chen, Corrado, Dean (Google) — ICLR 2013

Word2Vec: Words as Vectors

Two shallow neural networks that learn to embed words into a continuous vector space where similar words cluster together — and king − man + woman ≈ queen.

Prerequisites: Basic linear algebra + Softmax + Gradient descent
10
Chapters
8+
Simulations

Chapter 0: The Problem

You are building a search engine. A user types "automobile" and you want to return documents about "cars." You are building a translation system and need to know that "king" and "queen" are related the way "man" and "woman" are related. You are building a sentiment analyzer and want to know that "excellent" and "superb" carry the same meaning.

In 2013, the dominant way to represent words in NLP was as atomic symbols. Each word was an index in a vocabulary. "Cat" was word #4,271. "Dog" was word #8,392. There was no notion of similarity between them. The number 4,271 tells you nothing about cats, and subtracting 8,392 from 4,271 tells you nothing about the relationship between cats and dogs.

This representation — a word as an ID — fails in three fundamental ways:

  1. No generalization. Learning something about "cat" teaches you nothing about "kitten." Every word is an island.
  2. No relationships. You cannot compute that "king" is to "queen" as "man" is to "woman." Relationships are invisible.
  3. Sparsity. With a vocabulary of 100,000 words, the representation is a 100,000-dimensional vector with a single 1. Most of this vector is wasted zeros.
Words as IDs vs. Words as Vectors

Left: words as isolated IDs with no structure. Right: words as points in a continuous space where proximity encodes meaning. Click "Word2Vec" to see what we want to achieve.

The dream: What if we could represent each word as a dense vector of, say, 300 real numbers? And what if words with similar meanings ended up with similar vectors? And what if the directions in this space encoded semantic relationships — so that the vector from "man" to "king" was the same as the vector from "woman" to "queen"? Mikolov et al. showed this is not only possible but shockingly easy to achieve with two shallow neural networks trained on raw text.
What is the fundamental problem with representing words as integer IDs?

Chapter 1: One-Hot Encoding

Before Word2Vec, the standard way to feed words into a neural network was the one-hot vector. If your vocabulary has V words, you represent each word as a vector of length V with exactly one entry set to 1 and all others set to 0.

For example, with a tiny vocabulary of 5 words:

"cat"
[1, 0, 0, 0, 0]
"dog"
[0, 1, 0, 0, 0]
"fish"
[0, 0, 1, 0, 0]
"king"
[0, 0, 0, 1, 0]
"queen"
[0, 0, 0, 0, 1]

The problem is now mathematically precise. The dot product of any two different one-hot vectors is exactly zero:

one_hot("cat") · one_hot("dog") = 0
one_hot("king") · one_hot("queen") = 0

"Cat" is as different from "dog" as "king" is from "fish." Every pair of words is equally dissimilar. This is the orthogonality problem: one-hot vectors are mutually orthogonal, so no similarity structure exists.

With a real vocabulary, this gets worse. Google's Word2Vec was trained with V = 1,000,000. That means each word is a million-dimensional vector with 999,999 zeros. This is wasteful, uninformative, and computationally prohibitive for any downstream model to process directly.

The key realization: A one-hot vector is just a lookup key. It says "I am word #4,271" but carries zero information about what word #4,271 means. What we want is a distributed representation — a dense vector where every dimension contributes a little bit of meaning. In such a space, "cat" and "kitten" would have similar vectors because they share semantic features: they're animals, they're small, they're pets.

Dimensionality comparison

The contrast between one-hot and dense representations is stark:

PropertyOne-hotDense embedding
DimensionsV (e.g., 1,000,000)d (e.g., 300)
Non-zero entries1All d
Memory per wordV × 4 bytes = 4 MBd × 4 bytes = 1.2 KB
Dot product meaning0 for all pairs (no info)Similarity score
Information contentlog2(V) bits (just an ID)~300 floats of learned semantics

The embedding idea

What if we learned a matrix W of shape (V × d), where d ≪ V? Each row of W is a d-dimensional vector for one word. To get the embedding for word i, we simply multiply:

ei = one_hot(i) · W = W[i, :]

The one-hot multiplication is just a row lookup. But if we learn the values in W by training on a useful task, the resulting rows will encode meaning. This is the core of Word2Vec: learn W by predicting words from their context.

One-Hot to Dense Embedding

Click a word to see its one-hot vector multiplied by the embedding matrix W. The result is a dense, low-dimensional vector. Drag the slider to change the embedding dimension d.

Embed dim d 4
python
import numpy as np

V = 10000   # vocabulary size
d = 300     # embedding dimension

# The embedding matrix (randomly initialized, then learned)
W = np.random.randn(V, d) * 0.01

# One-hot lookup is just row indexing
word_id = 42
one_hot = np.zeros(V)
one_hot[word_id] = 1

# These two are equivalent:
embedding_matmul = one_hot @ W        # shape: (d,)
embedding_lookup = W[word_id]          # shape: (d,) — just a row lookup
assert np.allclose(embedding_matmul, embedding_lookup)
Why is multiplying a one-hot vector by an embedding matrix equivalent to a simple row lookup?

Chapter 2: The Distributional Hypothesis

We want to learn dense word vectors where similar words end up close together. But what does "similar" mean? How do we decide that "cat" should be near "kitten" and far from "democracy"?

The answer comes from a beautifully simple idea from linguistics, articulated by J.R. Firth in 1957:

"You shall know a word by the company it keeps." Words that appear in similar contexts have similar meanings. "Cat" and "kitten" both appear near "pet," "fluffy," "purr," "milk." "Democracy" appears near "government," "vote," "election." The contexts define the meaning.

This is the distributional hypothesis. It says that semantic similarity can be derived entirely from patterns of co-occurrence in text. You do not need dictionaries, ontologies, or human annotations. You just need a lot of text.

Consider these sentences:

Sentence 1
The cat sat on the mat and purred.
Sentence 2
The kitten sat on the rug and purred.
Sentence 3
The dog sat on the mat and barked.
Sentence 4
The president sat at the desk and signed.

"Cat" and "kitten" share almost identical contexts (sat, on, the, mat/rug, purred). "Dog" shares most context with "cat" (sat, on, the, mat) but differs in one key word (barked vs. purred). "President" shares far less context. From this, a good algorithm should learn: cat ≈ kitten > dog > president, in terms of similarity.

The context window

Word2Vec operationalizes this by defining a context window of size c around each target word. For the sentence "the cat sat on the mat" with c = 2, the context of "sat" is {the, cat, on, the} — the two words before and after.

Sliding Context Window

Watch the context window slide across a sentence. The center word is the target (orange); the highlighted words are the context (teal). Drag the slider to change window size c.

Window c 2
Target position 3

Word2Vec's genius is turning the distributional hypothesis into a prediction task. Instead of counting co-occurrences (the traditional approach), it predicts words from their context. The prediction errors shape the embedding vectors. Two words that can substitute for each other in many contexts end up with similar vectors — because similar vectors make similar predictions.

From counting to predicting

Before Word2Vec, the distributional hypothesis was applied through count-based methods: build a large matrix of word co-occurrence counts, then reduce its dimensionality with SVD. This worked but had limitations:

Word2Vec showed that you can skip the matrix entirely. Instead of counting how often words co-occur and then compressing the counts, just train a model to predict co-occurrence. The embeddings learned by prediction turn out to have richer structure than those learned by counting — specifically, the linear analogy structure (king − man + woman ≈ queen) that made Word2Vec famous.

Window size matters

The window size c controls what kind of similarity the vectors capture:

Window sizeType of similarityExample neighbors of "dog"
c = 1-2 (narrow)Syntacticcat, horse, bird (same syntactic role)
c = 5-10 (wide)Semanticpuppy, pet, canine (same meaning)
c = 15+ (very wide)Topicalveterinarian, leash, breed (same topic)

Narrow windows find words that can substitute for each other in a sentence. Wide windows find words that appear in the same kinds of documents. The paper's default of c = 5-10 captures a mix of syntactic and semantic similarity.

Counting context features: a manual exercise

To build intuition, let's manually trace what Word2Vec would learn from a tiny corpus. Consider just four sentences:

  1. "the cat sat on the mat"
  2. "the cat ate the fish"
  3. "the dog sat on the rug"
  4. "the dog ate the bone"

With c = 1 (looking at only the immediately adjacent words), the context for each word includes:

Target wordObserved contexts (c=1)
catthe(4x), sat(1x), ate(1x)
dogthe(4x), sat(1x), ate(1x)
satcat(1x), dog(1x), on(2x)
atecat(1x), dog(1x), the(2x)
matthe(1x)
fishthe(1x)

"Cat" and "dog" have identical context distributions! They both appear after "the" and before "sat" and "ate." Word2Vec would give them nearly identical vectors. Meanwhile, "sat" and "ate" also have similar contexts (both appear after "cat" and "dog"), so they'd be close too — but slightly different because "sat" co-occurs with "on" while "ate" co-occurs with "the."

This simple example illustrates the distributional hypothesis in action: words with the same syntactic and semantic roles get the same contexts, and therefore the same vectors.

According to the distributional hypothesis, why would "cat" and "kitten" get similar word vectors?

Chapter 3: The CBOW Architecture

The first of Word2Vec's two models is Continuous Bag-of-Words (CBOW). The task: given the context words around a gap, predict the missing center word.

For the sentence "the cat sat on the mat" with window c = 2, if the target is "sat" (position 2), the training example is:

Input (context)
{"the", "cat", "on", "the"}
↓ predict
Output (target)
"sat"

The architecture is a single hidden layer neural network:

Step 1: Look up context embeddings

Each context word is mapped to its embedding using the input matrix Win (shape V × d). This is just a row lookup for each context word.

Step 2: Average the context vectors

CBOW takes the average of all context word embeddings. This is the "bag-of-words" part — order does not matter, only which words are present.

h = (1 / 2c) ∑i ∈ context Win[i, :]

Where h is the hidden representation (shape d).

Step 3: Predict the target word

The hidden vector h is multiplied by an output matrix Wout (shape d × V) to produce scores for every word in the vocabulary, then softmax converts these to probabilities:

scores = h · Wout
P(wj | context) = exp(scorej) / ∑k=1V exp(scorek)

Step 4: Compute the loss

The loss is the negative log-probability of the true target word:

L = −log P(wtarget | context)
Why "continuous"? Earlier bag-of-words models used discrete features (word counts). CBOW uses continuous embeddings — dense real-valued vectors — hence the name. The "bag" part means it ignores word order within the context window.
CBOW Forward Pass

Context words are embedded, averaged into a hidden vector h, then projected to a probability distribution over the vocabulary. Click "Step" to walk through each stage.

Click Step to begin
python
import numpy as np

def cbow_forward(context_ids, W_in, W_out):
    """
    context_ids: list of int, shape (2*c,) — IDs of context words
    W_in:  (V, d) — input embedding matrix
    W_out: (d, V) — output projection matrix
    Returns: probability distribution over vocab, shape (V,)
    """
    # Step 1: look up embeddings for each context word
    embeddings = W_in[context_ids]   # shape: (2c, d)

    # Step 2: average them
    h = embeddings.mean(axis=0)     # shape: (d,)

    # Step 3: project to vocabulary scores
    scores = h @ W_out               # shape: (V,)

    # Step 4: softmax
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()

    return probs  # P(w | context) for each word w
Two matrices, not one. CBOW uses two separate matrices: Win for input embeddings (context words) and Wout for output predictions (target words). After training, we typically keep only Win as the final word vectors. Some practitioners average Win and Wout for slightly better results.

Worked example: CBOW gradient update

Let's trace a complete forward and backward pass with a tiny vocabulary of 3 words and embedding dimension d = 2.

Vocabulary: {cat = 0, sat = 1, mat = 2}. Context = {cat, mat}, target = sat. Window c = 1.

Suppose our current matrices are:

Win = [[0.5, −0.2], [0.1, 0.8], [−0.3, 0.4]]
Wout = [[0.3, 0.6], [−0.1, 0.5], [0.2, −0.3]]

Forward pass:

  1. Look up context embeddings: vcat = [0.5, −0.2], vmat = [−0.3, 0.4]
  2. Average: h = ([0.5, −0.2] + [−0.3, 0.4]) / 2 = [0.1, 0.1]
  3. Scores: s = Wout · h = [0.09, 0.04, −0.01]
  4. Softmax: exp(s) = [1.094, 1.041, 0.990], sum = 3.125, P = [0.350, 0.333, 0.317]
  5. Loss: L = −log P(sat) = −log(0.333) = 1.099

Backward pass:

  1. Error vector: e = P − one_hot(sat) = [0.350, −0.667, 0.317]
  2. Gradient for Wout[sat]: ∂L/∂v'sat = e[1] · h = −0.667 · [0.1, 0.1] = [−0.067, −0.067]. This pushes the "sat" output vector toward h.
  3. Gradient for h: ∂L/∂h = WoutT · e = sum of error-weighted output vectors
  4. Distribute ∂L/∂h equally to each context word's input embedding
python
# Complete CBOW training step with backpropagation
import numpy as np

V, d = 3, 2
W_in = np.array([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]])
W_out = np.array([[0.3, 0.6], [-0.1, 0.5], [0.2, -0.3]])

context_ids = [0, 2]  # cat, mat
target_id = 1         # sat
lr = 0.1

# Forward
h = W_in[context_ids].mean(axis=0)           # [0.1, 0.1]
scores = W_out @ h                             # [0.09, 0.04, -0.01]
exp_s = np.exp(scores - scores.max())
probs = exp_s / exp_s.sum()                    # [0.350, 0.333, 0.317]
loss = -np.log(probs[target_id])               # 1.099

# Backward
error = probs.copy()
error[target_id] -= 1                         # [0.35, -0.67, 0.32]

# Gradient for output matrix: outer product of error and h
dW_out = np.outer(error, h)                    # (3, 2)
W_out -= lr * dW_out

# Gradient for hidden: project error back through output matrix
dh = W_out.T @ error                           # (2,)
# Distribute equally to all context words (CBOW averages, so grad is shared)
for cid in context_ids:
    W_in[cid] -= lr * dh / len(context_ids)

print(f"Loss before: {loss:.3f}")  # 1.099
print(f"P(sat|context) before: {probs[target_id]:.3f}")  # 0.333

What the gradient does geometrically

After one gradient step, three things happen simultaneously:

Over millions of training examples, this push-pull dynamic shapes the embedding space. Words that frequently predict each other (because they share contexts) have their vectors pulled together. Words that should not predict each other have their vectors pushed apart.

In CBOW, what is the hidden representation h?

Chapter 4: The Skip-gram Architecture

Skip-gram flips CBOW on its head. Instead of predicting the center word from the context, it predicts each context word from the center word. Same data, opposite direction.

For the same sentence "the cat sat on the mat" with window c = 2 and target "sat":

Input (center)
"sat"
↓ predict each of
Output (context)
"the", "cat", "on", "the"

This generates 2c training pairs from each word position. For c = 2, each center word produces 4 pairs: (sat, the), (sat, cat), (sat, on), (sat, the).

The architecture

Even simpler than CBOW — no averaging needed:

Step 1: Look up the embedding of the center word from Win:

h = Win[wcenter, :]

Step 2: For each context position, compute the probability of each vocabulary word using the output matrix Wout:

P(wj | wcenter) = exp(Wout[j, :] · h) / ∑k=1V exp(Wout[k, :] · h)

Step 3: The loss sums over all context positions:

L = −∑j ∈ context log P(wj | wcenter)

Why Skip-gram often works better

Skip-gram treats each (center, context) pair as a separate training example, so it generates more training signal per word. For a sentence of length T with window c, Skip-gram produces ~2cT training pairs versus T for CBOW. This extra data is especially helpful for rare words: a rare word appears as a center word 2c times more often in Skip-gram's training data (once per context position) than in CBOW's (once total).

Worked example: Skip-gram training pairs

For the sentence "the cat sat on the mat" with c = 2, here are all the training pairs generated by Skip-gram:

Center wordContext wordDistance
"the" (pos 0)"cat" (pos 1)1
"the" (pos 0)"sat" (pos 2)2
"cat" (pos 1)"the" (pos 0)1
"cat" (pos 1)"sat" (pos 2)1
"cat" (pos 1)"on" (pos 3)2
"sat" (pos 2)"the" (pos 0)2
"sat" (pos 2)"cat" (pos 1)1
"sat" (pos 2)"on" (pos 3)1
"sat" (pos 2)"the" (pos 4)2
... and so on for "on", "the", "mat"

This 6-word sentence generates approximately 6 × 2 × 2 = 24 training pairs. A corpus of 6 billion words with c = 5 generates ~60 billion pairs. Each pair provides a gradient signal that shapes the embedding space.

The independence assumption

Skip-gram makes a strong simplifying assumption: each context word is predicted independently. The objective sums over context positions without any interaction:

L = −∑j ∈ context log P(wj | wcenter)

This means Skip-gram cannot capture the fact that "cat" and "mat" both being in the context is more informative than either alone. This is a deliberate simplification for computational efficiency — and it works remarkably well despite the approximation.

CBOW vs. Skip-gram: CBOW is faster to train (one prediction per window position) and works well for frequent words. Skip-gram is slower (2c predictions per position) but produces better embeddings for rare words. The paper found Skip-gram slightly outperformed CBOW on semantic tasks, while CBOW was better on syntactic tasks. In practice, Skip-gram became the more popular choice.
CBOW vs. Skip-gram

Toggle between architectures to see how data flows differently. CBOW: many-to-one (context predicts center). Skip-gram: one-to-many (center predicts context).

python
def skipgram_forward(center_id, W_in, W_out):
    """
    center_id: int — ID of center word
    W_in:  (V, d) — input embedding matrix
    W_out: (V, d) — output embedding matrix (note: transposed vs CBOW)
    Returns: probability distribution over vocab, shape (V,)
    """
    # Step 1: look up center word embedding
    h = W_in[center_id]          # shape: (d,)

    # Step 2: score every vocabulary word
    scores = W_out @ h            # shape: (V,) — dot product of each output vector with h

    # Step 3: softmax
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()

    return probs  # P(w_context | w_center) for each word

# Training: for each center word, predict each context word
for center_pos in range(len(sentence)):
    center_id = sentence[center_pos]
    for offset in range(-c, c+1):
        if offset == 0: continue
        context_pos = center_pos + offset
        if 0 <= context_pos < len(sentence):
            context_id = sentence[context_pos]
            probs = skipgram_forward(center_id, W_in, W_out)
            loss = -np.log(probs[context_id] + 1e-10)
How does Skip-gram differ from CBOW in its prediction task?

Chapter 5: The Training Objective

Both CBOW and Skip-gram end with a softmax over the entire vocabulary. Let's understand exactly what this means computationally — and why it becomes a bottleneck.

The full softmax

For Skip-gram, the probability of context word wO given center word wI is:

P(wO | wI) = exp(v'wO · vwI) / ∑j=1V exp(v'j · vwI)

Here, vw is the input vector (from Win) and v'w is the output vector (from Wout) for word w. The dot product v'wO · vwI measures how well the center word predicts this particular context word.

The computational cost

The denominator sums over all V words. For V = 1,000,000, every single training step requires computing 1 million dot products and exponentials. With a corpus of billions of words and multiple context positions per word, this becomes impossibly slow.

Cost per step = O(V · d)

For V = 106 and d = 300, that is 300 million floating-point operations just for the softmax — per training example. With billions of training examples, full softmax training would take years.

The gradient

The gradient of the loss with respect to the output vector v'j is:

∂L/∂v'j = (P(j | wI) − δj=wO) · vwI

Where δj=wO is 1 if j is the actual context word and 0 otherwise. This gradient has a beautiful interpretation: for the correct context word (j = wO), the gradient pushes v'j closer to vwI by an amount proportional to 1 − P(j | wI). For all other words, the gradient pushes v'j away from vwI by an amount proportional to P(j | wI).

The softmax bottleneck: Every training step must update the output vectors of all V words, even though only one word is the correct answer. With V = 1 million, 999,999 words get a "push away" gradient. This is spectacularly wasteful. Chapter 6 covers the tricks that make training feasible: hierarchical softmax and negative sampling.

Numerical stability: the log-sum-exp trick

Computing softmax naively causes overflow. If any score is large (say 500), exp(500) = ∞. The standard trick: subtract the maximum score before exponentiating.

P(wj) = exp(sj − max(s)) / ∑k exp(sk − max(s))

This gives the same probabilities (the max cancels in the ratio) but keeps all exponentials in a numerically safe range. Every softmax implementation uses this trick.

Why not just use cross-entropy on one-hot labels?

The loss −log P(wtarget | wcenter) is exactly the cross-entropy between the one-hot label vector and the softmax output. This is the same loss used in standard classification. Word2Vec is classification — it classifies the center word's context into one of V categories. The word vectors are a byproduct of learning to classify well.

This insight — that useful representations emerge as byproducts of prediction tasks — is one of the most important ideas in modern ML. It foreshadows BERT (predict masked tokens), GPT (predict next token), and contrastive learning (predict matching pairs).

Worked example: softmax step by step

Vocabulary: V = 4 words: {cat=0, dog=1, sat=2, mat=3}. Center word: "sat" (id=2). True context: "cat" (id=0). Embedding dim d = 2.

Current vectors:

vsat = [0.5, −0.3]   (input)
v'cat = [0.4, 0.2],   v'dog = [0.1, −0.5],   v'sat = [−0.2, 0.1],   v'mat = [0.3, 0.6]

Dot products (scores):

scat = 0.5·0.4 + (−0.3)·0.2 = 0.20 − 0.06 = 0.14
sdog = 0.5·0.1 + (−0.3)·(−0.5) = 0.05 + 0.15 = 0.20
ssat = 0.5·(−0.2) + (−0.3)·0.1 = −0.10 − 0.03 = −0.13
smat = 0.5·0.3 + (−0.3)·0.6 = 0.15 − 0.18 = −0.03

Softmax (subtract max = 0.20 for stability):

exp([0.14−0.20, 0, −0.13−0.20, −0.03−0.20]) = exp([−0.06, 0, −0.33, −0.23])
= [0.942, 1.000, 0.719, 0.795]
Sum = 3.456
P = [0.273, 0.289, 0.208, 0.230]

Loss:

L = −log P(cat) = −log(0.273) = 1.299

The model gives "cat" a 27.3% probability. Not terrible for random initialization (uniform would be 25%), but there's room for improvement. The gradient will push v'cat toward vsat and push the other output vectors away.

Gradient for each output vector v'j:

∂L/∂v'cat = (0.273 − 1) · vsat = −0.727 · [0.5, −0.3] = [−0.364, 0.218]
∂L/∂v'dog = (0.289 − 0) · vsat = 0.289 · [0.5, −0.3] = [0.145, −0.087]
∂L/∂v'sat = (0.208 − 0) · vsat = 0.208 · [0.5, −0.3] = [0.104, −0.062]
∂L/∂v'mat = (0.230 − 0) · vsat = 0.230 · [0.5, −0.3] = [0.115, −0.069]

After subtracting lr · gradient: v'cat moves 0.036 toward vsat. The other three output vectors each move slightly away from vsat. Over millions of such updates, the geometry of the embedding space takes shape.

Softmax Computation Cost

The denominator requires summing over all V words. Drag the vocabulary size to see how cost scales linearly. The red zone marks where full softmax becomes impractical.

Vocab V 10K
python
def skipgram_loss_and_gradient(center_id, context_id, W_in, W_out):
    """Full softmax loss and gradient for one (center, context) pair."""
    v_in = W_in[center_id]           # (d,) — center embedding
    scores = W_out @ v_in             # (V,) — dot product with every output vector

    # Softmax — this sum over V is the bottleneck
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()  # (V,)

    # Loss: negative log-prob of the true context word
    loss = -np.log(probs[context_id] + 1e-10)

    # Gradient w.r.t. output vectors
    grad_out = probs.copy()             # (V,) — P(j|center) for all j
    grad_out[context_id] -= 1           # subtract 1 for the true word
    dW_out = np.outer(grad_out, v_in)  # (V, d) — updates ALL V output vectors

    # Gradient w.r.t. input vector of center word
    dv_in = W_out.T @ grad_out          # (d,)

    return loss, dv_in, dW_out
Why is the full softmax computationally expensive for Word2Vec?

Chapter 6: Efficiency Tricks

The original Word2Vec paper introduced hierarchical softmax as the first solution to the softmax bottleneck. The follow-up paper (Mikolov et al., 2013b) introduced negative sampling, which became more popular. We cover hierarchical softmax here; negative sampling gets its own veanor.

Hierarchical softmax

Instead of a flat softmax over V words, arrange the vocabulary as a binary tree. Each word is a leaf. To compute the probability of a word, you walk from the root to that leaf, making a binary (left/right) decision at each internal node.

At each internal node n, you compute:

P(go left at node n) = σ(v'n · h)
P(go right at node n) = 1 − σ(v'n · h)

The probability of reaching a word w at depth L is the product of the binary decisions along the path:

P(w) = ∏j=1L σ(sj · v'nj · h)

where sj ∈ {−1, +1} encodes whether you go left or right at node j.

Why is this fast?

A balanced binary tree has depth log2(V). So instead of summing over V terms, you compute log2(V) sigmoids. For V = 1,000,000:

Full softmax: O(V) = 1,000,000 operations
Hierarchical softmax: O(log V) = 20 operations

That is a 50,000x speedup.

Huffman tree

The paper uses a Huffman tree based on word frequency, not a balanced binary tree. Frequent words get shorter paths (fewer decisions), rare words get longer paths. Since frequent words appear more often in training, this minimizes the average number of operations per step.

Hierarchical softmax is not an approximation. It computes the exact probability distribution over words — it's just organized as a tree of binary classifiers instead of a flat softmax. The probabilities still sum to 1 across all leaves. The tree structure only changes the computational complexity, not the mathematical result.

Worked example: probability computation

Consider a tiny Huffman tree with 4 words: the(common), cat, sat, zinc(rare). In a Huffman tree, "the" gets a short path (1 decision) and "zinc" gets a long path (3 decisions).

Suppose the hidden vector is h = [0.5, 0.3] and the tree nodes have vectors:

Then:

P("the") = P(left at root) = σ(0.34) = 0.584
P("cat") = P(right at root) × P(left at right child) = (1 − 0.584) × 0.530 = 0.416 × 0.530 = 0.220

Only 1 sigmoid for "the" (most common word) vs. 2 for "cat." The Huffman tree ensures the average path length is minimized based on word frequencies — an optimal prefix code.

Training with hierarchical softmax

During backpropagation, only the node vectors along the path from root to the target word receive gradient updates. For a target word at depth L, only L node vectors are updated (plus the center word's input vector). Compared to full softmax which updates all V output vectors, this is a massive saving.

Implementation detail: the unigram table

For efficient sampling of negative words, the original Word2Vec implementation precomputes a large "unigram table" — an array of 100 million entries where each word appears proportionally to f(w)3/4. To sample a negative word, generate a random integer index into this table and look up the word. This is O(1) per sample, much faster than computing cumulative distributions.

python
# Build unigram table for O(1) negative sampling
import numpy as np

def build_unigram_table(word_freqs, table_size=100_000_000, power=0.75):
    """Build a lookup table for O(1) negative sampling."""
    powered = word_freqs ** power
    norm = powered / powered.sum()

    table = np.zeros(table_size, dtype=np.int32)
    idx = 0
    cumulative = 0.0

    for word_id in range(len(word_freqs)):
        cumulative += norm[word_id]
        while idx < table_size and idx / table_size < cumulative:
            table[idx] = word_id
            idx += 1

    return table

# Sampling is now O(1):
# neg_word = table[random.randint(0, table_size)]

Hyperparameter summary

HyperparameterPaper defaultEffect
Embedding dim d300Quality increases up to ~300, then plateaus
Window size c5 (SG), 5 (CBOW)Larger = more semantic; smaller = more syntactic
Learning rate0.025 (SG), 0.05 (CBOW)Linearly decayed during training
Min word count5Words below this threshold are discarded
Negative samples k5-20More = slower but better for small data
Subsampling t10−5Threshold for discarding frequent words
Training epochs1-5More epochs on small data; 1 pass on large data
Hierarchical Softmax Tree

A binary tree over the vocabulary. To compute P("cat"), follow the path from root to the "cat" leaf. Each node makes a binary decision using a sigmoid. Click a leaf word to highlight its path.

python
def hierarchical_softmax(word_id, h, tree_vectors, tree_paths, tree_codes):
    """
    word_id:      target word
    h:            hidden vector (d,) — from center/context embedding
    tree_vectors: (num_internal_nodes, d) — learned vectors at each node
    tree_paths:   list of lists — path from root to each word
    tree_codes:   list of lists — left(0)/right(1) at each step
    """
    path = tree_paths[word_id]     # e.g., [0, 3, 7] — internal node IDs
    codes = tree_codes[word_id]    # e.g., [1, 0, 1] — go right, left, right

    log_prob = 0.0
    for node_id, code in zip(path, codes):
        v_node = tree_vectors[node_id]      # (d,)
        dot = v_node @ h                     # scalar
        sig = 1.0 / (1.0 + np.exp(-dot))      # sigmoid

        if code == 0:   # go left
            log_prob += np.log(sig + 1e-10)
        else:            # go right
            log_prob += np.log(1 - sig + 1e-10)

    return log_prob   # log P(word | h) — only log2(V) operations!
How much speedup does hierarchical softmax provide over full softmax for a vocabulary of 1 million words?

Chapter 7: Vector Arithmetic

This is the most surprising result of the paper — the reason Word2Vec captured the imagination of the entire ML community. The learned word vectors exhibit linear algebraic structure that encodes semantic relationships.

The most famous example:

vec("king") − vec("man") + vec("woman") ≈ vec("queen")

This is not a cherry-picked curiosity. The paper found systematic, consistent vector relationships across many semantic categories:

RelationshipExampleVector operation
Genderking : queen :: man : womanking − man + woman ≈ queen
Country-CapitalFrance : Paris :: Japan : TokyoFrance − Paris + Tokyo ≈ Japan
Tensewalking : walked :: swimming : swamwalking − walked + swam ≈ swimming
Comparativebig : bigger :: small : smallerbig − bigger + smaller ≈ small
Pluralcar : cars :: apple : applescar − cars + apples ≈ apple

How this is computed

Given an analogy "a is to b as c is to ?", we compute:

vec(?) = vec(b) − vec(a) + vec(c)

Then we find the word whose vector is closest to vec(?) by cosine similarity:

answer = argmaxw cos(vec(w), vec(?)) = argmaxw (vec(w) · vec(?)) / (||vec(w)|| · ||vec(?)||)

(excluding a, b, c themselves from the search).

Why does this work?

The training objective forces the model to capture co-occurrence patterns. "King" and "queen" appear in similar contexts (royal, throne, crown) but differ in gender-related contexts. This difference is captured as a consistent direction in the vector space — the same direction that separates "man" from "woman."

More formally, if a direction d encodes "maleness" then:

vec("king") ≈ vec("royalty") + d
vec("queen") ≈ vec("royalty") − d
vec("man") ≈ vec("person") + d
vec("woman") ≈ vec("person") − d

Subtracting "man" from "king" removes the "person + d" component and leaves "royalty − person". Adding "woman" gives "royalty − person + person − d = royalty − d = queen". The direction d cancels in the subtraction and reappears in the addition.

The deeper lesson: Linear relationships in word vectors emerge because the training objective (predicting context) forces the model to encode semantic features as directions in the embedding space. Each dimension is not explicitly assigned a meaning, but the space self-organizes so that meaningful relationships correspond to vector arithmetic. This was the first strong evidence that simple prediction tasks on raw text could discover rich semantic structure automatically.

Cosine similarity vs. Euclidean distance

Word2Vec uses cosine similarity rather than Euclidean distance to measure word similarity. Why?

Euclidean distance is sensitive to vector magnitude. A word that appears frequently tends to have a larger-magnitude vector (more training updates push it further from the origin). Cosine similarity normalizes by magnitude, measuring only the angle between vectors:

cos(u, v) = (u · v) / (||u|| · ||v||) = ∑i uivi / (√∑iui2 · √∑ivi2)

Range: −1 (opposite directions) to +1 (same direction). Two words with cosine similarity 0.9 are nearly interchangeable in many contexts. Words with similarity 0.1 are semantically distant.

Typical similarity values

Word pairCosine similarityInterpretation
cat — kitten0.89Very similar (same concept)
cat — dog0.76Similar (both pets/animals)
cat — car0.21Unrelated
king — queen0.73Related (both royalty, different gender)
king — man0.62Moderately related (both male, different status)
good — bad0.72High! Antonyms are close because they appear in similar contexts
Cosine Similarity in Practice

Real cosine similarities between word pairs in Word2Vec. Notice that "good" and "bad" (antonyms) are surprisingly similar because they appear in identical syntactic contexts.

Antonyms are close. A surprising property: "good" and "bad" have high cosine similarity (~0.72) because they appear in nearly identical syntactic contexts ("the movie was good/bad," "it's not good/bad"). The distributional hypothesis captures substitutability, not opposition. This is a well-known limitation of static word vectors — they conflate different types of semantic relationships.
Word Vector Analogy Explorer

Explore word analogies in 2D. The orange arrow (king − man) is the "royalty" direction. Adding it to "woman" lands near "queen." Click different analogy types to see other relationships.

python
import numpy as np

def analogy(a, b, c, word_vectors, vocab):
    """
    Solve: a is to b as c is to ?
    word_vectors: dict mapping word -> np.array of shape (d,)
    vocab: list of all words
    """
    # Compute the target vector: b - a + c
    target = word_vectors[b] - word_vectors[a] + word_vectors[c]

    # Normalize the target
    target = target / (np.linalg.norm(target) + 1e-10)

    best_word, best_sim = None, -1
    exclude = {a, b, c}

    for word in vocab:
        if word in exclude:
            continue
        vec = word_vectors[word]
        vec_norm = vec / (np.linalg.norm(vec) + 1e-10)
        sim = np.dot(target, vec_norm)  # cosine similarity
        if sim > best_sim:
            best_sim = sim
            best_word = word

    return best_word, best_sim

# Example: king - man + woman = ?
answer, score = analogy("man", "king", "woman", vectors, vocab)
print(f"{answer} (similarity: {score:.3f})")  # queen (similarity: 0.891)
Why does vec("king") − vec("man") + vec("woman") ≈ vec("queen")?

Chapter 8: Evaluation

How do you measure whether word vectors are "good"? The paper introduced a systematic evaluation based on word analogies and compared against prior methods.

The analogy test set

The paper created a test set of 8,869 semantic analogies and 10,675 syntactic analogies. An analogy is answered correctly only if the exact word is the nearest neighbor — no partial credit.

Category# QuestionsExample
Common capitals506Athens : Greece :: Baghdad : Iraq
All capitals4,524Abuja : Nigeria :: Accra : Ghana
Currency866Algeria : dinar :: Angola : kwanza
City-in-state2,467Chicago : Illinois :: Houston : Texas
Family506boy : girl :: brother : sister
Adjective-adverb992apparent : apparently :: rapid : rapidly
Opposite812aware : unaware :: possible : impossible
Comparative1,332bad : worse :: big : bigger
Superlative1,056bad : worst :: big : biggest
Present-past tense1,560dance : danced :: fly : flew
Nationality1,599Australia : Australian :: France : French
Plural nouns1,332banana : bananas :: bird : birds
Plural verbs870decrease : decreases :: go : goes

Key results

The paper compared CBOW and Skip-gram against previous approaches (NNLM, RNNLM) across dimensions d and training data sizes:

ModelDimTraining WordsSemantic Acc %Syntactic Acc %
NNLM1006B2353
RNNLM640320M936
CBOW3001.6B1653
Skip-gram3001.6B5559
Skip-gram10006B6665

Skip-gram with d=1000 on 6 billion words achieved 66% accuracy on semantic analogies and 65% on syntactic analogies — dramatically better than any prior method. The improvement came from two factors: the more efficient architecture allowed training on much larger datasets, and the Skip-gram objective captured semantic structure more effectively.

What the paper discovered about scaling

Three scaling laws emerged:

  1. More data beats bigger models. 300-dimensional vectors on 6B words beat 600-dimensional vectors on 1B words.
  2. Diminishing returns on dimension. Going from 100 to 300 dimensions helped a lot. Going from 300 to 1000 helped less.
  3. Training time matters. More epochs on the same data improved results, up to a point. But more data was always better than more epochs.
Accuracy vs. Training Corpus Size

Analogy accuracy improves with more training data. The curve shows diminishing returns but no plateau — more data always helps. Drag to see how dimension and data interact.

Dimension d 300
Training details: The paper used Google News data (6 billion tokens, 1 million word vocabulary). Training Skip-gram with d=300 on this data took about a day on a single machine using the hierarchical softmax optimization. By comparison, the NNLM baseline (Bengio et al., 2003) needed weeks on a much smaller dataset. The 1000x speedup came from the shallow architecture and hierarchical softmax — no hidden layer nonlinearity, just embeddings.

Why NNLM and RNNLM were slower

The Neural Network Language Model (NNLM) of Bengio et al. used a full hidden layer with nonlinearities:

NNLM
Concatenate context embeddings → Hidden layer (tanh) → Output softmax over V
Word2Vec
Average/lookup context embeddings → Output (hier. softmax or NEG)

The hidden layer in NNLM requires a matrix multiplication of size (c·d × H) plus a nonlinearity, plus another matrix multiplication (H × V). Word2Vec removes the hidden layer entirely — the input embeddings connect directly to the output. This is why Word2Vec is called a "shallow" model. It has exactly zero hidden layers.

Complete PyTorch implementation of Skip-gram with NEG

python
import torch
import torch.nn as nn

class SkipGramNEG(nn.Module):
    def __init__(self, V, d):
        super().__init__()
        self.W_in = nn.Embedding(V, d)    # input embeddings
        self.W_out = nn.Embedding(V, d)   # output embeddings
        # Initialize small random values
        nn.init.uniform_(self.W_in.weight, -0.5/d, 0.5/d)
        nn.init.zeros_(self.W_out.weight)

    def forward(self, center, context, negatives):
        """
        center:    (batch,) — center word IDs
        context:   (batch,) — true context word IDs
        negatives: (batch, k) — negative sample IDs
        Returns: scalar loss
        """
        v_in = self.W_in(center)          # (batch, d)
        v_pos = self.W_out(context)       # (batch, d)
        v_neg = self.W_out(negatives)     # (batch, k, d)

        # Positive: push context toward center
        pos_dot = (v_in * v_pos).sum(dim=1)         # (batch,)
        pos_loss = -torch.nn.functional.logsigmoid(pos_dot)

        # Negative: push noise away from center
        neg_dot = torch.bmm(v_neg, v_in.unsqueeze(2)).squeeze()  # (batch, k)
        neg_loss = -torch.nn.functional.logsigmoid(-neg_dot).sum(dim=1)

        return (pos_loss + neg_loss).mean()

The entire model is two embedding layers and a few dot products. No hidden layers, no convolutions, no attention. This simplicity is what allowed training on billions of words in 2013 — years before GPUs became standard for NLP.

What Word2Vec is NOT

Many descriptions of Word2Vec call it a "deep learning" model. This is misleading. Word2Vec has zero hidden layers. It is a log-linear model — the simplest possible neural network. Its power comes not from depth or complexity but from:

  1. The prediction task: Predicting context from words forces the embeddings to encode semantic structure.
  2. Scale: Training on billions of words — enabled by the shallow architecture and efficient softmax alternatives — gives the model enough signal to discover fine-grained relationships.
  3. The embedding space: The continuous vector space allows algebraic operations (addition, subtraction, cosine similarity) that are impossible with discrete representations.

The lesson for modern ML: sometimes the biggest breakthroughs come from simplifying models and training them on more data, not from adding complexity.

gensim: the practical tool

In practice, most people use the gensim Python library to train and use Word2Vec, not a custom implementation:

python
from gensim.models import Word2Vec

# Train on your own corpus
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]
model = Word2Vec(
    sentences,
    vector_size=300,    # embedding dimension
    window=5,            # context window size
    min_count=5,         # ignore rare words
    sg=1,                # 1 = Skip-gram, 0 = CBOW
    negative=5,          # number of negative samples
    sample=1e-5,         # subsampling threshold
    workers=4,           # parallel threads
    epochs=5,
)

# Use the trained model
vec = model.wv["cat"]                       # (300,)
sim = model.wv.similarity("cat", "dog")    # 0.76
neighbors = model.wv.most_similar("king")  # [("queen", 0.73), ...]

# Analogy: king - man + woman = ?
result = model.wv.most_similar(
    positive=["king", "woman"],
    negative=["man"]
)
print(result[0])  # ("queen", 0.89)

# Load pre-trained Google News vectors (1.6 GB)
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",
    binary=True
)
Pre-trained Word2Vec: Google released pre-trained Word2Vec vectors (3 million words, d=300, trained on 100B words of Google News). These vectors were the standard NLP initialization from 2013-2018. Download size: 1.5 GB binary format. Many researchers never trained their own Word2Vec — they just loaded the pre-trained vectors and used them as features.

Using Word2Vec for downstream tasks

Word2Vec vectors were used as features in virtually every NLP system from 2013-2018:

TaskHow Word2Vec was usedImprovement over one-hot
Sentiment analysisAverage word vectors over document, feed to classifier+5-10% accuracy
Named entity recognitionConcatenate word vector with other features for CRF+2-3 F1 points
Machine translationInitialize encoder/decoder embeddingsFaster convergence, +1-2 BLEU
Question answeringEmbed question and passages, compute similarity+5-15% accuracy
Relation extractionWord vectors as input features to CNN/LSTM+3-5 F1 points

The common pattern: take pre-trained Word2Vec vectors, use them as the input representation for a task-specific model, and fine-tune or freeze depending on dataset size. This "pre-train then fine-tune" paradigm became the standard approach and was later scaled up dramatically by ELMo, BERT, and GPT.

Bias in word vectors

Bolukbasi et al. (2016) famously showed that Word2Vec encodes societal biases:

vec("man") − vec("woman") ≈ vec("computer programmer") − vec("homemaker")
vec("man") − vec("woman") ≈ vec("doctor") − vec("nurse")

These reflect biases in the training data (Google News), not facts about the world. Debiasing techniques were developed (projecting out the gender direction), but this remains an active area of research. Every word embedding method — Word2Vec, GloVe, FastText, BERT — inherits biases from its training data.

Dimensionality and information theory

Why 300 dimensions? The paper tested d from 50 to 1000 and found 300 to be a sweet spot. There is an information-theoretic argument for why:

Going beyond 300 starts capturing corpus-specific noise rather than generalizable semantic structure. This is a form of overfitting to the training corpus.

The Word2Vec ecosystem

Word2Vec spawned an entire ecosystem of tools and extensions:

Tool/ExtensionContribution
gensim (Python)Most popular implementation for training and using Word2Vec
FastText (Facebook)Extended to character n-grams for morphology and OOV words
doc2vecExtended Word2Vec to learn document-level vectors
sense2vecDisambiguated senses: "bank_NOUN" vs. "bank_VERB"
node2vecApplied Word2Vec to graph nodes via random walks
item2vecApplied to recommendations: items as "words," sessions as "sentences"

The "2vec" naming convention became ubiquitous — any domain where you want embeddings from sequences adopted the Word2Vec framework.

What each dimension "means"

Individual dimensions of Word2Vec vectors don't correspond to interpretable features. Unlike hand-crafted features (where dimension 1 might be "is animal?" and dimension 2 might be "is living?"), Word2Vec dimensions are distributed. A single semantic concept (like "animality") is spread across many dimensions, and each dimension encodes many concepts simultaneously.

However, directions in the space can be interpretable. The vector king − queen ≈ man − woman shows that the "gender" concept corresponds to a specific direction. PCA analysis reveals that the top principal components often correspond to broad categories like frequency, concreteness, and sentiment — but individual dimensions do not.

This distributed representation is what gives Word2Vec its power: it can represent exponentially more concepts with d dimensions than a one-hot-per-concept scheme. With 300 dimensions, the space can distinguish billions of semantic nuances — far more than 300 hand-crafted features could capture.

Nearest neighbor exploration

The most intuitive way to verify that Word2Vec learned meaningful representations is to look at nearest neighbors. Here are real examples from the Google News vectors:

Query wordTop 5 nearest neighbors (cosine)
pythonpythons, snake, cobra, viper, reptile
javascriptjQuery, CSS, HTML, PHP, Ajax
berlinFrankfurt, Munich, Hamburg, Stuttgart, Dresden
guitarguitars, bass, drums, piano, acoustic
sadsaddened, heartbroken, upset, depressed, tragic
Einsteinrelativity, physicist, Bohr, quantum, Heisenberg

Notice how "python" returns snakes (the word is more common in news as an animal), while "javascript" returns web technologies. The context in the training data determines the neighborhood. If you trained on a programming corpus, "python" would neighbor "java," "ruby," and "compiler."

Closest and farthest words

Some of the most distant word pairs (lowest cosine similarity) are also revealing:

The near-zero (but not exactly zero) similarity reflects that in a 300-dimensional space, random vectors are approximately orthogonal. Unrelated words behave like random directions.

Visualizing word vectors with t-SNE

Since word vectors live in 300-dimensional space, we can't visualize them directly. t-SNE (t-distributed Stochastic Neighbor Embedding) projects them to 2D while preserving local neighborhood structure. The result shows clear semantic clusters:

python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Select 200 words to visualize
words = ["king", "queen", "prince", "princess",
         "man", "woman", "boy", "girl",
         "cat", "dog", "fish", "bird",
         "car", "truck", "bus", "bicycle",
         # ... more words ...]

vecs = np.array([model.wv[w] for w in words])  # (200, 300)

# Project to 2D
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
vecs_2d = tsne.fit_transform(vecs)  # (200, 2)

# Plot — animals cluster together, royalty clusters together, etc.
plt.scatter(vecs_2d[:, 0], vecs_2d[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, vecs_2d[i])

The t-SNE plot reveals clusters that no one programmed: animals group together, vehicles group together, royalty groups together. Gender-related words form parallel lines (king/queen, prince/princess, man/woman). This emergent structure is the hallmark of distributional word vectors.

Historical context: why 2013?

Word embeddings existed before Word2Vec. Bengio et al. (2003) had shown that neural networks could learn word vectors. Why did Word2Vec — not the 2003 paper — trigger the revolution?

  1. Scale: Bengio's NNLM could train on millions of words. Word2Vec could train on billions. The 1000x scale difference was the game-changer.
  2. Speed: Word2Vec's shallow architecture and hierarchical softmax made training fast enough to iterate quickly. Researchers could train vectors overnight and evaluate them the next day.
  3. The analogy surprise: The king − man + woman = queen result captured the imagination of the entire field. It was a vivid, memorable demonstration that word vectors had learned something deep about language.
  4. Public release: Mikolov released both the source code (word2vec.c) and pre-trained vectors. This lowered the barrier to entry enormously — anyone could download and use the vectors immediately.
  5. Timing: The deep learning revolution was just beginning. Word2Vec rode the wave of excitement around neural networks and showed that even shallow neural methods could discover rich structure in data.

The training configuration that won

The best configuration from the paper:

Model
Skip-gram (not CBOW)
Dimension
d = 1000 (diminishing returns above 300)
Data
6 billion words (Google News)
Training
Hierarchical softmax, 1 epoch
Result
66% semantic, 65% syntactic analogy accuracy

Speed comparison

ModelTraining timeWhy
NNLM (Bengio 2003)Weeks on CPUHidden layer + full softmax over V
RNNLMDays on CPURecurrent computation is sequential
CBOWHours on CPUNo hidden layer, hier. softmax
Skip-gram~1 day on CPUShallow model + hier. softmax + multi-threading
Training Speed Comparison

Word2Vec's shallow architecture enables training that is orders of magnitude faster than prior neural language models on the same data.

Word2Vec achieved a 100-1000x speedup over prior neural language models. This speedup came entirely from simplifying the model (removing the hidden layer), not from better hardware. The lesson: simpler models can learn from more data, and more data wins.

What was the single most important factor for improving word vector quality in the paper's experiments?

Chapter 9: Connections

What Word2Vec built on

Neural language models (Bengio et al., 2003): The first neural network to learn word embeddings as a byproduct of language modeling. Word2Vec stripped away the hidden layers and the language modeling objective, keeping only the embedding learning — making it vastly more scalable.

Latent Semantic Analysis (LSA) (Deerwester et al., 1990): The original count-based approach to word vectors. Build a word-document co-occurrence matrix and reduce dimensionality with SVD. Word2Vec achieved better results by predicting rather than counting.

Distributional semantics (Firth, 1957; Harris, 1954): The linguistic theory that meaning comes from usage patterns. Word2Vec operationalized this with a neural network that turned the theory into a practical algorithm.

What Word2Vec enabled

Word2Vec extensions (Mikolov et al., 2013b): The follow-up paper introducing negative sampling, subsampling, and phrase vectors. Covered in our companion veanor.

GloVe (Pennington et al., 2014): Showed that Word2Vec's Skip-gram implicitly factorizes a word-context co-occurrence matrix. GloVe made this explicit by directly optimizing a weighted least-squares objective on co-occurrence counts. See our GloVe veanor.

FastText (Bojanowski et al., 2017): Extended Word2Vec to use character n-grams, enabling embeddings for out-of-vocabulary words. The vector for "unfairness" is the sum of subword vectors like "un," "fair," "ness."

ELMo (Peters et al., 2018): Moved beyond static word vectors to context-dependent representations using a bidirectional LSTM. "Bank" gets different vectors in "river bank" vs. "bank account."

BERT and Transformers (Devlin et al., 2018): Transformers produce fully contextualized embeddings. Every token gets a unique vector based on the full input. The embedding layer of a transformer is essentially a Word2Vec-style lookup table — the connection is direct.

The Transformer embedding connection

The very first layer of every transformer (GPT, BERT, LLaMA) is an embedding lookup table — exactly a Word2Vec-style matrix W of shape (V, d). Token ID in, dense vector out. The difference is what happens after: transformers stack attention layers on top of the initial embeddings, producing context-dependent representations. But the initial embedding is still a static lookup, learned end-to-end.

python
import torch.nn as nn

# In a transformer, the first layer is literally Word2Vec's W_in:
embedding = nn.Embedding(vocab_size, d_model)  # shape: (V, d)

# Token IDs → dense vectors, exactly like Word2Vec
token_ids = torch.tensor([42, 17, 8391])
embeddings = embedding(token_ids)  # shape: (3, d_model)

# Then transformers add positional encoding + self-attention layers
# Word2Vec stops here; transformers build context on top

Word2Vec's lasting legacy

The paradigm shift: Word2Vec demonstrated three things that reshaped NLP. (1) Simple prediction tasks on raw text learn rich representations. (2) Linear structure in embedding spaces encodes semantic relationships. (3) Scale (data + efficient algorithms) matters more than model complexity. These insights directly foreshadow the transformer era: GPT and BERT are, at their core, prediction machines learning representations from vast amounts of text. Word2Vec was the proof of concept.

Known limitations

Cheat sheet

Core idea
Learn word vectors by predicting context (Skip-gram) or center (CBOW) words using a shallow neural network
Two architectures
CBOW: context → center (fast, good for frequent words). Skip-gram: center → context (better for rare words)
Key property
Vector arithmetic: king − man + woman ≈ queen. Semantic relationships are directions in the space
Scaling
Hierarchical softmax: O(log V) vs O(V). Trained on 6B words in a day. More data > bigger model
Impact
Launched the "word embeddings" era. Directly enabled GloVe, FastText, ELMo, and the transformer revolution
What are the three key insights from Word2Vec that foreshadowed the transformer era?