Word2Vec — Veanors

Chapter 1: One-Hot Encoding

Before Word2Vec, the standard way to feed words into a neural network was the one-hot vector. If your vocabulary has V words, you represent each word as a vector of length V with exactly one entry set to 1 and all others set to 0.

For example, with a tiny vocabulary of 5 words:

"cat"

[1, 0, 0, 0, 0]

"dog"

[0, 1, 0, 0, 0]

"fish"

[0, 0, 1, 0, 0]

"king"

[0, 0, 0, 1, 0]

"queen"

[0, 0, 0, 0, 1]

The problem is now mathematically precise. The dot product of any two different one-hot vectors is exactly zero:

one_hot("cat") · one_hot("dog") = 0

one_hot("king") · one_hot("queen") = 0

"Cat" is as different from "dog" as "king" is from "fish." Every pair of words is equally dissimilar. This is the orthogonality problem: one-hot vectors are mutually orthogonal, so no similarity structure exists.

With a real vocabulary, this gets worse. Google's Word2Vec was trained with V = 1,000,000. That means each word is a million-dimensional vector with 999,999 zeros. This is wasteful, uninformative, and computationally prohibitive for any downstream model to process directly.

The key realization: A one-hot vector is just a lookup key. It says "I am word #4,271" but carries zero information about what word #4,271 means. What we want is a distributed representation — a dense vector where every dimension contributes a little bit of meaning. In such a space, "cat" and "kitten" would have similar vectors because they share semantic features: they're animals, they're small, they're pets.

Dimensionality comparison

The contrast between one-hot and dense representations is stark:

Property	One-hot	Dense embedding
Dimensions	V (e.g., 1,000,000)	d (e.g., 300)
Non-zero entries	1	All d
Memory per word	V × 4 bytes = 4 MB	d × 4 bytes = 1.2 KB
Dot product meaning	0 for all pairs (no info)	Similarity score
Information content	log₂(V) bits (just an ID)	~300 floats of learned semantics

The embedding idea

What if we learned a matrix W of shape (V × d), where d ≪ V? Each row of W is a d-dimensional vector for one word. To get the embedding for word i, we simply multiply:

e_i = one_hot(i) · W = W[i, :]

The one-hot multiplication is just a row lookup. But if we learn the values in W by training on a useful task, the resulting rows will encode meaning. This is the core of Word2Vec: learn W by predicting words from their context.

One-Hot to Dense Embedding

Click a word to see its one-hot vector multiplied by the embedding matrix W. The result is a dense, low-dimensional vector. Drag the slider to change the embedding dimension d.

Embed dim d 4

python
import numpy as np

V = 10000   # vocabulary size
d = 300     # embedding dimension

# The embedding matrix (randomly initialized, then learned)
W = np.random.randn(V, d) * 0.01

# One-hot lookup is just row indexing
word_id = 42
one_hot = np.zeros(V)
one_hot[word_id] = 1

# These two are equivalent:
embedding_matmul = one_hot @ W        # shape: (d,)
embedding_lookup = W[word_id]          # shape: (d,) — just a row lookup
assert np.allclose(embedding_matmul, embedding_lookup)

Why is multiplying a one-hot vector by an embedding matrix equivalent to a simple row lookup?

The one-hot vector has a 1 at position i and 0s everywhere else, so the matrix multiply selects row i and zeros out all other rows The matrix has a special diagonal structure One-hot vectors are normalized, so multiplication preserves the row

Chapter 2: The Distributional Hypothesis

We want to learn dense word vectors where similar words end up close together. But what does "similar" mean? How do we decide that "cat" should be near "kitten" and far from "democracy"?

The answer comes from a beautifully simple idea from linguistics, articulated by J.R. Firth in 1957:

"You shall know a word by the company it keeps." Words that appear in similar contexts have similar meanings. "Cat" and "kitten" both appear near "pet," "fluffy," "purr," "milk." "Democracy" appears near "government," "vote," "election." The contexts define the meaning.

This is the distributional hypothesis. It says that semantic similarity can be derived entirely from patterns of co-occurrence in text. You do not need dictionaries, ontologies, or human annotations. You just need a lot of text.

Consider these sentences:

Sentence 1

The cat sat on the mat and purred.

Sentence 2

The kitten sat on the rug and purred.

Sentence 3

The dog sat on the mat and barked.

Sentence 4

The president sat at the desk and signed.

"Cat" and "kitten" share almost identical contexts (sat, on, the, mat/rug, purred). "Dog" shares most context with "cat" (sat, on, the, mat) but differs in one key word (barked vs. purred). "President" shares far less context. From this, a good algorithm should learn: cat ≈ kitten > dog > president, in terms of similarity.

The context window

Word2Vec operationalizes this by defining a context window of size c around each target word. For the sentence "the cat sat on the mat" with c = 2, the context of "sat" is {the, cat, on, the} — the two words before and after.

Sliding Context Window

Watch the context window slide across a sentence. The center word is the target (orange); the highlighted words are the context (teal). Drag the slider to change window size c.

Window c 2

Target position 3

Word2Vec's genius is turning the distributional hypothesis into a prediction task. Instead of counting co-occurrences (the traditional approach), it predicts words from their context. The prediction errors shape the embedding vectors. Two words that can substitute for each other in many contexts end up with similar vectors — because similar vectors make similar predictions.

From counting to predicting

Before Word2Vec, the distributional hypothesis was applied through count-based methods: build a large matrix of word co-occurrence counts, then reduce its dimensionality with SVD. This worked but had limitations:

The co-occurrence matrix is huge (V × V) and sparse
SVD on large matrices is computationally expensive
The resulting vectors performed well on word similarity but poorly on analogies

Word2Vec showed that you can skip the matrix entirely. Instead of counting how often words co-occur and then compressing the counts, just train a model to predict co-occurrence. The embeddings learned by prediction turn out to have richer structure than those learned by counting — specifically, the linear analogy structure (king − man + woman ≈ queen) that made Word2Vec famous.

Window size matters

The window size c controls what kind of similarity the vectors capture:

Window size	Type of similarity	Example neighbors of "dog"
c = 1-2 (narrow)	Syntactic	cat, horse, bird (same syntactic role)
c = 5-10 (wide)	Semantic	puppy, pet, canine (same meaning)
c = 15+ (very wide)	Topical	veterinarian, leash, breed (same topic)

Narrow windows find words that can substitute for each other in a sentence. Wide windows find words that appear in the same kinds of documents. The paper's default of c = 5-10 captures a mix of syntactic and semantic similarity.

Counting context features: a manual exercise

To build intuition, let's manually trace what Word2Vec would learn from a tiny corpus. Consider just four sentences:

"the cat sat on the mat"
"the cat ate the fish"
"the dog sat on the rug"
"the dog ate the bone"

With c = 1 (looking at only the immediately adjacent words), the context for each word includes:

Target word	Observed contexts (c=1)
cat	the(4x), sat(1x), ate(1x)
dog	the(4x), sat(1x), ate(1x)
sat	cat(1x), dog(1x), on(2x)
ate	cat(1x), dog(1x), the(2x)
mat	the(1x)
fish	the(1x)

"Cat" and "dog" have identical context distributions! They both appear after "the" and before "sat" and "ate." Word2Vec would give them nearly identical vectors. Meanwhile, "sat" and "ate" also have similar contexts (both appear after "cat" and "dog"), so they'd be close too — but slightly different because "sat" co-occurs with "on" while "ate" co-occurs with "the."

This simple example illustrates the distributional hypothesis in action: words with the same syntactic and semantic roles get the same contexts, and therefore the same vectors.

According to the distributional hypothesis, why would "cat" and "kitten" get similar word vectors?

They appear in similar contexts — surrounded by similar words like "pet," "fluffy," "purr" — so a model that learns from context will assign them similar vectors They have similar spelling They are defined together in dictionaries

Chapter 3: The CBOW Architecture

The first of Word2Vec's two models is Continuous Bag-of-Words (CBOW). The task: given the context words around a gap, predict the missing center word.

For the sentence "the cat sat on the mat" with window c = 2, if the target is "sat" (position 2), the training example is:

Input (context)

{"the", "cat", "on", "the"}

↓ predict

Output (target)

"sat"

The architecture is a single hidden layer neural network:

Step 1: Look up context embeddings

Each context word is mapped to its embedding using the input matrix W_in (shape V × d). This is just a row lookup for each context word.

Step 2: Average the context vectors

CBOW takes the average of all context word embeddings. This is the "bag-of-words" part — order does not matter, only which words are present.

h = (1 / 2c) ∑_{i ∈ context} W_in[i, :]

Where h is the hidden representation (shape d).

Step 3: Predict the target word

The hidden vector h is multiplied by an output matrix W_out (shape d × V) to produce scores for every word in the vocabulary, then softmax converts these to probabilities:

scores = h · W_out

P(w_j | context) = exp(score_j) / ∑_k=1^V exp(score_k)

Step 4: Compute the loss

The loss is the negative log-probability of the true target word:

L = −log P(w_target | context)

Why "continuous"? Earlier bag-of-words models used discrete features (word counts). CBOW uses continuous embeddings — dense real-valued vectors — hence the name. The "bag" part means it ignores word order within the context window.

CBOW Forward Pass

Context words are embedded, averaged into a hidden vector h, then projected to a probability distribution over the vocabulary. Click "Step" to walk through each stage.

Click Step to begin

python
import numpy as np

def cbow_forward(context_ids, W_in, W_out):
    """
    context_ids: list of int, shape (2*c,) — IDs of context words
    W_in:  (V, d) — input embedding matrix
    W_out: (d, V) — output projection matrix
    Returns: probability distribution over vocab, shape (V,)
    """
    # Step 1: look up embeddings for each context word
    embeddings = W_in[context_ids]   # shape: (2c, d)

    # Step 2: average them
    h = embeddings.mean(axis=0)     # shape: (d,)

    # Step 3: project to vocabulary scores
    scores = h @ W_out               # shape: (V,)

    # Step 4: softmax
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()

    return probs  # P(w | context) for each word w

Two matrices, not one. CBOW uses two separate matrices: W_in for input embeddings (context words) and W_out for output predictions (target words). After training, we typically keep only W_in as the final word vectors. Some practitioners average W_in and W_out for slightly better results.

Worked example: CBOW gradient update

Let's trace a complete forward and backward pass with a tiny vocabulary of 3 words and embedding dimension d = 2.

Vocabulary: {cat = 0, sat = 1, mat = 2}. Context = {cat, mat}, target = sat. Window c = 1.

Suppose our current matrices are:

W_in = [[0.5, −0.2], [0.1, 0.8], [−0.3, 0.4]]

W_out = [[0.3, 0.6], [−0.1, 0.5], [0.2, −0.3]]

Forward pass:

Look up context embeddings: v_cat = [0.5, −0.2], v_mat = [−0.3, 0.4]
Average: h = ([0.5, −0.2] + [−0.3, 0.4]) / 2 = [0.1, 0.1]
Scores: s = W_out · h = [0.09, 0.04, −0.01]
Softmax: exp(s) = [1.094, 1.041, 0.990], sum = 3.125, P = [0.350, 0.333, 0.317]
Loss: L = −log P(sat) = −log(0.333) = 1.099

Backward pass:

Error vector: e = P − one_hot(sat) = [0.350, −0.667, 0.317]
Gradient for W_out[sat]: ∂L/∂v'_sat = e[1] · h = −0.667 · [0.1, 0.1] = [−0.067, −0.067]. This pushes the "sat" output vector toward h.
Gradient for h: ∂L/∂h = W_out^T · e = sum of error-weighted output vectors
Distribute ∂L/∂h equally to each context word's input embedding

python
# Complete CBOW training step with backpropagation
import numpy as np

V, d = 3, 2
W_in = np.array([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]])
W_out = np.array([[0.3, 0.6], [-0.1, 0.5], [0.2, -0.3]])

context_ids = [0, 2]  # cat, mat
target_id = 1         # sat
lr = 0.1

# Forward
h = W_in[context_ids].mean(axis=0)           # [0.1, 0.1]
scores = W_out @ h                             # [0.09, 0.04, -0.01]
exp_s = np.exp(scores - scores.max())
probs = exp_s / exp_s.sum()                    # [0.350, 0.333, 0.317]
loss = -np.log(probs[target_id])               # 1.099

# Backward
error = probs.copy()
error[target_id] -= 1                         # [0.35, -0.67, 0.32]

# Gradient for output matrix: outer product of error and h
dW_out = np.outer(error, h)                    # (3, 2)
W_out -= lr * dW_out

# Gradient for hidden: project error back through output matrix
dh = W_out.T @ error                           # (2,)
# Distribute equally to all context words (CBOW averages, so grad is shared)
for cid in context_ids:
    W_in[cid] -= lr * dh / len(context_ids)

print(f"Loss before: {loss:.3f}")  # 1.099
print(f"P(sat|context) before: {probs[target_id]:.3f}")  # 0.333

What the gradient does geometrically

After one gradient step, three things happen simultaneously:

The output vector v'_sat moves toward h (the averaged context). Future predictions of "sat" from similar contexts will be more confident.
The output vectors v'_cat and v'_mat move away from h. These words were predicted but shouldn't have been.
The input vectors v_cat and v_mat adjust to make h point more toward v'_sat. The context embeddings shift to better predict the center word.

Over millions of training examples, this push-pull dynamic shapes the embedding space. Words that frequently predict each other (because they share contexts) have their vectors pulled together. Words that should not predict each other have their vectors pushed apart.

In CBOW, what is the hidden representation h?

The average of the embedding vectors of all context words within the window The concatenation of all context word embeddings The one-hot vector of the target word

Chapter 4: The Skip-gram Architecture

Skip-gram flips CBOW on its head. Instead of predicting the center word from the context, it predicts each context word from the center word. Same data, opposite direction.

For the same sentence "the cat sat on the mat" with window c = 2 and target "sat":

Input (center)

"sat"

↓ predict each of

Output (context)

"the", "cat", "on", "the"

This generates 2c training pairs from each word position. For c = 2, each center word produces 4 pairs: (sat, the), (sat, cat), (sat, on), (sat, the).

The architecture

Even simpler than CBOW — no averaging needed:

Step 1: Look up the embedding of the center word from W_in:

h = W_in[w_center, :]

Step 2: For each context position, compute the probability of each vocabulary word using the output matrix W_out:

P(w_j | w_center) = exp(W_out[j, :] · h) / ∑_k=1^V exp(W_out[k, :] · h)

Step 3: The loss sums over all context positions:

L = −∑_{j ∈ context} log P(w_j | w_center)

Why Skip-gram often works better

Skip-gram treats each (center, context) pair as a separate training example, so it generates more training signal per word. For a sentence of length T with window c, Skip-gram produces ~2cT training pairs versus T for CBOW. This extra data is especially helpful for rare words: a rare word appears as a center word 2c times more often in Skip-gram's training data (once per context position) than in CBOW's (once total).

Worked example: Skip-gram training pairs

For the sentence "the cat sat on the mat" with c = 2, here are all the training pairs generated by Skip-gram:

Center word	Context word	Distance
"the" (pos 0)	"cat" (pos 1)	1
"the" (pos 0)	"sat" (pos 2)	2
"cat" (pos 1)	"the" (pos 0)	1
"cat" (pos 1)	"sat" (pos 2)	1
"cat" (pos 1)	"on" (pos 3)	2
"sat" (pos 2)	"the" (pos 0)	2
"sat" (pos 2)	"cat" (pos 1)	1
"sat" (pos 2)	"on" (pos 3)	1
"sat" (pos 2)	"the" (pos 4)	2
... and so on for "on", "the", "mat"

This 6-word sentence generates approximately 6 × 2 × 2 = 24 training pairs. A corpus of 6 billion words with c = 5 generates ~60 billion pairs. Each pair provides a gradient signal that shapes the embedding space.

The independence assumption

Skip-gram makes a strong simplifying assumption: each context word is predicted independently. The objective sums over context positions without any interaction:

L = −∑_{j ∈ context} log P(w_j | w_center)

This means Skip-gram cannot capture the fact that "cat" and "mat" both being in the context is more informative than either alone. This is a deliberate simplification for computational efficiency — and it works remarkably well despite the approximation.

CBOW vs. Skip-gram: CBOW is faster to train (one prediction per window position) and works well for frequent words. Skip-gram is slower (2c predictions per position) but produces better embeddings for rare words. The paper found Skip-gram slightly outperformed CBOW on semantic tasks, while CBOW was better on syntactic tasks. In practice, Skip-gram became the more popular choice.

CBOW vs. Skip-gram

Toggle between architectures to see how data flows differently. CBOW: many-to-one (context predicts center). Skip-gram: one-to-many (center predicts context).

python
def skipgram_forward(center_id, W_in, W_out):
    """
    center_id: int — ID of center word
    W_in:  (V, d) — input embedding matrix
    W_out: (V, d) — output embedding matrix (note: transposed vs CBOW)
    Returns: probability distribution over vocab, shape (V,)
    """
    # Step 1: look up center word embedding
    h = W_in[center_id]          # shape: (d,)

    # Step 2: score every vocabulary word
    scores = W_out @ h            # shape: (V,) — dot product of each output vector with h

    # Step 3: softmax
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()

    return probs  # P(w_context | w_center) for each word

# Training: for each center word, predict each context word
for center_pos in range(len(sentence)):
    center_id = sentence[center_pos]
    for offset in range(-c, c+1):
        if offset == 0: continue
        context_pos = center_pos + offset
        if 0 <= context_pos < len(sentence):
            context_id = sentence[context_pos]
            probs = skipgram_forward(center_id, W_in, W_out)
            loss = -np.log(probs[context_id] + 1e-10)

How does Skip-gram differ from CBOW in its prediction task?

CBOW predicts the center word from the context words (many-to-one), while Skip-gram predicts each context word from the center word (one-to-many) Skip-gram uses a deeper network with more layers Skip-gram uses character-level features instead of word-level

Chapter 5: The Training Objective

Both CBOW and Skip-gram end with a softmax over the entire vocabulary. Let's understand exactly what this means computationally — and why it becomes a bottleneck.

The full softmax

For Skip-gram, the probability of context word w_O given center word w_I is:

P(w_O | w_I) = exp(v'_{w_O} · v_{w_I}) / ∑_j=1^V exp(v'_j · v_{w_I})

Here, v_w is the input vector (from W_in) and v'_w is the output vector (from W_out) for word w. The dot product v'_{w_O} · v_{w_I} measures how well the center word predicts this particular context word.

The computational cost

The denominator sums over all V words. For V = 1,000,000, every single training step requires computing 1 million dot products and exponentials. With a corpus of billions of words and multiple context positions per word, this becomes impossibly slow.

Cost per step = O(V · d)

For V = 10⁶ and d = 300, that is 300 million floating-point operations just for the softmax — per training example. With billions of training examples, full softmax training would take years.

The gradient

The gradient of the loss with respect to the output vector v'_j is:

∂L/∂v'_j = (P(j | w_I) − δ_{j=w_O}) · v_{w_I}

Where δ_{j=w_O} is 1 if j is the actual context word and 0 otherwise. This gradient has a beautiful interpretation: for the correct context word (j = w_O), the gradient pushes v'_j closer to v_{w_I} by an amount proportional to 1 − P(j | w_I). For all other words, the gradient pushes v'_j away from v_{w_I} by an amount proportional to P(j | w_I).

The softmax bottleneck: Every training step must update the output vectors of all V words, even though only one word is the correct answer. With V = 1 million, 999,999 words get a "push away" gradient. This is spectacularly wasteful. Chapter 6 covers the tricks that make training feasible: hierarchical softmax and negative sampling.

Numerical stability: the log-sum-exp trick

Computing softmax naively causes overflow. If any score is large (say 500), exp(500) = ∞. The standard trick: subtract the maximum score before exponentiating.

P(w_j) = exp(s_j − max(s)) / ∑_k exp(s_k − max(s))

This gives the same probabilities (the max cancels in the ratio) but keeps all exponentials in a numerically safe range. Every softmax implementation uses this trick.

Why not just use cross-entropy on one-hot labels?

The loss −log P(w_target | w_center) is exactly the cross-entropy between the one-hot label vector and the softmax output. This is the same loss used in standard classification. Word2Vec is classification — it classifies the center word's context into one of V categories. The word vectors are a byproduct of learning to classify well.

This insight — that useful representations emerge as byproducts of prediction tasks — is one of the most important ideas in modern ML. It foreshadows BERT (predict masked tokens), GPT (predict next token), and contrastive learning (predict matching pairs).

Worked example: softmax step by step

Vocabulary: V = 4 words: {cat=0, dog=1, sat=2, mat=3}. Center word: "sat" (id=2). True context: "cat" (id=0). Embedding dim d = 2.

Current vectors:

v_sat = [0.5, −0.3] (input)

v'_cat = [0.4, 0.2], v'_dog = [0.1, −0.5], v'_sat = [−0.2, 0.1], v'_mat = [0.3, 0.6]

Dot products (scores):

s_cat = 0.5·0.4 + (−0.3)·0.2 = 0.20 − 0.06 = 0.14

s_dog = 0.5·0.1 + (−0.3)·(−0.5) = 0.05 + 0.15 = 0.20

s_sat = 0.5·(−0.2) + (−0.3)·0.1 = −0.10 − 0.03 = −0.13

s_mat = 0.5·0.3 + (−0.3)·0.6 = 0.15 − 0.18 = −0.03

Softmax (subtract max = 0.20 for stability):

exp([0.14−0.20, 0, −0.13−0.20, −0.03−0.20]) = exp([−0.06, 0, −0.33, −0.23])

= [0.942, 1.000, 0.719, 0.795]

Sum = 3.456

P = [0.273, 0.289, 0.208, 0.230]

Loss:

L = −log P(cat) = −log(0.273) = 1.299

The model gives "cat" a 27.3% probability. Not terrible for random initialization (uniform would be 25%), but there's room for improvement. The gradient will push v'_cat toward v_sat and push the other output vectors away.

Gradient for each output vector v'_j:

∂L/∂v'_cat = (0.273 − 1) · v_sat = −0.727 · [0.5, −0.3] = [−0.364, 0.218]

∂L/∂v'_dog = (0.289 − 0) · v_sat = 0.289 · [0.5, −0.3] = [0.145, −0.087]

∂L/∂v'_sat = (0.208 − 0) · v_sat = 0.208 · [0.5, −0.3] = [0.104, −0.062]

∂L/∂v'_mat = (0.230 − 0) · v_sat = 0.230 · [0.5, −0.3] = [0.115, −0.069]

After subtracting lr · gradient: v'_cat moves 0.036 toward v_sat. The other three output vectors each move slightly away from v_sat. Over millions of such updates, the geometry of the embedding space takes shape.

Softmax Computation Cost

The denominator requires summing over all V words. Drag the vocabulary size to see how cost scales linearly. The red zone marks where full softmax becomes impractical.

Vocab V 10K

python
def skipgram_loss_and_gradient(center_id, context_id, W_in, W_out):
    """Full softmax loss and gradient for one (center, context) pair."""
    v_in = W_in[center_id]           # (d,) — center embedding
    scores = W_out @ v_in             # (V,) — dot product with every output vector

    # Softmax — this sum over V is the bottleneck
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()  # (V,)

    # Loss: negative log-prob of the true context word
    loss = -np.log(probs[context_id] + 1e-10)

    # Gradient w.r.t. output vectors
    grad_out = probs.copy()             # (V,) — P(j|center) for all j
    grad_out[context_id] -= 1           # subtract 1 for the true word
    dW_out = np.outer(grad_out, v_in)  # (V, d) — updates ALL V output vectors

    # Gradient w.r.t. input vector of center word
    dv_in = W_out.T @ grad_out          # (d,)

    return loss, dv_in, dW_out

Why is the full softmax computationally expensive for Word2Vec?

The denominator requires computing dot products and exponentials for all V words in the vocabulary (e.g., 1 million), at every single training step The softmax function is numerically unstable The embedding dimension d is too large

Chapter 6: Efficiency Tricks

The original Word2Vec paper introduced hierarchical softmax as the first solution to the softmax bottleneck. The follow-up paper (Mikolov et al., 2013b) introduced negative sampling, which became more popular. We cover hierarchical softmax here; negative sampling gets its own veanor.

Hierarchical softmax

Instead of a flat softmax over V words, arrange the vocabulary as a binary tree. Each word is a leaf. To compute the probability of a word, you walk from the root to that leaf, making a binary (left/right) decision at each internal node.

At each internal node n, you compute:

P(go left at node n) = σ(v'_n · h)

P(go right at node n) = 1 − σ(v'_n · h)

The probability of reaching a word w at depth L is the product of the binary decisions along the path:

P(w) = ∏_j=1^L σ(s_j · v'_{n_j} · h)

where s_j ∈ {−1, +1} encodes whether you go left or right at node j.

Why is this fast?

A balanced binary tree has depth log₂(V). So instead of summing over V terms, you compute log₂(V) sigmoids. For V = 1,000,000:

Full softmax: O(V) = 1,000,000 operations

Hierarchical softmax: O(log V) = 20 operations

That is a 50,000x speedup.

Huffman tree

The paper uses a Huffman tree based on word frequency, not a balanced binary tree. Frequent words get shorter paths (fewer decisions), rare words get longer paths. Since frequent words appear more often in training, this minimizes the average number of operations per step.

Hierarchical softmax is not an approximation. It computes the exact probability distribution over words — it's just organized as a tree of binary classifiers instead of a flat softmax. The probabilities still sum to 1 across all leaves. The tree structure only changes the computational complexity, not the mathematical result.

Worked example: probability computation

Consider a tiny Huffman tree with 4 words: the(common), cat, sat, zinc(rare). In a Huffman tree, "the" gets a short path (1 decision) and "zinc" gets a long path (3 decisions).

Suppose the hidden vector is h = [0.5, 0.3] and the tree nodes have vectors:

Root: v_root = [0.8, −0.2]. σ(v_root · h) = σ(0.4 − 0.06) = σ(0.34) = 0.584
Left child: v_left = [0.1, 0.6]. σ(v_left · h) = σ(0.05 + 0.18) = σ(0.23) = 0.557
Right child: v_right = [−0.3, 0.9]. σ(v_right · h) = σ(−0.15 + 0.27) = σ(0.12) = 0.530

Then:

P("the") = P(left at root) = σ(0.34) = 0.584

P("cat") = P(right at root) × P(left at right child) = (1 − 0.584) × 0.530 = 0.416 × 0.530 = 0.220

Only 1 sigmoid for "the" (most common word) vs. 2 for "cat." The Huffman tree ensures the average path length is minimized based on word frequencies — an optimal prefix code.

Training with hierarchical softmax

During backpropagation, only the node vectors along the path from root to the target word receive gradient updates. For a target word at depth L, only L node vectors are updated (plus the center word's input vector). Compared to full softmax which updates all V output vectors, this is a massive saving.

Implementation detail: the unigram table

For efficient sampling of negative words, the original Word2Vec implementation precomputes a large "unigram table" — an array of 100 million entries where each word appears proportionally to f(w)^3/4. To sample a negative word, generate a random integer index into this table and look up the word. This is O(1) per sample, much faster than computing cumulative distributions.

python
# Build unigram table for O(1) negative sampling
import numpy as np

def build_unigram_table(word_freqs, table_size=100_000_000, power=0.75):
    """Build a lookup table for O(1) negative sampling."""
    powered = word_freqs ** power
    norm = powered / powered.sum()

    table = np.zeros(table_size, dtype=np.int32)
    idx = 0
    cumulative = 0.0

    for word_id in range(len(word_freqs)):
        cumulative += norm[word_id]
        while idx < table_size and idx / table_size < cumulative:
            table[idx] = word_id
            idx += 1

    return table

# Sampling is now O(1):
# neg_word = table[random.randint(0, table_size)]

Hyperparameter summary

Hyperparameter	Paper default	Effect
Embedding dim d	300	Quality increases up to ~300, then plateaus
Window size c	5 (SG), 5 (CBOW)	Larger = more semantic; smaller = more syntactic
Learning rate	0.025 (SG), 0.05 (CBOW)	Linearly decayed during training
Min word count	5	Words below this threshold are discarded
Negative samples k	5-20	More = slower but better for small data
Subsampling t	10⁻⁵	Threshold for discarding frequent words
Training epochs	1-5	More epochs on small data; 1 pass on large data

Hierarchical Softmax Tree

A binary tree over the vocabulary. To compute P("cat"), follow the path from root to the "cat" leaf. Each node makes a binary decision using a sigmoid. Click a leaf word to highlight its path.

python
def hierarchical_softmax(word_id, h, tree_vectors, tree_paths, tree_codes):
    """
    word_id:      target word
    h:            hidden vector (d,) — from center/context embedding
    tree_vectors: (num_internal_nodes, d) — learned vectors at each node
    tree_paths:   list of lists — path from root to each word
    tree_codes:   list of lists — left(0)/right(1) at each step
    """
    path = tree_paths[word_id]     # e.g., [0, 3, 7] — internal node IDs
    codes = tree_codes[word_id]    # e.g., [1, 0, 1] — go right, left, right

    log_prob = 0.0
    for node_id, code in zip(path, codes):
        v_node = tree_vectors[node_id]      # (d,)
        dot = v_node @ h                     # scalar
        sig = 1.0 / (1.0 + np.exp(-dot))      # sigmoid

        if code == 0:   # go left
            log_prob += np.log(sig + 1e-10)
        else:            # go right
            log_prob += np.log(1 - sig + 1e-10)

    return log_prob   # log P(word | h) — only log2(V) operations!

How much speedup does hierarchical softmax provide over full softmax for a vocabulary of 1 million words?

About 50,000x — from O(V) = 1,000,000 operations to O(log V) = 20 operations per training step About 2x — it halves the computation About 10x — it reduces the vocabulary size by a factor of 10

Chapter 7: Vector Arithmetic

This is the most surprising result of the paper — the reason Word2Vec captured the imagination of the entire ML community. The learned word vectors exhibit linear algebraic structure that encodes semantic relationships.

The most famous example:

vec("king") − vec("man") + vec("woman") ≈ vec("queen")

This is not a cherry-picked curiosity. The paper found systematic, consistent vector relationships across many semantic categories:

Relationship	Example	Vector operation
Gender	king : queen :: man : woman	king − man + woman ≈ queen
Country-Capital	France : Paris :: Japan : Tokyo	France − Paris + Tokyo ≈ Japan
Tense	walking : walked :: swimming : swam	walking − walked + swam ≈ swimming
Comparative	big : bigger :: small : smaller	big − bigger + smaller ≈ small
Plural	car : cars :: apple : apples	car − cars + apples ≈ apple

How this is computed

Given an analogy "a is to b as c is to ?", we compute:

vec(?) = vec(b) − vec(a) + vec(c)

Then we find the word whose vector is closest to vec(?) by cosine similarity:

answer = argmax_w cos(vec(w), vec(?)) = argmax_w (vec(w) · vec(?)) / (||vec(w)|| · ||vec(?)||)

(excluding a, b, c themselves from the search).

Why does this work?

The training objective forces the model to capture co-occurrence patterns. "King" and "queen" appear in similar contexts (royal, throne, crown) but differ in gender-related contexts. This difference is captured as a consistent direction in the vector space — the same direction that separates "man" from "woman."

More formally, if a direction d encodes "maleness" then:

vec("king") ≈ vec("royalty") + d

vec("queen") ≈ vec("royalty") − d

vec("man") ≈ vec("person") + d

vec("woman") ≈ vec("person") − d

Subtracting "man" from "king" removes the "person + d" component and leaves "royalty − person". Adding "woman" gives "royalty − person + person − d = royalty − d = queen". The direction d cancels in the subtraction and reappears in the addition.

The deeper lesson: Linear relationships in word vectors emerge because the training objective (predicting context) forces the model to encode semantic features as directions in the embedding space. Each dimension is not explicitly assigned a meaning, but the space self-organizes so that meaningful relationships correspond to vector arithmetic. This was the first strong evidence that simple prediction tasks on raw text could discover rich semantic structure automatically.

Cosine similarity vs. Euclidean distance

Word2Vec uses cosine similarity rather than Euclidean distance to measure word similarity. Why?

Euclidean distance is sensitive to vector magnitude. A word that appears frequently tends to have a larger-magnitude vector (more training updates push it further from the origin). Cosine similarity normalizes by magnitude, measuring only the angle between vectors:

cos(u, v) = (u · v) / (||u|| · ||v||) = ∑_i u_iv_i / (√∑_iu_i² · √∑_iv_i²)

Range: −1 (opposite directions) to +1 (same direction). Two words with cosine similarity 0.9 are nearly interchangeable in many contexts. Words with similarity 0.1 are semantically distant.

Typical similarity values

Word pair	Cosine similarity	Interpretation
cat — kitten	0.89	Very similar (same concept)
cat — dog	0.76	Similar (both pets/animals)
cat — car	0.21	Unrelated
king — queen	0.73	Related (both royalty, different gender)
king — man	0.62	Moderately related (both male, different status)
good — bad	0.72	High! Antonyms are close because they appear in similar contexts

Cosine Similarity in Practice

Real cosine similarities between word pairs in Word2Vec. Notice that "good" and "bad" (antonyms) are surprisingly similar because they appear in identical syntactic contexts.

Antonyms are close. A surprising property: "good" and "bad" have high cosine similarity (~0.72) because they appear in nearly identical syntactic contexts ("the movie was good/bad," "it's not good/bad"). The distributional hypothesis captures substitutability, not opposition. This is a well-known limitation of static word vectors — they conflate different types of semantic relationships.

Word Vector Analogy Explorer

Explore word analogies in 2D. The orange arrow (king − man) is the "royalty" direction. Adding it to "woman" lands near "queen." Click different analogy types to see other relationships.

python
import numpy as np

def analogy(a, b, c, word_vectors, vocab):
    """
    Solve: a is to b as c is to ?
    word_vectors: dict mapping word -> np.array of shape (d,)
    vocab: list of all words
    """
    # Compute the target vector: b - a + c
    target = word_vectors[b] - word_vectors[a] + word_vectors[c]

    # Normalize the target
    target = target / (np.linalg.norm(target) + 1e-10)

    best_word, best_sim = None, -1
    exclude = {a, b, c}

    for word in vocab:
        if word in exclude:
            continue
        vec = word_vectors[word]
        vec_norm = vec / (np.linalg.norm(vec) + 1e-10)
        sim = np.dot(target, vec_norm)  # cosine similarity
        if sim > best_sim:
            best_sim = sim
            best_word = word

    return best_word, best_sim

# Example: king - man + woman = ?
answer, score = analogy("man", "king", "woman", vectors, vocab)
print(f"{answer} (similarity: {score:.3f})")  # queen (similarity: 0.891)

Why does vec("king") − vec("man") + vec("woman") ≈ vec("queen")?

Semantic relationships are encoded as consistent directions in the vector space. Subtracting "man" from "king" isolates the "royalty" direction, and adding "woman" moves to the female royal word The words are stored in alphabetical order and the arithmetic works by coincidence The model was explicitly trained with analogy labels

Chapter 8: Evaluation

How do you measure whether word vectors are "good"? The paper introduced a systematic evaluation based on word analogies and compared against prior methods.

The analogy test set

The paper created a test set of 8,869 semantic analogies and 10,675 syntactic analogies. An analogy is answered correctly only if the exact word is the nearest neighbor — no partial credit.

Category	# Questions	Example
Common capitals	506	Athens : Greece :: Baghdad : Iraq
All capitals	4,524	Abuja : Nigeria :: Accra : Ghana
Currency	866	Algeria : dinar :: Angola : kwanza
City-in-state	2,467	Chicago : Illinois :: Houston : Texas
Family	506	boy : girl :: brother : sister
Adjective-adverb	992	apparent : apparently :: rapid : rapidly
Opposite	812	aware : unaware :: possible : impossible
Comparative	1,332	bad : worse :: big : bigger
Superlative	1,056	bad : worst :: big : biggest
Present-past tense	1,560	dance : danced :: fly : flew
Nationality	1,599	Australia : Australian :: France : French
Plural nouns	1,332	banana : bananas :: bird : birds
Plural verbs	870	decrease : decreases :: go : goes

Key results

The paper compared CBOW and Skip-gram against previous approaches (NNLM, RNNLM) across dimensions d and training data sizes:

Model	Dim	Training Words	Semantic Acc %	Syntactic Acc %
NNLM	100	6B	23	53
RNNLM	640	320M	9	36
CBOW	300	1.6B	16	53
Skip-gram	300	1.6B	55	59
Skip-gram	1000	6B	66	65

Skip-gram with d=1000 on 6 billion words achieved 66% accuracy on semantic analogies and 65% on syntactic analogies — dramatically better than any prior method. The improvement came from two factors: the more efficient architecture allowed training on much larger datasets, and the Skip-gram objective captured semantic structure more effectively.

What the paper discovered about scaling

Three scaling laws emerged:

More data beats bigger models. 300-dimensional vectors on 6B words beat 600-dimensional vectors on 1B words.
Diminishing returns on dimension. Going from 100 to 300 dimensions helped a lot. Going from 300 to 1000 helped less.
Training time matters. More epochs on the same data improved results, up to a point. But more data was always better than more epochs.

Accuracy vs. Training Corpus Size

Analogy accuracy improves with more training data. The curve shows diminishing returns but no plateau — more data always helps. Drag to see how dimension and data interact.

Dimension d 300

Training details: The paper used Google News data (6 billion tokens, 1 million word vocabulary). Training Skip-gram with d=300 on this data took about a day on a single machine using the hierarchical softmax optimization. By comparison, the NNLM baseline (Bengio et al., 2003) needed weeks on a much smaller dataset. The 1000x speedup came from the shallow architecture and hierarchical softmax — no hidden layer nonlinearity, just embeddings.

Why NNLM and RNNLM were slower

The Neural Network Language Model (NNLM) of Bengio et al. used a full hidden layer with nonlinearities:

NNLM

Concatenate context embeddings → Hidden layer (tanh) → Output softmax over V

Word2Vec

Average/lookup context embeddings → Output (hier. softmax or NEG)

The hidden layer in NNLM requires a matrix multiplication of size (c·d × H) plus a nonlinearity, plus another matrix multiplication (H × V). Word2Vec removes the hidden layer entirely — the input embeddings connect directly to the output. This is why Word2Vec is called a "shallow" model. It has exactly zero hidden layers.

Complete PyTorch implementation of Skip-gram with NEG

python
import torch
import torch.nn as nn

class SkipGramNEG(nn.Module):
    def __init__(self, V, d):
        super().__init__()
        self.W_in = nn.Embedding(V, d)    # input embeddings
        self.W_out = nn.Embedding(V, d)   # output embeddings
        # Initialize small random values
        nn.init.uniform_(self.W_in.weight, -0.5/d, 0.5/d)
        nn.init.zeros_(self.W_out.weight)

    def forward(self, center, context, negatives):
        """
        center:    (batch,) — center word IDs
        context:   (batch,) — true context word IDs
        negatives: (batch, k) — negative sample IDs
        Returns: scalar loss
        """
        v_in = self.W_in(center)          # (batch, d)
        v_pos = self.W_out(context)       # (batch, d)
        v_neg = self.W_out(negatives)     # (batch, k, d)

        # Positive: push context toward center
        pos_dot = (v_in * v_pos).sum(dim=1)         # (batch,)
        pos_loss = -torch.nn.functional.logsigmoid(pos_dot)

        # Negative: push noise away from center
        neg_dot = torch.bmm(v_neg, v_in.unsqueeze(2)).squeeze()  # (batch, k)
        neg_loss = -torch.nn.functional.logsigmoid(-neg_dot).sum(dim=1)

        return (pos_loss + neg_loss).mean()

The entire model is two embedding layers and a few dot products. No hidden layers, no convolutions, no attention. This simplicity is what allowed training on billions of words in 2013 — years before GPUs became standard for NLP.

What Word2Vec is NOT

Many descriptions of Word2Vec call it a "deep learning" model. This is misleading. Word2Vec has zero hidden layers. It is a log-linear model — the simplest possible neural network. Its power comes not from depth or complexity but from:

The prediction task: Predicting context from words forces the embeddings to encode semantic structure.
Scale: Training on billions of words — enabled by the shallow architecture and efficient softmax alternatives — gives the model enough signal to discover fine-grained relationships.
The embedding space: The continuous vector space allows algebraic operations (addition, subtraction, cosine similarity) that are impossible with discrete representations.

The lesson for modern ML: sometimes the biggest breakthroughs come from simplifying models and training them on more data, not from adding complexity.

gensim: the practical tool

In practice, most people use the gensim Python library to train and use Word2Vec, not a custom implementation:

python
from gensim.models import Word2Vec

# Train on your own corpus
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]
model = Word2Vec(
    sentences,
    vector_size=300,    # embedding dimension
    window=5,            # context window size
    min_count=5,         # ignore rare words
    sg=1,                # 1 = Skip-gram, 0 = CBOW
    negative=5,          # number of negative samples
    sample=1e-5,         # subsampling threshold
    workers=4,           # parallel threads
    epochs=5,
)

# Use the trained model
vec = model.wv["cat"]                       # (300,)
sim = model.wv.similarity("cat", "dog")    # 0.76
neighbors = model.wv.most_similar("king")  # [("queen", 0.73), ...]

# Analogy: king - man + woman = ?
result = model.wv.most_similar(
    positive=["king", "woman"],
    negative=["man"]
)
print(result[0])  # ("queen", 0.89)

# Load pre-trained Google News vectors (1.6 GB)
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",
    binary=True
)

Pre-trained Word2Vec: Google released pre-trained Word2Vec vectors (3 million words, d=300, trained on 100B words of Google News). These vectors were the standard NLP initialization from 2013-2018. Download size: 1.5 GB binary format. Many researchers never trained their own Word2Vec — they just loaded the pre-trained vectors and used them as features.

Using Word2Vec for downstream tasks

Word2Vec vectors were used as features in virtually every NLP system from 2013-2018:

Task	How Word2Vec was used	Improvement over one-hot
Sentiment analysis	Average word vectors over document, feed to classifier	+5-10% accuracy
Named entity recognition	Concatenate word vector with other features for CRF	+2-3 F1 points
Machine translation	Initialize encoder/decoder embeddings	Faster convergence, +1-2 BLEU
Question answering	Embed question and passages, compute similarity	+5-15% accuracy
Relation extraction	Word vectors as input features to CNN/LSTM	+3-5 F1 points

The common pattern: take pre-trained Word2Vec vectors, use them as the input representation for a task-specific model, and fine-tune or freeze depending on dataset size. This "pre-train then fine-tune" paradigm became the standard approach and was later scaled up dramatically by ELMo, BERT, and GPT.

Bias in word vectors

Bolukbasi et al. (2016) famously showed that Word2Vec encodes societal biases:

vec("man") − vec("woman") ≈ vec("computer programmer") − vec("homemaker")

vec("man") − vec("woman") ≈ vec("doctor") − vec("nurse")

These reflect biases in the training data (Google News), not facts about the world. Debiasing techniques were developed (projecting out the gender direction), but this remains an active area of research. Every word embedding method — Word2Vec, GloVe, FastText, BERT — inherits biases from its training data.

Dimensionality and information theory

Why 300 dimensions? The paper tested d from 50 to 1000 and found 300 to be a sweet spot. There is an information-theoretic argument for why:

A vocabulary of 1M words has log₂(1M) ≈ 20 bits of information per word (just the ID).
Semantic relationships add much more information: each word has syntactic category, semantic features, sentiment, register, etc.
With 300 float32 dimensions, each vector can represent 300 × 32 = 9,600 bits — but with noise and redundancy, the effective information is much less.
Empirically, 300 dimensions capture enough semantic axes: gender, royalty, animacy, size, tense, plurality, formality, etc.

Going beyond 300 starts capturing corpus-specific noise rather than generalizable semantic structure. This is a form of overfitting to the training corpus.

The Word2Vec ecosystem

Word2Vec spawned an entire ecosystem of tools and extensions:

Tool/Extension	Contribution
gensim (Python)	Most popular implementation for training and using Word2Vec
FastText (Facebook)	Extended to character n-grams for morphology and OOV words
doc2vec	Extended Word2Vec to learn document-level vectors
sense2vec	Disambiguated senses: "bank_NOUN" vs. "bank_VERB"
node2vec	Applied Word2Vec to graph nodes via random walks
item2vec	Applied to recommendations: items as "words," sessions as "sentences"

The "2vec" naming convention became ubiquitous — any domain where you want embeddings from sequences adopted the Word2Vec framework.

What each dimension "means"

Individual dimensions of Word2Vec vectors don't correspond to interpretable features. Unlike hand-crafted features (where dimension 1 might be "is animal?" and dimension 2 might be "is living?"), Word2Vec dimensions are distributed. A single semantic concept (like "animality") is spread across many dimensions, and each dimension encodes many concepts simultaneously.

However, directions in the space can be interpretable. The vector king − queen ≈ man − woman shows that the "gender" concept corresponds to a specific direction. PCA analysis reveals that the top principal components often correspond to broad categories like frequency, concreteness, and sentiment — but individual dimensions do not.

This distributed representation is what gives Word2Vec its power: it can represent exponentially more concepts with d dimensions than a one-hot-per-concept scheme. With 300 dimensions, the space can distinguish billions of semantic nuances — far more than 300 hand-crafted features could capture.

Nearest neighbor exploration

The most intuitive way to verify that Word2Vec learned meaningful representations is to look at nearest neighbors. Here are real examples from the Google News vectors:

Query word	Top 5 nearest neighbors (cosine)
python	pythons, snake, cobra, viper, reptile
javascript	jQuery, CSS, HTML, PHP, Ajax
berlin	Frankfurt, Munich, Hamburg, Stuttgart, Dresden
guitar	guitars, bass, drums, piano, acoustic
sad	saddened, heartbroken, upset, depressed, tragic
Einstein	relativity, physicist, Bohr, quantum, Heisenberg

Notice how "python" returns snakes (the word is more common in news as an animal), while "javascript" returns web technologies. The context in the training data determines the neighborhood. If you trained on a programming corpus, "python" would neighbor "java," "ruby," and "compiler."

Closest and farthest words

Some of the most distant word pairs (lowest cosine similarity) are also revealing:

"king" and "broccoli" (sim ≈ 0.05) — no semantic overlap whatsoever
"democracy" and "pizza" (sim ≈ 0.08) — completely unrelated domains
"running" and "theorem" (sim ≈ 0.06) — activity vs. abstraction

The near-zero (but not exactly zero) similarity reflects that in a 300-dimensional space, random vectors are approximately orthogonal. Unrelated words behave like random directions.

Visualizing word vectors with t-SNE

Since word vectors live in 300-dimensional space, we can't visualize them directly. t-SNE (t-distributed Stochastic Neighbor Embedding) projects them to 2D while preserving local neighborhood structure. The result shows clear semantic clusters:

python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Select 200 words to visualize
words = ["king", "queen", "prince", "princess",
         "man", "woman", "boy", "girl",
         "cat", "dog", "fish", "bird",
         "car", "truck", "bus", "bicycle",
         # ... more words ...]

vecs = np.array([model.wv[w] for w in words])  # (200, 300)

# Project to 2D
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
vecs_2d = tsne.fit_transform(vecs)  # (200, 2)

# Plot — animals cluster together, royalty clusters together, etc.
plt.scatter(vecs_2d[:, 0], vecs_2d[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, vecs_2d[i])

The t-SNE plot reveals clusters that no one programmed: animals group together, vehicles group together, royalty groups together. Gender-related words form parallel lines (king/queen, prince/princess, man/woman). This emergent structure is the hallmark of distributional word vectors.

Historical context: why 2013?

Word embeddings existed before Word2Vec. Bengio et al. (2003) had shown that neural networks could learn word vectors. Why did Word2Vec — not the 2003 paper — trigger the revolution?

Scale: Bengio's NNLM could train on millions of words. Word2Vec could train on billions. The 1000x scale difference was the game-changer.
Speed: Word2Vec's shallow architecture and hierarchical softmax made training fast enough to iterate quickly. Researchers could train vectors overnight and evaluate them the next day.
The analogy surprise: The king − man + woman = queen result captured the imagination of the entire field. It was a vivid, memorable demonstration that word vectors had learned something deep about language.
Public release: Mikolov released both the source code (word2vec.c) and pre-trained vectors. This lowered the barrier to entry enormously — anyone could download and use the vectors immediately.
Timing: The deep learning revolution was just beginning. Word2Vec rode the wave of excitement around neural networks and showed that even shallow neural methods could discover rich structure in data.

The training configuration that won

The best configuration from the paper:

Model

Skip-gram (not CBOW)

Dimension

d = 1000 (diminishing returns above 300)

Data

6 billion words (Google News)

Training

Hierarchical softmax, 1 epoch

Result

66% semantic, 65% syntactic analogy accuracy

Speed comparison

Model	Training time	Why
NNLM (Bengio 2003)	Weeks on CPU	Hidden layer + full softmax over V
RNNLM	Days on CPU	Recurrent computation is sequential
CBOW	Hours on CPU	No hidden layer, hier. softmax
Skip-gram	~1 day on CPU	Shallow model + hier. softmax + multi-threading

Training Speed Comparison

Word2Vec's shallow architecture enables training that is orders of magnitude faster than prior neural language models on the same data.

Word2Vec achieved a 100-1000x speedup over prior neural language models. This speedup came entirely from simplifying the model (removing the hidden layer), not from better hardware. The lesson: simpler models can learn from more data, and more data wins.

What was the single most important factor for improving word vector quality in the paper's experiments?

More training data — 300-dim vectors on 6B words outperformed 600-dim vectors on 1B words, showing that data quantity mattered more than model size Larger embedding dimensions Using CBOW instead of Skip-gram

Chapter 9: Connections

What Word2Vec built on

Neural language models (Bengio et al., 2003): The first neural network to learn word embeddings as a byproduct of language modeling. Word2Vec stripped away the hidden layers and the language modeling objective, keeping only the embedding learning — making it vastly more scalable.

Latent Semantic Analysis (LSA) (Deerwester et al., 1990): The original count-based approach to word vectors. Build a word-document co-occurrence matrix and reduce dimensionality with SVD. Word2Vec achieved better results by predicting rather than counting.

Distributional semantics (Firth, 1957; Harris, 1954): The linguistic theory that meaning comes from usage patterns. Word2Vec operationalized this with a neural network that turned the theory into a practical algorithm.

What Word2Vec enabled

Word2Vec extensions (Mikolov et al., 2013b): The follow-up paper introducing negative sampling, subsampling, and phrase vectors. Covered in our companion veanor.

GloVe (Pennington et al., 2014): Showed that Word2Vec's Skip-gram implicitly factorizes a word-context co-occurrence matrix. GloVe made this explicit by directly optimizing a weighted least-squares objective on co-occurrence counts. See our GloVe veanor.

FastText (Bojanowski et al., 2017): Extended Word2Vec to use character n-grams, enabling embeddings for out-of-vocabulary words. The vector for "unfairness" is the sum of subword vectors like "un," "fair," "ness."

ELMo (Peters et al., 2018): Moved beyond static word vectors to context-dependent representations using a bidirectional LSTM. "Bank" gets different vectors in "river bank" vs. "bank account."

BERT and Transformers (Devlin et al., 2018): Transformers produce fully contextualized embeddings. Every token gets a unique vector based on the full input. The embedding layer of a transformer is essentially a Word2Vec-style lookup table — the connection is direct.

The Transformer embedding connection

The very first layer of every transformer (GPT, BERT, LLaMA) is an embedding lookup table — exactly a Word2Vec-style matrix W of shape (V, d). Token ID in, dense vector out. The difference is what happens after: transformers stack attention layers on top of the initial embeddings, producing context-dependent representations. But the initial embedding is still a static lookup, learned end-to-end.

python
import torch.nn as nn

# In a transformer, the first layer is literally Word2Vec's W_in:
embedding = nn.Embedding(vocab_size, d_model)  # shape: (V, d)

# Token IDs → dense vectors, exactly like Word2Vec
token_ids = torch.tensor([42, 17, 8391])
embeddings = embedding(token_ids)  # shape: (3, d_model)

# Then transformers add positional encoding + self-attention layers
# Word2Vec stops here; transformers build context on top

Word2Vec's lasting legacy

The paradigm shift: Word2Vec demonstrated three things that reshaped NLP. (1) Simple prediction tasks on raw text learn rich representations. (2) Linear structure in embedding spaces encodes semantic relationships. (3) Scale (data + efficient algorithms) matters more than model complexity. These insights directly foreshadow the transformer era: GPT and BERT are, at their core, prediction machines learning representations from vast amounts of text. Word2Vec was the proof of concept.

Known limitations

Static vectors. Each word gets one vector regardless of context. "Bank" in "river bank" and "bank account" get the same vector. ELMo and BERT solved this with contextual embeddings.
Antonym confusion. "Good" and "bad" get similar vectors because they appear in similar contexts. The distributional hypothesis captures substitutability, not meaning.
No compositionality. The vector for "not good" is just the sum of "not" and "good" vectors, which doesn't capture the negation. No mechanism for phrase-level semantics beyond the follow-up paper's phrase detection.
Bias. Word2Vec learned from human-generated text and encodes societal biases. "Man" is to "computer programmer" as "woman" is to "homemaker" — reflecting biased training data, not truth.

Cheat sheet

Core idea

Learn word vectors by predicting context (Skip-gram) or center (CBOW) words using a shallow neural network

Two architectures

CBOW: context → center (fast, good for frequent words). Skip-gram: center → context (better for rare words)

Key property

Vector arithmetic: king − man + woman ≈ queen. Semantic relationships are directions in the space

Scaling

Hierarchical softmax: O(log V) vs O(V). Trained on 6B words in a day. More data > bigger model

Impact

Launched the "word embeddings" era. Directly enabled GloVe, FastText, ELMo, and the transformer revolution

What are the three key insights from Word2Vec that foreshadowed the transformer era?

(1) Simple prediction on raw text learns rich representations, (2) Linear structure in embedding spaces encodes semantics, (3) Scale (data + efficiency) matters more than model complexity (1) Attention is all you need, (2) Bigger models are always better, (3) Supervised learning beats unsupervised (1) Word order is essential, (2) Deep networks outperform shallow ones, (3) Character-level models are superior

Word2Vec: Words as Vectors

Chapter 0: The Problem