Two shallow neural networks that learn to embed words into a continuous vector space where similar words cluster together — and king − man + woman ≈ queen.
You are building a search engine. A user types "automobile" and you want to return documents about "cars." You are building a translation system and need to know that "king" and "queen" are related the way "man" and "woman" are related. You are building a sentiment analyzer and want to know that "excellent" and "superb" carry the same meaning.
In 2013, the dominant way to represent words in NLP was as atomic symbols. Each word was an index in a vocabulary. "Cat" was word #4,271. "Dog" was word #8,392. There was no notion of similarity between them. The number 4,271 tells you nothing about cats, and subtracting 8,392 from 4,271 tells you nothing about the relationship between cats and dogs.
This representation — a word as an ID — fails in three fundamental ways:
Left: words as isolated IDs with no structure. Right: words as points in a continuous space where proximity encodes meaning. Click "Word2Vec" to see what we want to achieve.
Before Word2Vec, the standard way to feed words into a neural network was the one-hot vector. If your vocabulary has V words, you represent each word as a vector of length V with exactly one entry set to 1 and all others set to 0.
For example, with a tiny vocabulary of 5 words:
The problem is now mathematically precise. The dot product of any two different one-hot vectors is exactly zero:
"Cat" is as different from "dog" as "king" is from "fish." Every pair of words is equally dissimilar. This is the orthogonality problem: one-hot vectors are mutually orthogonal, so no similarity structure exists.
With a real vocabulary, this gets worse. Google's Word2Vec was trained with V = 1,000,000. That means each word is a million-dimensional vector with 999,999 zeros. This is wasteful, uninformative, and computationally prohibitive for any downstream model to process directly.
The contrast between one-hot and dense representations is stark:
| Property | One-hot | Dense embedding |
|---|---|---|
| Dimensions | V (e.g., 1,000,000) | d (e.g., 300) |
| Non-zero entries | 1 | All d |
| Memory per word | V × 4 bytes = 4 MB | d × 4 bytes = 1.2 KB |
| Dot product meaning | 0 for all pairs (no info) | Similarity score |
| Information content | log2(V) bits (just an ID) | ~300 floats of learned semantics |
What if we learned a matrix W of shape (V × d), where d ≪ V? Each row of W is a d-dimensional vector for one word. To get the embedding for word i, we simply multiply:
The one-hot multiplication is just a row lookup. But if we learn the values in W by training on a useful task, the resulting rows will encode meaning. This is the core of Word2Vec: learn W by predicting words from their context.
Click a word to see its one-hot vector multiplied by the embedding matrix W. The result is a dense, low-dimensional vector. Drag the slider to change the embedding dimension d.
python import numpy as np V = 10000 # vocabulary size d = 300 # embedding dimension # The embedding matrix (randomly initialized, then learned) W = np.random.randn(V, d) * 0.01 # One-hot lookup is just row indexing word_id = 42 one_hot = np.zeros(V) one_hot[word_id] = 1 # These two are equivalent: embedding_matmul = one_hot @ W # shape: (d,) embedding_lookup = W[word_id] # shape: (d,) — just a row lookup assert np.allclose(embedding_matmul, embedding_lookup)
We want to learn dense word vectors where similar words end up close together. But what does "similar" mean? How do we decide that "cat" should be near "kitten" and far from "democracy"?
The answer comes from a beautifully simple idea from linguistics, articulated by J.R. Firth in 1957:
This is the distributional hypothesis. It says that semantic similarity can be derived entirely from patterns of co-occurrence in text. You do not need dictionaries, ontologies, or human annotations. You just need a lot of text.
Consider these sentences:
"Cat" and "kitten" share almost identical contexts (sat, on, the, mat/rug, purred). "Dog" shares most context with "cat" (sat, on, the, mat) but differs in one key word (barked vs. purred). "President" shares far less context. From this, a good algorithm should learn: cat ≈ kitten > dog > president, in terms of similarity.
Word2Vec operationalizes this by defining a context window of size c around each target word. For the sentence "the cat sat on the mat" with c = 2, the context of "sat" is {the, cat, on, the} — the two words before and after.
Watch the context window slide across a sentence. The center word is the target (orange); the highlighted words are the context (teal). Drag the slider to change window size c.
Word2Vec's genius is turning the distributional hypothesis into a prediction task. Instead of counting co-occurrences (the traditional approach), it predicts words from their context. The prediction errors shape the embedding vectors. Two words that can substitute for each other in many contexts end up with similar vectors — because similar vectors make similar predictions.
Before Word2Vec, the distributional hypothesis was applied through count-based methods: build a large matrix of word co-occurrence counts, then reduce its dimensionality with SVD. This worked but had limitations:
Word2Vec showed that you can skip the matrix entirely. Instead of counting how often words co-occur and then compressing the counts, just train a model to predict co-occurrence. The embeddings learned by prediction turn out to have richer structure than those learned by counting — specifically, the linear analogy structure (king − man + woman ≈ queen) that made Word2Vec famous.
The window size c controls what kind of similarity the vectors capture:
| Window size | Type of similarity | Example neighbors of "dog" |
|---|---|---|
| c = 1-2 (narrow) | Syntactic | cat, horse, bird (same syntactic role) |
| c = 5-10 (wide) | Semantic | puppy, pet, canine (same meaning) |
| c = 15+ (very wide) | Topical | veterinarian, leash, breed (same topic) |
Narrow windows find words that can substitute for each other in a sentence. Wide windows find words that appear in the same kinds of documents. The paper's default of c = 5-10 captures a mix of syntactic and semantic similarity.
To build intuition, let's manually trace what Word2Vec would learn from a tiny corpus. Consider just four sentences:
With c = 1 (looking at only the immediately adjacent words), the context for each word includes:
| Target word | Observed contexts (c=1) |
|---|---|
| cat | the(4x), sat(1x), ate(1x) |
| dog | the(4x), sat(1x), ate(1x) |
| sat | cat(1x), dog(1x), on(2x) |
| ate | cat(1x), dog(1x), the(2x) |
| mat | the(1x) |
| fish | the(1x) |
"Cat" and "dog" have identical context distributions! They both appear after "the" and before "sat" and "ate." Word2Vec would give them nearly identical vectors. Meanwhile, "sat" and "ate" also have similar contexts (both appear after "cat" and "dog"), so they'd be close too — but slightly different because "sat" co-occurs with "on" while "ate" co-occurs with "the."
This simple example illustrates the distributional hypothesis in action: words with the same syntactic and semantic roles get the same contexts, and therefore the same vectors.
The first of Word2Vec's two models is Continuous Bag-of-Words (CBOW). The task: given the context words around a gap, predict the missing center word.
For the sentence "the cat sat on the mat" with window c = 2, if the target is "sat" (position 2), the training example is:
The architecture is a single hidden layer neural network:
Each context word is mapped to its embedding using the input matrix Win (shape V × d). This is just a row lookup for each context word.
CBOW takes the average of all context word embeddings. This is the "bag-of-words" part — order does not matter, only which words are present.
Where h is the hidden representation (shape d).
The hidden vector h is multiplied by an output matrix Wout (shape d × V) to produce scores for every word in the vocabulary, then softmax converts these to probabilities:
The loss is the negative log-probability of the true target word:
Context words are embedded, averaged into a hidden vector h, then projected to a probability distribution over the vocabulary. Click "Step" to walk through each stage.
python import numpy as np def cbow_forward(context_ids, W_in, W_out): """ context_ids: list of int, shape (2*c,) — IDs of context words W_in: (V, d) — input embedding matrix W_out: (d, V) — output projection matrix Returns: probability distribution over vocab, shape (V,) """ # Step 1: look up embeddings for each context word embeddings = W_in[context_ids] # shape: (2c, d) # Step 2: average them h = embeddings.mean(axis=0) # shape: (d,) # Step 3: project to vocabulary scores scores = h @ W_out # shape: (V,) # Step 4: softmax exp_scores = np.exp(scores - scores.max()) probs = exp_scores / exp_scores.sum() return probs # P(w | context) for each word w
Let's trace a complete forward and backward pass with a tiny vocabulary of 3 words and embedding dimension d = 2.
Vocabulary: {cat = 0, sat = 1, mat = 2}. Context = {cat, mat}, target = sat. Window c = 1.
Suppose our current matrices are:
Forward pass:
Backward pass:
python # Complete CBOW training step with backpropagation import numpy as np V, d = 3, 2 W_in = np.array([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]) W_out = np.array([[0.3, 0.6], [-0.1, 0.5], [0.2, -0.3]]) context_ids = [0, 2] # cat, mat target_id = 1 # sat lr = 0.1 # Forward h = W_in[context_ids].mean(axis=0) # [0.1, 0.1] scores = W_out @ h # [0.09, 0.04, -0.01] exp_s = np.exp(scores - scores.max()) probs = exp_s / exp_s.sum() # [0.350, 0.333, 0.317] loss = -np.log(probs[target_id]) # 1.099 # Backward error = probs.copy() error[target_id] -= 1 # [0.35, -0.67, 0.32] # Gradient for output matrix: outer product of error and h dW_out = np.outer(error, h) # (3, 2) W_out -= lr * dW_out # Gradient for hidden: project error back through output matrix dh = W_out.T @ error # (2,) # Distribute equally to all context words (CBOW averages, so grad is shared) for cid in context_ids: W_in[cid] -= lr * dh / len(context_ids) print(f"Loss before: {loss:.3f}") # 1.099 print(f"P(sat|context) before: {probs[target_id]:.3f}") # 0.333
After one gradient step, three things happen simultaneously:
Over millions of training examples, this push-pull dynamic shapes the embedding space. Words that frequently predict each other (because they share contexts) have their vectors pulled together. Words that should not predict each other have their vectors pushed apart.
Skip-gram flips CBOW on its head. Instead of predicting the center word from the context, it predicts each context word from the center word. Same data, opposite direction.
For the same sentence "the cat sat on the mat" with window c = 2 and target "sat":
This generates 2c training pairs from each word position. For c = 2, each center word produces 4 pairs: (sat, the), (sat, cat), (sat, on), (sat, the).
Even simpler than CBOW — no averaging needed:
Step 1: Look up the embedding of the center word from Win:
Step 2: For each context position, compute the probability of each vocabulary word using the output matrix Wout:
Step 3: The loss sums over all context positions:
Skip-gram treats each (center, context) pair as a separate training example, so it generates more training signal per word. For a sentence of length T with window c, Skip-gram produces ~2cT training pairs versus T for CBOW. This extra data is especially helpful for rare words: a rare word appears as a center word 2c times more often in Skip-gram's training data (once per context position) than in CBOW's (once total).
For the sentence "the cat sat on the mat" with c = 2, here are all the training pairs generated by Skip-gram:
| Center word | Context word | Distance |
|---|---|---|
| "the" (pos 0) | "cat" (pos 1) | 1 |
| "the" (pos 0) | "sat" (pos 2) | 2 |
| "cat" (pos 1) | "the" (pos 0) | 1 |
| "cat" (pos 1) | "sat" (pos 2) | 1 |
| "cat" (pos 1) | "on" (pos 3) | 2 |
| "sat" (pos 2) | "the" (pos 0) | 2 |
| "sat" (pos 2) | "cat" (pos 1) | 1 |
| "sat" (pos 2) | "on" (pos 3) | 1 |
| "sat" (pos 2) | "the" (pos 4) | 2 |
| ... and so on for "on", "the", "mat" | ||
This 6-word sentence generates approximately 6 × 2 × 2 = 24 training pairs. A corpus of 6 billion words with c = 5 generates ~60 billion pairs. Each pair provides a gradient signal that shapes the embedding space.
Skip-gram makes a strong simplifying assumption: each context word is predicted independently. The objective sums over context positions without any interaction:
This means Skip-gram cannot capture the fact that "cat" and "mat" both being in the context is more informative than either alone. This is a deliberate simplification for computational efficiency — and it works remarkably well despite the approximation.
Toggle between architectures to see how data flows differently. CBOW: many-to-one (context predicts center). Skip-gram: one-to-many (center predicts context).
python def skipgram_forward(center_id, W_in, W_out): """ center_id: int — ID of center word W_in: (V, d) — input embedding matrix W_out: (V, d) — output embedding matrix (note: transposed vs CBOW) Returns: probability distribution over vocab, shape (V,) """ # Step 1: look up center word embedding h = W_in[center_id] # shape: (d,) # Step 2: score every vocabulary word scores = W_out @ h # shape: (V,) — dot product of each output vector with h # Step 3: softmax exp_scores = np.exp(scores - scores.max()) probs = exp_scores / exp_scores.sum() return probs # P(w_context | w_center) for each word # Training: for each center word, predict each context word for center_pos in range(len(sentence)): center_id = sentence[center_pos] for offset in range(-c, c+1): if offset == 0: continue context_pos = center_pos + offset if 0 <= context_pos < len(sentence): context_id = sentence[context_pos] probs = skipgram_forward(center_id, W_in, W_out) loss = -np.log(probs[context_id] + 1e-10)
Both CBOW and Skip-gram end with a softmax over the entire vocabulary. Let's understand exactly what this means computationally — and why it becomes a bottleneck.
For Skip-gram, the probability of context word wO given center word wI is:
Here, vw is the input vector (from Win) and v'w is the output vector (from Wout) for word w. The dot product v'wO · vwI measures how well the center word predicts this particular context word.
The denominator sums over all V words. For V = 1,000,000, every single training step requires computing 1 million dot products and exponentials. With a corpus of billions of words and multiple context positions per word, this becomes impossibly slow.
For V = 106 and d = 300, that is 300 million floating-point operations just for the softmax — per training example. With billions of training examples, full softmax training would take years.
The gradient of the loss with respect to the output vector v'j is:
Where δj=wO is 1 if j is the actual context word and 0 otherwise. This gradient has a beautiful interpretation: for the correct context word (j = wO), the gradient pushes v'j closer to vwI by an amount proportional to 1 − P(j | wI). For all other words, the gradient pushes v'j away from vwI by an amount proportional to P(j | wI).
Computing softmax naively causes overflow. If any score is large (say 500), exp(500) = ∞. The standard trick: subtract the maximum score before exponentiating.
This gives the same probabilities (the max cancels in the ratio) but keeps all exponentials in a numerically safe range. Every softmax implementation uses this trick.
The loss −log P(wtarget | wcenter) is exactly the cross-entropy between the one-hot label vector and the softmax output. This is the same loss used in standard classification. Word2Vec is classification — it classifies the center word's context into one of V categories. The word vectors are a byproduct of learning to classify well.
This insight — that useful representations emerge as byproducts of prediction tasks — is one of the most important ideas in modern ML. It foreshadows BERT (predict masked tokens), GPT (predict next token), and contrastive learning (predict matching pairs).
Vocabulary: V = 4 words: {cat=0, dog=1, sat=2, mat=3}. Center word: "sat" (id=2). True context: "cat" (id=0). Embedding dim d = 2.
Current vectors:
Dot products (scores):
Softmax (subtract max = 0.20 for stability):
Loss:
The model gives "cat" a 27.3% probability. Not terrible for random initialization (uniform would be 25%), but there's room for improvement. The gradient will push v'cat toward vsat and push the other output vectors away.
Gradient for each output vector v'j:
After subtracting lr · gradient: v'cat moves 0.036 toward vsat. The other three output vectors each move slightly away from vsat. Over millions of such updates, the geometry of the embedding space takes shape.
The denominator requires summing over all V words. Drag the vocabulary size to see how cost scales linearly. The red zone marks where full softmax becomes impractical.
python def skipgram_loss_and_gradient(center_id, context_id, W_in, W_out): """Full softmax loss and gradient for one (center, context) pair.""" v_in = W_in[center_id] # (d,) — center embedding scores = W_out @ v_in # (V,) — dot product with every output vector # Softmax — this sum over V is the bottleneck exp_scores = np.exp(scores - scores.max()) probs = exp_scores / exp_scores.sum() # (V,) # Loss: negative log-prob of the true context word loss = -np.log(probs[context_id] + 1e-10) # Gradient w.r.t. output vectors grad_out = probs.copy() # (V,) — P(j|center) for all j grad_out[context_id] -= 1 # subtract 1 for the true word dW_out = np.outer(grad_out, v_in) # (V, d) — updates ALL V output vectors # Gradient w.r.t. input vector of center word dv_in = W_out.T @ grad_out # (d,) return loss, dv_in, dW_out
The original Word2Vec paper introduced hierarchical softmax as the first solution to the softmax bottleneck. The follow-up paper (Mikolov et al., 2013b) introduced negative sampling, which became more popular. We cover hierarchical softmax here; negative sampling gets its own veanor.
Instead of a flat softmax over V words, arrange the vocabulary as a binary tree. Each word is a leaf. To compute the probability of a word, you walk from the root to that leaf, making a binary (left/right) decision at each internal node.
At each internal node n, you compute:
The probability of reaching a word w at depth L is the product of the binary decisions along the path:
where sj ∈ {−1, +1} encodes whether you go left or right at node j.
A balanced binary tree has depth log2(V). So instead of summing over V terms, you compute log2(V) sigmoids. For V = 1,000,000:
That is a 50,000x speedup.
The paper uses a Huffman tree based on word frequency, not a balanced binary tree. Frequent words get shorter paths (fewer decisions), rare words get longer paths. Since frequent words appear more often in training, this minimizes the average number of operations per step.
Consider a tiny Huffman tree with 4 words: the(common), cat, sat, zinc(rare). In a Huffman tree, "the" gets a short path (1 decision) and "zinc" gets a long path (3 decisions).
Suppose the hidden vector is h = [0.5, 0.3] and the tree nodes have vectors:
Then:
Only 1 sigmoid for "the" (most common word) vs. 2 for "cat." The Huffman tree ensures the average path length is minimized based on word frequencies — an optimal prefix code.
During backpropagation, only the node vectors along the path from root to the target word receive gradient updates. For a target word at depth L, only L node vectors are updated (plus the center word's input vector). Compared to full softmax which updates all V output vectors, this is a massive saving.
For efficient sampling of negative words, the original Word2Vec implementation precomputes a large "unigram table" — an array of 100 million entries where each word appears proportionally to f(w)3/4. To sample a negative word, generate a random integer index into this table and look up the word. This is O(1) per sample, much faster than computing cumulative distributions.
python # Build unigram table for O(1) negative sampling import numpy as np def build_unigram_table(word_freqs, table_size=100_000_000, power=0.75): """Build a lookup table for O(1) negative sampling.""" powered = word_freqs ** power norm = powered / powered.sum() table = np.zeros(table_size, dtype=np.int32) idx = 0 cumulative = 0.0 for word_id in range(len(word_freqs)): cumulative += norm[word_id] while idx < table_size and idx / table_size < cumulative: table[idx] = word_id idx += 1 return table # Sampling is now O(1): # neg_word = table[random.randint(0, table_size)]
| Hyperparameter | Paper default | Effect |
|---|---|---|
| Embedding dim d | 300 | Quality increases up to ~300, then plateaus |
| Window size c | 5 (SG), 5 (CBOW) | Larger = more semantic; smaller = more syntactic |
| Learning rate | 0.025 (SG), 0.05 (CBOW) | Linearly decayed during training |
| Min word count | 5 | Words below this threshold are discarded |
| Negative samples k | 5-20 | More = slower but better for small data |
| Subsampling t | 10−5 | Threshold for discarding frequent words |
| Training epochs | 1-5 | More epochs on small data; 1 pass on large data |
A binary tree over the vocabulary. To compute P("cat"), follow the path from root to the "cat" leaf. Each node makes a binary decision using a sigmoid. Click a leaf word to highlight its path.
python def hierarchical_softmax(word_id, h, tree_vectors, tree_paths, tree_codes): """ word_id: target word h: hidden vector (d,) — from center/context embedding tree_vectors: (num_internal_nodes, d) — learned vectors at each node tree_paths: list of lists — path from root to each word tree_codes: list of lists — left(0)/right(1) at each step """ path = tree_paths[word_id] # e.g., [0, 3, 7] — internal node IDs codes = tree_codes[word_id] # e.g., [1, 0, 1] — go right, left, right log_prob = 0.0 for node_id, code in zip(path, codes): v_node = tree_vectors[node_id] # (d,) dot = v_node @ h # scalar sig = 1.0 / (1.0 + np.exp(-dot)) # sigmoid if code == 0: # go left log_prob += np.log(sig + 1e-10) else: # go right log_prob += np.log(1 - sig + 1e-10) return log_prob # log P(word | h) — only log2(V) operations!
This is the most surprising result of the paper — the reason Word2Vec captured the imagination of the entire ML community. The learned word vectors exhibit linear algebraic structure that encodes semantic relationships.
The most famous example:
This is not a cherry-picked curiosity. The paper found systematic, consistent vector relationships across many semantic categories:
| Relationship | Example | Vector operation |
|---|---|---|
| Gender | king : queen :: man : woman | king − man + woman ≈ queen |
| Country-Capital | France : Paris :: Japan : Tokyo | France − Paris + Tokyo ≈ Japan |
| Tense | walking : walked :: swimming : swam | walking − walked + swam ≈ swimming |
| Comparative | big : bigger :: small : smaller | big − bigger + smaller ≈ small |
| Plural | car : cars :: apple : apples | car − cars + apples ≈ apple |
Given an analogy "a is to b as c is to ?", we compute:
Then we find the word whose vector is closest to vec(?) by cosine similarity:
(excluding a, b, c themselves from the search).
The training objective forces the model to capture co-occurrence patterns. "King" and "queen" appear in similar contexts (royal, throne, crown) but differ in gender-related contexts. This difference is captured as a consistent direction in the vector space — the same direction that separates "man" from "woman."
More formally, if a direction d encodes "maleness" then:
Subtracting "man" from "king" removes the "person + d" component and leaves "royalty − person". Adding "woman" gives "royalty − person + person − d = royalty − d = queen". The direction d cancels in the subtraction and reappears in the addition.
Word2Vec uses cosine similarity rather than Euclidean distance to measure word similarity. Why?
Euclidean distance is sensitive to vector magnitude. A word that appears frequently tends to have a larger-magnitude vector (more training updates push it further from the origin). Cosine similarity normalizes by magnitude, measuring only the angle between vectors:
Range: −1 (opposite directions) to +1 (same direction). Two words with cosine similarity 0.9 are nearly interchangeable in many contexts. Words with similarity 0.1 are semantically distant.
| Word pair | Cosine similarity | Interpretation |
|---|---|---|
| cat — kitten | 0.89 | Very similar (same concept) |
| cat — dog | 0.76 | Similar (both pets/animals) |
| cat — car | 0.21 | Unrelated |
| king — queen | 0.73 | Related (both royalty, different gender) |
| king — man | 0.62 | Moderately related (both male, different status) |
| good — bad | 0.72 | High! Antonyms are close because they appear in similar contexts |
Real cosine similarities between word pairs in Word2Vec. Notice that "good" and "bad" (antonyms) are surprisingly similar because they appear in identical syntactic contexts.
Explore word analogies in 2D. The orange arrow (king − man) is the "royalty" direction. Adding it to "woman" lands near "queen." Click different analogy types to see other relationships.
python import numpy as np def analogy(a, b, c, word_vectors, vocab): """ Solve: a is to b as c is to ? word_vectors: dict mapping word -> np.array of shape (d,) vocab: list of all words """ # Compute the target vector: b - a + c target = word_vectors[b] - word_vectors[a] + word_vectors[c] # Normalize the target target = target / (np.linalg.norm(target) + 1e-10) best_word, best_sim = None, -1 exclude = {a, b, c} for word in vocab: if word in exclude: continue vec = word_vectors[word] vec_norm = vec / (np.linalg.norm(vec) + 1e-10) sim = np.dot(target, vec_norm) # cosine similarity if sim > best_sim: best_sim = sim best_word = word return best_word, best_sim # Example: king - man + woman = ? answer, score = analogy("man", "king", "woman", vectors, vocab) print(f"{answer} (similarity: {score:.3f})") # queen (similarity: 0.891)
How do you measure whether word vectors are "good"? The paper introduced a systematic evaluation based on word analogies and compared against prior methods.
The paper created a test set of 8,869 semantic analogies and 10,675 syntactic analogies. An analogy is answered correctly only if the exact word is the nearest neighbor — no partial credit.
| Category | # Questions | Example |
|---|---|---|
| Common capitals | 506 | Athens : Greece :: Baghdad : Iraq |
| All capitals | 4,524 | Abuja : Nigeria :: Accra : Ghana |
| Currency | 866 | Algeria : dinar :: Angola : kwanza |
| City-in-state | 2,467 | Chicago : Illinois :: Houston : Texas |
| Family | 506 | boy : girl :: brother : sister |
| Adjective-adverb | 992 | apparent : apparently :: rapid : rapidly |
| Opposite | 812 | aware : unaware :: possible : impossible |
| Comparative | 1,332 | bad : worse :: big : bigger |
| Superlative | 1,056 | bad : worst :: big : biggest |
| Present-past tense | 1,560 | dance : danced :: fly : flew |
| Nationality | 1,599 | Australia : Australian :: France : French |
| Plural nouns | 1,332 | banana : bananas :: bird : birds |
| Plural verbs | 870 | decrease : decreases :: go : goes |
The paper compared CBOW and Skip-gram against previous approaches (NNLM, RNNLM) across dimensions d and training data sizes:
| Model | Dim | Training Words | Semantic Acc % | Syntactic Acc % |
|---|---|---|---|---|
| NNLM | 100 | 6B | 23 | 53 |
| RNNLM | 640 | 320M | 9 | 36 |
| CBOW | 300 | 1.6B | 16 | 53 |
| Skip-gram | 300 | 1.6B | 55 | 59 |
| Skip-gram | 1000 | 6B | 66 | 65 |
Skip-gram with d=1000 on 6 billion words achieved 66% accuracy on semantic analogies and 65% on syntactic analogies — dramatically better than any prior method. The improvement came from two factors: the more efficient architecture allowed training on much larger datasets, and the Skip-gram objective captured semantic structure more effectively.
Three scaling laws emerged:
Analogy accuracy improves with more training data. The curve shows diminishing returns but no plateau — more data always helps. Drag to see how dimension and data interact.
The Neural Network Language Model (NNLM) of Bengio et al. used a full hidden layer with nonlinearities:
The hidden layer in NNLM requires a matrix multiplication of size (c·d × H) plus a nonlinearity, plus another matrix multiplication (H × V). Word2Vec removes the hidden layer entirely — the input embeddings connect directly to the output. This is why Word2Vec is called a "shallow" model. It has exactly zero hidden layers.
python import torch import torch.nn as nn class SkipGramNEG(nn.Module): def __init__(self, V, d): super().__init__() self.W_in = nn.Embedding(V, d) # input embeddings self.W_out = nn.Embedding(V, d) # output embeddings # Initialize small random values nn.init.uniform_(self.W_in.weight, -0.5/d, 0.5/d) nn.init.zeros_(self.W_out.weight) def forward(self, center, context, negatives): """ center: (batch,) — center word IDs context: (batch,) — true context word IDs negatives: (batch, k) — negative sample IDs Returns: scalar loss """ v_in = self.W_in(center) # (batch, d) v_pos = self.W_out(context) # (batch, d) v_neg = self.W_out(negatives) # (batch, k, d) # Positive: push context toward center pos_dot = (v_in * v_pos).sum(dim=1) # (batch,) pos_loss = -torch.nn.functional.logsigmoid(pos_dot) # Negative: push noise away from center neg_dot = torch.bmm(v_neg, v_in.unsqueeze(2)).squeeze() # (batch, k) neg_loss = -torch.nn.functional.logsigmoid(-neg_dot).sum(dim=1) return (pos_loss + neg_loss).mean()
The entire model is two embedding layers and a few dot products. No hidden layers, no convolutions, no attention. This simplicity is what allowed training on billions of words in 2013 — years before GPUs became standard for NLP.
Many descriptions of Word2Vec call it a "deep learning" model. This is misleading. Word2Vec has zero hidden layers. It is a log-linear model — the simplest possible neural network. Its power comes not from depth or complexity but from:
The lesson for modern ML: sometimes the biggest breakthroughs come from simplifying models and training them on more data, not from adding complexity.
In practice, most people use the gensim Python library to train and use Word2Vec, not a custom implementation:
python from gensim.models import Word2Vec # Train on your own corpus sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]] model = Word2Vec( sentences, vector_size=300, # embedding dimension window=5, # context window size min_count=5, # ignore rare words sg=1, # 1 = Skip-gram, 0 = CBOW negative=5, # number of negative samples sample=1e-5, # subsampling threshold workers=4, # parallel threads epochs=5, ) # Use the trained model vec = model.wv["cat"] # (300,) sim = model.wv.similarity("cat", "dog") # 0.76 neighbors = model.wv.most_similar("king") # [("queen", 0.73), ...] # Analogy: king - man + woman = ? result = model.wv.most_similar( positive=["king", "woman"], negative=["man"] ) print(result[0]) # ("queen", 0.89) # Load pre-trained Google News vectors (1.6 GB) from gensim.models import KeyedVectors wv = KeyedVectors.load_word2vec_format( "GoogleNews-vectors-negative300.bin", binary=True )
Word2Vec vectors were used as features in virtually every NLP system from 2013-2018:
| Task | How Word2Vec was used | Improvement over one-hot |
|---|---|---|
| Sentiment analysis | Average word vectors over document, feed to classifier | +5-10% accuracy |
| Named entity recognition | Concatenate word vector with other features for CRF | +2-3 F1 points |
| Machine translation | Initialize encoder/decoder embeddings | Faster convergence, +1-2 BLEU |
| Question answering | Embed question and passages, compute similarity | +5-15% accuracy |
| Relation extraction | Word vectors as input features to CNN/LSTM | +3-5 F1 points |
The common pattern: take pre-trained Word2Vec vectors, use them as the input representation for a task-specific model, and fine-tune or freeze depending on dataset size. This "pre-train then fine-tune" paradigm became the standard approach and was later scaled up dramatically by ELMo, BERT, and GPT.
Bolukbasi et al. (2016) famously showed that Word2Vec encodes societal biases:
These reflect biases in the training data (Google News), not facts about the world. Debiasing techniques were developed (projecting out the gender direction), but this remains an active area of research. Every word embedding method — Word2Vec, GloVe, FastText, BERT — inherits biases from its training data.
Why 300 dimensions? The paper tested d from 50 to 1000 and found 300 to be a sweet spot. There is an information-theoretic argument for why:
Going beyond 300 starts capturing corpus-specific noise rather than generalizable semantic structure. This is a form of overfitting to the training corpus.
Word2Vec spawned an entire ecosystem of tools and extensions:
| Tool/Extension | Contribution |
|---|---|
| gensim (Python) | Most popular implementation for training and using Word2Vec |
| FastText (Facebook) | Extended to character n-grams for morphology and OOV words |
| doc2vec | Extended Word2Vec to learn document-level vectors |
| sense2vec | Disambiguated senses: "bank_NOUN" vs. "bank_VERB" |
| node2vec | Applied Word2Vec to graph nodes via random walks |
| item2vec | Applied to recommendations: items as "words," sessions as "sentences" |
The "2vec" naming convention became ubiquitous — any domain where you want embeddings from sequences adopted the Word2Vec framework.
Individual dimensions of Word2Vec vectors don't correspond to interpretable features. Unlike hand-crafted features (where dimension 1 might be "is animal?" and dimension 2 might be "is living?"), Word2Vec dimensions are distributed. A single semantic concept (like "animality") is spread across many dimensions, and each dimension encodes many concepts simultaneously.
However, directions in the space can be interpretable. The vector king − queen ≈ man − woman shows that the "gender" concept corresponds to a specific direction. PCA analysis reveals that the top principal components often correspond to broad categories like frequency, concreteness, and sentiment — but individual dimensions do not.
This distributed representation is what gives Word2Vec its power: it can represent exponentially more concepts with d dimensions than a one-hot-per-concept scheme. With 300 dimensions, the space can distinguish billions of semantic nuances — far more than 300 hand-crafted features could capture.
The most intuitive way to verify that Word2Vec learned meaningful representations is to look at nearest neighbors. Here are real examples from the Google News vectors:
| Query word | Top 5 nearest neighbors (cosine) |
|---|---|
| python | pythons, snake, cobra, viper, reptile |
| javascript | jQuery, CSS, HTML, PHP, Ajax |
| berlin | Frankfurt, Munich, Hamburg, Stuttgart, Dresden |
| guitar | guitars, bass, drums, piano, acoustic |
| sad | saddened, heartbroken, upset, depressed, tragic |
| Einstein | relativity, physicist, Bohr, quantum, Heisenberg |
Notice how "python" returns snakes (the word is more common in news as an animal), while "javascript" returns web technologies. The context in the training data determines the neighborhood. If you trained on a programming corpus, "python" would neighbor "java," "ruby," and "compiler."
Some of the most distant word pairs (lowest cosine similarity) are also revealing:
The near-zero (but not exactly zero) similarity reflects that in a 300-dimensional space, random vectors are approximately orthogonal. Unrelated words behave like random directions.
Since word vectors live in 300-dimensional space, we can't visualize them directly. t-SNE (t-distributed Stochastic Neighbor Embedding) projects them to 2D while preserving local neighborhood structure. The result shows clear semantic clusters:
python from sklearn.manifold import TSNE import matplotlib.pyplot as plt # Select 200 words to visualize words = ["king", "queen", "prince", "princess", "man", "woman", "boy", "girl", "cat", "dog", "fish", "bird", "car", "truck", "bus", "bicycle", # ... more words ...] vecs = np.array([model.wv[w] for w in words]) # (200, 300) # Project to 2D tsne = TSNE(n_components=2, perplexity=30, random_state=42) vecs_2d = tsne.fit_transform(vecs) # (200, 2) # Plot — animals cluster together, royalty clusters together, etc. plt.scatter(vecs_2d[:, 0], vecs_2d[:, 1]) for i, word in enumerate(words): plt.annotate(word, vecs_2d[i])
The t-SNE plot reveals clusters that no one programmed: animals group together, vehicles group together, royalty groups together. Gender-related words form parallel lines (king/queen, prince/princess, man/woman). This emergent structure is the hallmark of distributional word vectors.
Word embeddings existed before Word2Vec. Bengio et al. (2003) had shown that neural networks could learn word vectors. Why did Word2Vec — not the 2003 paper — trigger the revolution?
The best configuration from the paper:
| Model | Training time | Why |
|---|---|---|
| NNLM (Bengio 2003) | Weeks on CPU | Hidden layer + full softmax over V |
| RNNLM | Days on CPU | Recurrent computation is sequential |
| CBOW | Hours on CPU | No hidden layer, hier. softmax |
| Skip-gram | ~1 day on CPU | Shallow model + hier. softmax + multi-threading |
Word2Vec's shallow architecture enables training that is orders of magnitude faster than prior neural language models on the same data.
Word2Vec achieved a 100-1000x speedup over prior neural language models. This speedup came entirely from simplifying the model (removing the hidden layer), not from better hardware. The lesson: simpler models can learn from more data, and more data wins.
Neural language models (Bengio et al., 2003): The first neural network to learn word embeddings as a byproduct of language modeling. Word2Vec stripped away the hidden layers and the language modeling objective, keeping only the embedding learning — making it vastly more scalable.
Latent Semantic Analysis (LSA) (Deerwester et al., 1990): The original count-based approach to word vectors. Build a word-document co-occurrence matrix and reduce dimensionality with SVD. Word2Vec achieved better results by predicting rather than counting.
Distributional semantics (Firth, 1957; Harris, 1954): The linguistic theory that meaning comes from usage patterns. Word2Vec operationalized this with a neural network that turned the theory into a practical algorithm.
Word2Vec extensions (Mikolov et al., 2013b): The follow-up paper introducing negative sampling, subsampling, and phrase vectors. Covered in our companion veanor.
GloVe (Pennington et al., 2014): Showed that Word2Vec's Skip-gram implicitly factorizes a word-context co-occurrence matrix. GloVe made this explicit by directly optimizing a weighted least-squares objective on co-occurrence counts. See our GloVe veanor.
FastText (Bojanowski et al., 2017): Extended Word2Vec to use character n-grams, enabling embeddings for out-of-vocabulary words. The vector for "unfairness" is the sum of subword vectors like "un," "fair," "ness."
ELMo (Peters et al., 2018): Moved beyond static word vectors to context-dependent representations using a bidirectional LSTM. "Bank" gets different vectors in "river bank" vs. "bank account."
BERT and Transformers (Devlin et al., 2018): Transformers produce fully contextualized embeddings. Every token gets a unique vector based on the full input. The embedding layer of a transformer is essentially a Word2Vec-style lookup table — the connection is direct.
The very first layer of every transformer (GPT, BERT, LLaMA) is an embedding lookup table — exactly a Word2Vec-style matrix W of shape (V, d). Token ID in, dense vector out. The difference is what happens after: transformers stack attention layers on top of the initial embeddings, producing context-dependent representations. But the initial embedding is still a static lookup, learned end-to-end.
python import torch.nn as nn # In a transformer, the first layer is literally Word2Vec's W_in: embedding = nn.Embedding(vocab_size, d_model) # shape: (V, d) # Token IDs → dense vectors, exactly like Word2Vec token_ids = torch.tensor([42, 17, 8391]) embeddings = embedding(token_ids) # shape: (3, d_model) # Then transformers add positional encoding + self-attention layers # Word2Vec stops here; transformers build context on top