Representing meaning as geometry — how machines learn that "cat" and "kitten" are neighbors.
You type "cat" into a search engine. It returns results about cats, kittens, felines, and "adopt a pet." But to a computer, "cat" is just number 4,817 in a 50,000-word dictionary, and "kitten" is number 23,401. The gap between those two IDs is arbitrary: it says no more about meaning than the gap between "cat" and "quantum." The computer has no idea they're related.
This is the fundamental problem of word representation. How do you encode a word so that a machine can tell which words are similar and which are not?
The naive approach is called one-hot encoding. Each word gets a vector with a single 1 and all other entries 0. If your vocabulary has 50,000 words, then "cat" is a 50,000-dimensional vector with a 1 in position 4,817 and zeros everywhere else. "Dog" has a 1 in position 12,045. "Quantum" has a 1 in position 38,772.
Now try to compute similarity. The dot product of any two distinct one-hot vectors is zero, because they never have a 1 in the same position. Every word is equally distant from every other word. "Cat" is as far from "dog" as from "democracy." This makes any downstream task (search, translation, question answering) nearly impossible.
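To see the problem concretely, here is a minimal sketch with a toy five-word vocabulary. The words and their indices are made up for illustration; a real vocabulary would have tens of thousands of entries, but the result is the same.

```python
import numpy as np

# Toy vocabulary; the words and their indices are illustrative assumptions.
vocab = {"cat": 0, "kitten": 1, "dog": 2, "democracy": 3, "quantum": 4}

def one_hot(word):
    """Return a vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

# Collect the dot product for every pair of distinct words.
dots = [one_hot(a) @ one_hot(b) for a in vocab for b in vocab if a != b]
print(set(dots))  # {0.0}

# The Euclidean distance between any two distinct words is always sqrt(2).
d_near = np.linalg.norm(one_hot("cat") - one_hot("kitten"))
d_far = np.linalg.norm(one_hot("cat") - one_hot("quantum"))
print(d_near, d_far)
```

Both distances come out identical, which is exactly why one-hot vectors carry no similarity signal.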
The solution is dense word vectors (also called word embeddings). Instead of a 50,000-dimensional vector with one 1, we represent each word as a short, dense vector — say 300 numbers. These numbers are learned from data, and the magic is: words that appear in similar contexts end up with similar vectors. "Cat" and "kitten" land close together. "Cat" and "quantum" end up far apart.
This lesson covers how those dense vectors are learned. We'll build from the raw insight ("words that appear in similar contexts have similar meanings") through the two dominant algorithms (Word2Vec and GloVe), to the surprising emergent property that makes embeddings famous: vector arithmetic. King − man + woman ≈ queen.
Toggle between one-hot representation (all words equidistant on a circle) and embedding space (similar words cluster together). Hover over any word to see its distances to other words.
"I adopted a cute _____ from the shelter." You know the blank is "cat" or "dog" or "rabbit." Not "carburetor." Not "theorem." The surrounding words act like a fingerprint for meaning. This observation — that a word's meaning is determined by the words that appear near it — is called the distributional hypothesis.
The idea goes back to linguist J.R. Firth, who wrote in 1957: "You shall know a word by the company it keeps." If "coffee" and "tea" consistently appear near "drink," "morning," "cup," and "hot," then they must mean similar things. If "bank" appears near both "river" and "money," that's a clue it has multiple meanings.
To make this concrete, we define a context window: a fixed number of words before and after a target word. With a window size of 2, in the sentence "The cat sat on the mat," the context of "sat" is {"The", "cat", "on", "the"}. The context of "cat" is {"The", "sat", "on"}, because only one word precedes it.
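A context window is easy to express with list slicing. The helper name `context_window` below is hypothetical, and whitespace tokenization is a simplifying assumption.

```python
def context_window(tokens, center, window=2):
    """Return up to `window` tokens on each side of position `center`."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right

tokens = "The cat sat on the mat".split()
sat_context = context_window(tokens, tokens.index("sat"))
cat_context = context_window(tokens, tokens.index("cat"))
print(sat_context)  # ['The', 'cat', 'on', 'the']
print(cat_context)  # ['The', 'sat', 'on']
```

Words near the start or end of a sentence simply get a shorter context, since the window is clipped at the sentence boundary.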
We then build a co-occurrence matrix. Each row is a word, each column is a word, and each entry counts how many times those two words appeared within the same context window across a large corpus. If "coffee" and "cup" co-occur 847 times, that number goes in cell (coffee, cup).
There's a problem with raw co-occurrence counts, though. Common words like "the," "is," and "of" co-occur with everything. They dominate the matrix without providing useful signal. The word "the" might co-occur with "coffee" 5,000 times — but that tells you nothing about coffee. Later methods (TF-IDF, PMI, GloVe) address this by downweighting frequent co-occurrences. But the basic matrix already contains a surprising amount of structure.
If you take each row of this matrix as a word's vector, words with similar row patterns will end up close in vector space. Not optimally close, but close. The entire field of word embeddings is about finding better ways to extract and compress the signal in this matrix.
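The sketch below builds a sparse co-occurrence table from a tiny made-up corpus and compares the rows for "coffee" and "tea." The corpus, the function names, and the choice to store counts in a `Counter` rather than a dense matrix are all assumptions made for brevity.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, neighbor) pair shares a window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# A tiny made-up corpus, purely for illustration.
corpus = [
    "i drink hot coffee every morning",
    "i drink hot tea every morning",
    "the cat sat on the mat",
]
counts = cooccurrence_counts(corpus)

def row(word):
    """The word's row of the matrix, as a {neighbor: count} dict."""
    return {nb: c for (w, nb), c in counts.items() if w == word}

print(row("coffee"))
print(row("tea"))
```

In this toy corpus "coffee" and "tea" never appear together, yet their rows are identical because they keep the same company. That is the distributional hypothesis at work.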
Slide the center word through the sentence. The context window highlights neighbors. Co-occurrence counts accumulate in the matrix below. Switch sentences with the buttons.
I give you four words: "the," "cat," "on," "the." Which word goes in the middle? "Sat." You just did Continuous Bag of Words (CBOW). The model sees context words and tries to predict the center word. By doing this millions of times on a large corpus, the model learns embeddings that encode meaning.
Tomas Mikolov introduced Word2Vec in 2013, and CBOW is one of its two training objectives. Here's how it works, step by step:
Step 1: Embedding lookup. Each context word is converted from a one-hot vector to a dense embedding. Conceptually this multiplies the one-hot vector by the embedding matrix W. If our vocabulary has V words and our embedding dimension is d, then W is a [V × d] matrix, and multiplying it by a one-hot vector with a 1 in position i simply selects row i of W. In practice this is a table lookup, so no actual multiplication is needed.
Step 2: Average. All context embeddings are averaged into a single vector. If we have 4 context words, each a 300-dimensional vector, the average is also 300-dimensional. This is the "bag of words" part — order doesn't matter.
Step 3: Projection. The averaged vector is multiplied by a second matrix W′ of shape [d × V], producing a V-dimensional score vector. Each entry is a score for how likely that vocabulary word is to be the center word.
Step 4: Softmax. The scores are passed through a softmax function to produce a probability distribution over the entire vocabulary. The word with the highest probability is the model's prediction.
The loss is negative log-likelihood: we want to maximize the probability of the true center word. Gradients flow back through W′ and W, updating both matrices. After training, we throw away W′ and keep W — the rows of W are our word embeddings.
Let's nail down every tensor shape. Suppose V = 50,000 (vocabulary size), d = 300 (embedding dimension), and the context window has 4 words:
| Object | Shape | What it is |
|---|---|---|
| One-hot input | [V] = [50000] | Each context word as one-hot |
| W (embedding) | [V, d] = [50000, 300] | Embedding matrix (the thing we want) |
| Context embeddings | [4, d] = [4, 300] | 4 looked-up rows of W |
| Average | [d] = [300] | Mean of context embeddings |
| W′ (projection) | [d, V] = [300, 50000] | Maps back to vocabulary space |
| Scores | [V] = [50000] | Raw logits for each word |
| Output | [V] = [50000] | Softmax probabilities |
```python
import numpy as np

# Shapes: V=50000, d=300, window=2 (4 context words)
V, d = 50000, 300
W = np.random.randn(V, d) * 0.01   # embedding matrix [V, d]
Wp = np.random.randn(d, V) * 0.01  # projection matrix [d, V]

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def cbow_forward(context_ids):
    # context_ids: list of 4 integers
    embeds = W[context_ids]      # [4, 300]: lookup, not multiply
    avg = embeds.mean(axis=0)    # [300]: average context
    scores = avg @ Wp            # [50000]: project to vocab
    probs = softmax(scores)      # [50000]: probability distribution
    return probs
```
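The forward pass is only half the story. The following sketch adds the backward pass described earlier: negative log-likelihood loss, gradients through W′ and W, and a plain SGD update. The tiny sizes, the learning rate, the word IDs, and the function name `cbow_step` are assumptions chosen so the example runs instantly; it is not the optimized Word2Vec implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                       # tiny sizes so the example runs instantly
W = rng.normal(0, 0.1, (V, d))     # embedding matrix [V, d]
Wp = rng.normal(0, 0.1, (d, V))    # projection matrix [d, V]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_step(W, Wp, context_ids, center_id, lr=0.1):
    """One SGD step on a single (context, center) example; returns the loss."""
    avg = W[context_ids].mean(axis=0)      # [d]
    probs = softmax(avg @ Wp)              # [V]
    loss = -np.log(probs[center_id])       # negative log-likelihood

    dscores = probs.copy()                 # softmax + NLL gradient:
    dscores[center_id] -= 1.0              # probs minus one-hot target
    dWp = np.outer(avg, dscores)           # [d, V]
    davg = Wp @ dscores                    # [d]

    # Each context row receives an equal share of the gradient on the average.
    # subtract.at accumulates correctly even if a word ID appears twice.
    np.subtract.at(W, context_ids, lr * davg / len(context_ids))
    Wp -= lr * dWp
    return loss

context, center = [1, 2, 4, 5], 3          # hypothetical word IDs
losses = [cbow_step(W, Wp, context, center) for _ in range(200)]
print(losses[0], losses[-1])
```

The loss starts near log(V) because the initial prediction is almost uniform, then falls as the model memorizes this one example. Real training cycles through millions of different windows instead of repeating one.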
Click "Next Step" to walk through the 4 stages of a CBOW forward pass. Tensor shapes are shown at each stage.
CBOW asks: "Given the neighborhood, who lives here?" Skip-gram flips the question: "Given who lives here, describe the neighborhood." Instead of predicting the center word from context, we predict each context word from the center word.
Given the center word "sat" and a window of 2, Skip-gram generates four training pairs: (sat → the), (sat → cat), (sat → on), (sat → the). Each pair asks: "Given 'sat,' can you predict 'cat'?" "Given 'sat,' can you predict 'on'?" The model sees only one word at a time and must predict each neighbor independently.
This seemingly small change has a profound consequence. CBOW generates one training example per window position: (context → center). Skip-gram generates 2×window training examples: one for each (center → context) pair. With a window of 5, that's 10 training pairs per position instead of 1.
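The difference in training-set size is easy to check by generating both kinds of examples from one sentence. The function names here are hypothetical, and the sentence is lowercased and split on whitespace for simplicity.

```python
def cbow_examples(tokens, window=2):
    """One (context_words, center_word) example per position."""
    out = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        out.append((context, center))
    return out

def skipgram_pairs(tokens, window=2):
    """One (center_word, context_word) pair per neighbor of each position."""
    return [(center, ctx)
            for context, center in cbow_examples(tokens, window)
            for ctx in context]

tokens = "the cat sat on the mat".split()
cbow = cbow_examples(tokens)
pairs = skipgram_pairs(tokens)
print(len(cbow), len(pairs))  # 6 18
print([p for p in pairs if p[0] == "sat"])
```

Positions in the middle of the sentence produce the full 2 × window pairs, while positions near the edges produce fewer because the window is clipped.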
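The objective maximizes the probability of every observed (center, context) pair. Equivalently, we minimize:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t)$$

where T is the total number of words in the corpus and m is the window size. This is just the average negative log-probability of predicting each context word.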
| Property | CBOW | Skip-gram |
|---|---|---|
| Input | Context words (many) | Center word (one) |
| Output | Center word (one) | Context words (many) |
| Training pairs per position | 1 | 2 × window |
| Speed | Faster (fewer examples) | Slower (more examples) |
| Rare words | Worse (few updates) | Better (more updates) |
| Frequent words | Better (averaging smooths) | Okay |
In practice, Skip-gram with negative sampling (which we'll cover next chapter) became the default Word2Vec configuration. Most pre-trained Word2Vec embeddings you'll find online use this setup.
Click words in the sentence to see training pairs generated by each method. Left: CBOW (many → one). Right: Skip-gram (one → many).
There's a fatal flaw in everything we've described. The softmax denominator sums over EVERY word in the vocabulary:

$$P(w_o \mid w_c) = \frac{\exp(u_{w_o}^\top v_{w_c})}{\sum_{w=1}^{V} \exp(u_w^\top v_{w_c})}$$

Here v denotes the embedding a word uses as a center word and u the embedding it uses as a context word. V is 50,000 or more. For every single training example, you compute 50,000 dot products, exponentiate them all, and sum them up. Then you compute gradients for all 50,000 outputs. That's not just slow; it's impractical at scale. A corpus of a billion words with a vocabulary of 100,000 means 100 trillion dot products just for the denominators.
Negative sampling (Mikolov et al., 2013) sidesteps this entirely. Instead of asking "what's the probability of the correct word among all V words?", it asks a simpler question: "Can you tell the correct word apart from K random words?"
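Here's the reformulation. Given a center word w_c and a true context word w_o, and writing v for center-word vectors and u for context-word vectors, maximize:

$$\log \sigma(u_{w_o}^\top v_{w_c}) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P(w)} \left[ \log \sigma(-u_{w_k}^\top v_{w_c}) \right]$$

The first term says: make the dot product of the true pair (w_c, w_o) large and positive, so its sigmoid approaches 1. The second term says: for K randomly sampled "negative" words, make their dot products with the center word large and negative, so that σ of the negated dot product approaches 1 (equivalently, σ of the dot product itself approaches 0).

Here σ is the sigmoid function: σ(x) = 1 / (1 + exp(−x)). It maps any real number into the open interval (0, 1), which makes it a natural fit for binary classification.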
Negative words are sampled from a noise distribution $P(w) = \text{count}(w)^{3/4} / Z$, where Z normalizes the probabilities so they sum to 1. The 3/4 exponent is crucial: it sits between uniform sampling (exponent 0) and frequency-proportional sampling (exponent 1). Pure frequency-proportional sampling would oversample "the" and "is." Uniform would waste time on ultra-rare words. The 3/4 power smooths the distribution, giving rare words a fighting chance while still preferring common ones.
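The effect of the exponent is easiest to see with numbers. The word counts below are hypothetical, chosen only to span several orders of magnitude, and `noise_distribution` is an illustrative helper name.

```python
import numpy as np

# Hypothetical word counts, chosen only to illustrate the effect.
counts = {"the": 1_000_000, "cat": 10_000, "axolotl": 10}
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

def noise_distribution(freq, power=0.75):
    """P(w) proportional to count(w) ** power."""
    weights = freq ** power
    return weights / weights.sum()

p_unigram = noise_distribution(freq, power=1.0)
p_smoothed = noise_distribution(freq, power=0.75)
for w, pu, ps in zip(words, p_unigram, p_smoothed):
    print(f"{w:8s} unigram={pu:.6f} smoothed={ps:.6f}")

# Draw K = 5 negatives from the smoothed distribution.
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=p_smoothed)
print(negatives)
```

With these counts, the smoothed distribution lowers the share of "the" slightly and raises the share of "axolotl" by more than an order of magnitude, while still sampling common words most of the time.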
Mikolov found that K = 5–20 works well for small datasets, and K = 2–5 suffices for large ones. More data means less noise in the gradients, so fewer negatives are needed.
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def neg_sampling_loss(center_vec, context_vec, neg_vecs):
    # center_vec:  [d]    embedding of center word
    # context_vec: [d]    embedding of true context word
    # neg_vecs:    [K, d] embeddings of K negative samples

    # Positive pair: push dot product UP
    pos_score = sigmoid(context_vec @ center_vec)   # scalar
    pos_loss = -np.log(pos_score + 1e-10)

    # Negative pairs: push dot products DOWN
    neg_scores = sigmoid(-(neg_vecs @ center_vec))  # [K]
    neg_loss = -np.log(neg_scores + 1e-10).sum()

    return pos_loss + neg_loss
```
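To show why this is cheap, the sketch below performs the corresponding gradient update. Only K + 2 vectors are touched per training pair, no matter how large the vocabulary is. The sizes, learning rate, random initialization, and the name `neg_sampling_step` are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def neg_sampling_step(center, context, negs, lr=0.1):
    """One in-place SGD step; returns the loss before the update."""
    pos = sigmoid(context @ center)        # scalar, want it near 1
    neg = sigmoid(negs @ center)           # [K], want each near 0
    loss = -np.log(pos) - np.log(1 - neg).sum()

    # Gradients of the loss with respect to each vector involved.
    g_center = (pos - 1) * context + neg @ negs   # [d]
    g_context = (pos - 1) * center                # [d]
    g_negs = np.outer(neg, center)                # [K, d]

    center -= lr * g_center
    context -= lr * g_context
    negs -= lr * g_negs
    return loss

rng = np.random.default_rng(0)
d, K = 8, 5
center = rng.normal(0, 0.1, d)
context = rng.normal(0, 0.1, d)
negs = rng.normal(0, 0.1, (K, d))

losses = [neg_sampling_step(center, context, negs) for _ in range(100)]
print(losses[0], losses[-1])
```

The loss uses the identity log σ(−x) = log(1 − σ(x)), so it matches the objective in the text. It begins near (K + 1) · log 2 because every sigmoid starts close to 0.5.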
The center word (purple) should be pulled close to the true context (green) and pushed away from negative samples (red). Click "Sample Negatives" to draw new random negatives. Adjust K to change the number of negatives.
By 2014, there were two rival philosophies for building word vectors. The count-based camp said: build a big co-occurrence matrix, reduce its dimensionality (via SVD or similar), and use the compressed vectors. The prediction-based camp (Word2Vec) said: train a neural network to predict context words. Which was better?
Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford answered: the two camps are closer than they look. Their paper on GloVe (Global Vectors for Word Representation) argued that prediction-based models like Word2Vec implicitly exploit the same co-occurrence statistics that count-based methods use explicitly. They then built a method that fits word vectors to those statistics directly, combining the efficiency of counting with the quality of prediction.
Consider the words "ice" and "steam." Both co-occur with "water." But the ratio of their co-occurrences with a third word reveals the relationship:
| Probe word k | P(k \| ice) | P(k \| steam) | Ratio P(k \| ice) / P(k \| steam) |
|---|---|---|---|
| solid | 1.9 × 10⁻⁴ | 2.2 × 10⁻⁵ | 8.9 (ice-related) |
| gas | 6.6 × 10⁻⁵ | 7.8 × 10⁻⁴ | 0.085 (steam-related) |
| water | 3.0 × 10⁻³ | 2.2 × 10⁻³ | 1.36 (both, so neutral) |
| fashion | 1.7 × 10⁻⁵ | 1.8 × 10⁻⁵ | 0.96 (neither, so neutral) |
When the ratio is large (≫ 1), the probe word is ice-related. When it's small (≪ 1), it's steam-related. When it's ≈ 1, the probe is neutral. Raw counts can't distinguish these cases: "water" co-occurs often with both "ice" and "steam," so both raw counts are high. But the ratio tells you it's neutral.
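The ratio test can be reproduced directly from the probabilities in the table. Because the table entries are rounded, the ratios computed here differ slightly from the table's last column (for example, about 8.6 rather than 8.9 for "solid"). The threshold of 3 used to label a probe is an arbitrary assumption, not a value from the paper.

```python
# Probabilities from the table above: (P(k | ice), P(k | steam)).
probs = {
    "solid":   (1.9e-4, 2.2e-5),
    "gas":     (6.6e-5, 7.8e-4),
    "water":   (3.0e-3, 2.2e-3),
    "fashion": (1.7e-5, 1.8e-5),
}

def classify(ratio, threshold=3.0):
    """Label a probe by its ratio; the threshold of 3 is an arbitrary choice."""
    if ratio > threshold:
        return "ice-related"
    if ratio < 1 / threshold:
        return "steam-related"
    return "neutral"

ratios = {k: p_ice / p_steam for k, (p_ice, p_steam) in probs.items()}
for k, r in ratios.items():
    print(f"{k:8s} ratio={r:6.3f} -> {classify(r)}")
```

Note that "water" and "fashion" both land in the neutral band even though their raw probabilities differ by two orders of magnitude.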
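The GloVe objective is a weighted least-squares fit:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

This looks simple because it is. We want the dot product of two word vectors (plus bias terms) to equal the log of their co-occurrence count X_ij. The f(X_ij) is a weighting function that prevents frequent pairs from dominating:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

with x_max = 100 and α = 0.75 (the defaults from the paper). This function caps the weight at 1 for pairs that co-occur at least 100 times, and ramps up smoothly for less-frequent pairs. Without it, pairs involving extremely common words like "the" would dominate the entire objective.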
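Here is a small sketch of the weighting function and the per-pair loss term it scales. The function names are hypothetical, and vectors are plain Python lists to keep the example dependency-free.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): ramps up as (x / x_max) ** alpha, then saturates at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one co-occurring pair (requires x_ij > 0)."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

for x in (1, 10, 50, 100, 5000):
    print(x, round(glove_weight(x), 3))
```

A pair seen 5,000 times gets the same weight as one seen 100 times, while a pair seen once still contributes, just with a weight of roughly 0.03. Pairs with a zero count are skipped entirely, since log 0 is undefined.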
Drag the sliders to change xmax and α. Watch how the weighting curve changes — it controls how much influence frequent vs. rare co-occurrences have on the objective.
Take the vector for "king." Subtract "man." Add "woman." The nearest word to the result? "Queen." The vectors learned, without any explicit supervision, that royalty and gender are separate dimensions of meaning. And you can do arithmetic on them.
This discovery — that word analogy tests could be solved by simple vector addition and subtraction — was one of the most surprising results in NLP. Mikolov et al. (2013) showed that trained Word2Vec embeddings consistently captured these linear relationships:
| Analogy | Vector arithmetic | Nearest result |
|---|---|---|
| man : woman :: king : ? | v(king) − v(man) + v(woman) | queen |
| France : Paris :: Japan : ? | v(Paris) − v(France) + v(Japan) | Tokyo |
| big : bigger :: small : ? | v(bigger) − v(big) + v(small) | smaller |
| walk : walked :: swim : ? | v(walked) − v(walk) + v(swim) | swam |
The key insight is that embeddings encode relationships as directions. The vector from "man" to "woman" points in a "gender direction." The vector from "king" to "queen" points in the same direction. So v(king) − v(man) ≈ v(queen) − v(woman), which rearranges to v(king) − v(man) + v(woman) ≈ v(queen).
Geometrically, this means four words related by two consistent relationships form a parallelogram in embedding space. The "man→woman" edge is parallel to the "king→queen" edge. The "man→king" edge (royalty direction) is parallel to the "woman→queen" edge.
To find the word nearest to a query vector, we use cosine similarity:

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

This measures the angle between two vectors, ignoring magnitude. It ranges from −1 (opposite) through 0 (orthogonal) to +1 (identical direction). For the analogy "man : woman :: king : ?", we compute q = v(king) − v(man) + v(woman), then find argmax_w cos(q, v(w)), excluding the input words.
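```python
import numpy as np

def analogy(a, b, c, embeddings, vocab):
    # a : b :: c : ?  ->  ? = b - a + c
    # embeddings: [V, d] array; vocab: dict mapping word -> row index
    query = embeddings[vocab[b]] - embeddings[vocab[a]] + embeddings[vocab[c]]

    # Cosine similarity against every word
    norms = np.linalg.norm(embeddings, axis=1)
    sims = embeddings @ query / (norms * np.linalg.norm(query) + 1e-10)

    # Exclude input words so they can never be returned
    for w in [a, b, c]:
        sims[vocab[w]] = -np.inf

    # Map the best row index back to its word
    index_to_word = {i: w for w, i in vocab.items()}
    return index_to_word[int(np.argmax(sims))]
```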
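To try the idea without downloading pre-trained vectors, the sketch below uses five hand-built 2-D vectors in which one axis stands for royalty and the other for gender. These vectors are an illustrative assumption, not learned embeddings, and `solve_analogy` is a hypothetical name.

```python
import numpy as np

# Hand-built 2-D vectors (axis 0: royalty, axis 1: gender), not learned ones.
words = ["man", "woman", "king", "queen", "apple"]
E = np.array([
    [0.1, 1.0],    # man
    [0.1, -1.0],   # woman
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [-1.0, 0.1],   # apple
])
index = {w: i for i, w in enumerate(words)}

def cosine(u, M):
    """Cosine similarity between vector u and every row of M."""
    return M @ u / (np.linalg.norm(M, axis=1) * np.linalg.norm(u) + 1e-10)

def solve_analogy(a, b, c):
    """a : b :: c : ?  ->  nearest word to v(b) - v(a) + v(c)."""
    query = E[index[b]] - E[index[a]] + E[index[c]]
    sims = cosine(query, E)
    for w in (a, b, c):
        sims[index[w]] = -np.inf   # never return an input word
    return words[int(np.argmax(sims))]

answer = solve_analogy("man", "woman", "king")
print(answer)  # queen
```

Because the toy vectors form an exact parallelogram, the query lands precisely on "queen." With real embeddings the match is only approximate, which is why the nearest-neighbor search is needed.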
Select an analogy to see vector arithmetic visualized as a parallelogram. The dashed arrow shows the query vector; the nearest word to its tip is the answer.
Time to get your hands dirty. Below is a 2D projection of a word embedding space with 60+ words. Each dot is a word, colored by category: animals, countries, verbs, professions, food & drink, and adjectives.
Click any word to see its 5 nearest neighbors highlighted with connecting lines. The distances shown are cosine similarities — higher means more similar. Notice how words cluster by meaning: animals near animals, countries near countries.
Then try Analogy Mode. Click three words to perform a − b + c arithmetic. The predicted answer appears as a starred dot. Does the parallelogram intuition hold up?
Use the category toggles to focus on specific groups. Search for a word by name. Drag to pan, scroll to zoom (or pinch on mobile).
Click a word to see nearest neighbors. Use "Analogy Mode" to test vector arithmetic.
You've built beautiful word vectors. But how do you KNOW they're good? And what happens when they encode human biases? "Man is to computer programmer as woman is to homemaker" — that's a real result from Word2Vec trained on Google News. The vectors didn't invent that bias. They faithfully learned it from the data.
Intrinsic evaluation tests embeddings in isolation. Word analogy tests ("king:queen::man:?"), word similarity benchmarks (SimLex-999, WordSim-353), and clustering quality. These are fast to compute and provide a sanity check. But they have a dangerous limitation: good intrinsic scores don't guarantee good downstream performance.
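Word similarity benchmarks are typically scored with Spearman rank correlation between human ratings and the model's cosine similarities. The sketch below implements that score for tie-free data. The four rating pairs are made-up numbers, not values from SimLex-999 or WordSim-353.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up numbers: human similarity ratings vs. model cosine similarities
# for the same four word pairs.
human = [9.2, 7.5, 3.1, 0.8]
model = [0.81, 0.62, 0.70, 0.05]
rho = spearman(human, model)
print(rho)
```

A score of 1 would mean the model orders every pair exactly as humans do. Here the model swaps the two middle pairs, so the correlation falls below 1.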
Extrinsic evaluation tests embeddings inside a real system. Does switching from GloVe to Word2Vec improve your named entity recognizer? Does it improve your sentiment classifier? This is slower and noisier, but it's the only test that matters for production.
Bolukbasi et al. (2016) demonstrated that Word2Vec embeddings trained on Google News encode systematic gender stereotypes. The embedding space has a "gender direction" (the vector from "he" to "she"), and many occupation words are displaced along this direction in stereotypical ways:
| Word | Closer to "he" | Closer to "she" |
|---|---|---|
| programmer | ✓ | |
| homemaker | | ✓ |
| doctor | ✓ | |
| nurse | | ✓ |
| architect | ✓ | |
| librarian | | ✓ |
The embeddings aren't "wrong" — they accurately reflect the biases in the training data. But when these embeddings are used in hiring algorithms, search rankings, or loan applications, they amplify and perpetuate existing societal biases. A resume screening tool using biased embeddings might rank "he programmed in C++" higher than "she programmed in C++" even though they describe the same skill.
Hard debiasing (Bolukbasi et al., 2016) identifies the "gender direction" via PCA on gendered word pairs (he/she, man/woman, king/queen), then removes the component along that direction from words that should be gender-neutral, so they sit equidistant from both ends of each gendered pair. Conceptually: if "doctor" is currently displaced toward "he," move it to the midplane so it's equally close to both.
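The projection step can be sketched in a few lines. This is a simplification under stated assumptions: the gender direction comes from a single he/she difference instead of PCA over many pairs, the 3-D vectors are invented, and only the projection is shown, not the full published procedure.

```python
import numpy as np

def neutralize(v, g):
    """Remove the component of v that lies along direction g."""
    g = g / np.linalg.norm(g)
    return v - (v @ g) * g

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-D vectors, invented for illustration.
he = np.array([1.0, 0.2, 0.0])
she = np.array([-1.0, 0.2, 0.0])
doctor = np.array([0.4, 0.5, 0.7])

g = he - she                       # single-pair stand-in for the PCA direction
doctor_db = neutralize(doctor, g)

print(cos(doctor, he), cos(doctor, she))        # unequal before
print(cos(doctor_db, he), cos(doctor_db, she))  # equal after
```

After the projection, "doctor" has no component along the gender direction and is equally similar to "he" and "she." The equality holds here because the toy "he" and "she" vectors are mirror images across the midplane.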
This works for the specific bias dimension you identify, but critics note that other biases may remain in dimensions you didn't think to test. Bias in embeddings is an active research area with no perfect solution.
The vertical axis represents the "gender direction." Occupations displaced upward are closer to "he," downward closer to "she." Toggle between biased (2013) and debiased views. Drag the slider to see partial debiasing.
Word vectors are the foundation. Every neural NLP model — from simple sentiment classifiers to GPT-4 — starts by converting words (or subwords) into dense vectors. What we covered in this lesson is the first generation: static embeddings where each word gets one vector regardless of context.
The next generation — contextual embeddings (ELMo, BERT, GPT) — gives each word a different vector depending on its sentence context. "Bank" near "river" gets a different embedding than "bank" near "money." But the training principle is the same: predict words from their neighbors.
This lesson covered the concepts. For the full mathematical and experimental details, explore these papers in the Veanors section:
| Paper | Key contribution |
|---|---|
| Word2Vec (Mikolov 2013) | CBOW and Skip-gram architectures |
| Negative Sampling (Mikolov 2013) | Efficient training via negative samples |
| GloVe (Pennington 2014) | Co-occurrence matrix factorization |
| Tuning, Not Model (Levy 2015) | Hyperparameters matter more than architecture |
| Evaluation Methods (Schnabel 2015) | How to evaluate embeddings properly |
| Why Word2Vec Works (Arora 2016) | Theoretical analysis of Skip-gram |
| Polysemy as Superposition (Arora 2018) | Multiple meanings as vector sums |
| Optimal Dimensions (Yin 2018) | How many dimensions do you need? |
| Method | Year | Type | Contextual? | Handles polysemy? |
|---|---|---|---|---|
| Word2Vec | 2013 | Prediction | No | No |
| GloVe | 2014 | Count + Prediction | No | No |
| FastText | 2016 | Prediction (subword) | No | No (but handles OOV) |
| ELMo | 2018 | BiLSTM language model | Yes | Yes |
| BERT | 2019 | Transformer MLM | Yes | Yes |
These lessons connect directly to what you've learned: