CS224N Lecture 2

Word Vectors

Representing meaning as geometry — how machines learn that "cat" and "kitten" are neighbors.

Prerequisites: L01 History (recommended) + basic linear algebra (dot products). That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Vectors?

You type "cat" into a search engine. It returns results about cats, kittens, felines, and "adopt a pet." But to a computer, "cat" is just number 4,817 in a 50,000-word dictionary. And "kitten" is number 23,401. Those two numbers are just as far apart as "cat" and "quantum." The computer has no idea they're related.

This is the fundamental problem of word representation. How do you encode a word so that a machine can tell which words are similar and which are not?

The naive approach is called one-hot encoding. Each word gets a vector with a single 1 and all other entries 0. If your vocabulary has 50,000 words, then "cat" is a 50,000-dimensional vector with a 1 in position 4,817 and zeros everywhere else. "Dog" has a 1 in position 12,045. "Quantum" has a 1 in position 38,772.

Now try to compute similarity. The dot product of any two distinct one-hot vectors is zero, because they never have a 1 in the same position. Every word is equally distant from every other word: "cat" is as far from "dog" as from "democracy." With no notion of similarity to build on, downstream tasks such as search, translation, and question answering become far harder.
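A quick check makes this concrete. The sketch below uses a made-up five-word vocabulary (the indices are arbitrary assumptions, not the 50,000-word dictionary above) and confirms that the matrix of pairwise dot products between one-hot vectors is the identity: 1 for a word with itself, 0 for every other pair.

```python
import numpy as np

# Hypothetical toy vocabulary; the indices are arbitrary.
vocab = {"cat": 0, "dog": 1, "kitten": 2, "quantum": 3, "democracy": 4}
V = len(vocab)

def one_hot(word):
    """Return a V-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(V)
    vec[vocab[word]] = 1.0
    return vec

# Stack all one-hot vectors and compute every pairwise dot product at once.
vectors = np.stack([one_hot(w) for w in vocab])
gram = vectors @ vectors.T

print(gram)                                  # identity matrix
print(one_hot("cat") @ one_hot("kitten"))    # 0.0
print(one_hot("cat") @ one_hot("quantum"))   # 0.0
```

"Kitten" scores exactly the same against "cat" as "quantum" does, which is the whole problem.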

The one-hot catastrophe: In one-hot space, "cat" is just as far from "dog" as from "quantum." There are no neighbors, no clusters, no relationships. Every word is an island. Dense embeddings put similar words on the same continent.

The solution is dense word vectors (also called word embeddings). Instead of a 50,000-dimensional vector with one 1, we represent each word as a short, dense vector — say 300 numbers. These numbers are learned from data, and the magic is: words that appear in similar contexts end up with similar vectors. "Cat" and "kitten" land close together. "Cat" and "quantum" end up far apart.

This lesson covers how those dense vectors are learned. We'll build from the raw insight ("words that appear in similar contexts have similar meanings") through the two dominant algorithms (Word2Vec and GloVe), to the surprising emergent property that makes embeddings famous: vector arithmetic. King − man + woman ≈ queen.

What this lesson covers: Co-occurrence and the distributional hypothesis. Word2Vec (CBOW and Skip-gram). Negative sampling. GloVe. Word analogies. Evaluation and bias in embeddings. By the end, you'll understand how every NLP system converts raw text into numbers.
One-Hot vs. Embedding Space

Toggle between one-hot representation (all words equidistant on a circle) and embedding space (similar words cluster together). Hover over any word to see its distances to other words.

Why can't we compute meaningful similarity between one-hot vectors?

Chapter 1: Co-occurrence — You Are the Company You Keep

"I adopted a cute _____ from the shelter." You know the blank is "cat" or "dog" or "rabbit." Not "carburetor." Not "theorem." The surrounding words act like a fingerprint for meaning. This observation — that a word's meaning is determined by the words that appear near it — is called the distributional hypothesis.

The idea goes back to linguist J.R. Firth, who wrote in 1957: "You shall know a word by the company it keeps." If "coffee" and "tea" consistently appear near "drink," "morning," "cup," and "hot," then they must mean similar things. If "bank" appears near both "river" and "money," that's a clue it has multiple meanings.

To make this concrete, we define a context window: a fixed number of words before and after a target word. With a window size of 2, in the sentence "The cat sat on the mat," the context of "sat" is {"The", "cat", "on", "the"}. The context of "cat" is {"The", "sat", "on"}, because only one word precedes it.

We then build a co-occurrence matrix. Each row is a word, each column is a word, and each entry counts how many times those two words appeared within the same context window across a large corpus. If "coffee" and "cup" co-occur 847 times, that number goes in cell (coffee, cup).
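Counting co-occurrences takes only a few lines. This is a minimal sketch, assuming whitespace tokenization, lowercasing, and a symmetric window; the function name `cooccurrence_counts` is illustrative, and a real pipeline would run over a large corpus rather than one sentence.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (target, neighbor) pairs within `window` words on each side."""
    counts = Counter()
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, tokens[j])] += 1
    return counts

tokens = "The cat sat on the mat".lower().split()
counts = cooccurrence_counts(tokens, window=2)

print(counts[("sat", "the")])   # 2: "the" appears on both sides of "sat"
print(counts[("sat", "cat")])   # 1
print(counts[("cat", "mat")])   # 0: "mat" is outside the window of "cat"
```

Each `(row word, column word)` key corresponds to one cell of the matrix described above; a sparse dictionary is used here because most cells of a real co-occurrence matrix are zero.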

The distributional hypothesis is the foundation of ALL modern embeddings. Word2Vec, GloVe, ELMo, BERT, GPT — every method that learns word representations is exploiting the same core idea: meaning is revealed by context. The methods differ in HOW they exploit it, but the underlying principle is always Firth's insight from 1957.

There's a problem with raw co-occurrence counts, though. Common words like "the," "is," and "of" co-occur with everything. They dominate the matrix without providing useful signal. The word "the" might co-occur with "coffee" 5,000 times — but that tells you nothing about coffee. Later methods (TF-IDF, PMI, GloVe) address this by downweighting frequent co-occurrences. But the basic matrix already contains a surprising amount of structure.
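To see how reweighting tames frequent words, here is a minimal sketch of positive PMI (PPMI), a common variant of the PMI weighting mentioned above. The counts are invented for illustration (rows "coffee" and "car", columns "the", "cup", "engine"), and the function name `ppmi` is an assumption of this example.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI: max(0, log(P(i,j) / (P(i) * P(j)))) for a count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_ij = counts / counts.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)    # row marginals
    p_j = p_ij.sum(axis=0, keepdims=True)    # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0             # zero counts contribute nothing
    return np.maximum(pmi, 0.0)

# Hypothetical counts. Rows: coffee, car. Columns: the, cup, engine.
counts = np.array([[5000.0, 800.0,   5.0],
                   [5000.0,   5.0, 800.0]])
weights = ppmi(counts)
print(weights.round(3))
```

Even though "the" has by far the largest raw counts, its PPMI weight is zero for both rows, because it co-occurs with everything about as often as chance predicts. The informative cells ("coffee"/"cup" and "car"/"engine") are the only ones that keep a positive weight.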

If you take each row of this matrix as a word's vector, words with similar row patterns will end up close in vector space. Not optimally close, but close. The entire field of word embeddings is about finding better ways to extract and compress the signal in this matrix.

Context Window Scanner

Slide the center word through the sentence. The context window highlights neighbors. Co-occurrence counts accumulate in the matrix below. Switch sentences with the buttons.

If "coffee" and "tea" appear in similar contexts, what does the distributional hypothesis predict?

Chapter 2: Word2Vec CBOW — Context Predicts the Word

I give you the four words that surround a gap in "The cat ___ on the mat": "the," "cat," "on," "the." Which word goes in the middle? "Sat." You just did Continuous Bag of Words (CBOW). The model sees context words and tries to predict the center word. By doing this millions of times on a large corpus, the model learns embeddings that encode meaning.

Tomas Mikolov introduced Word2Vec in 2013, and CBOW is one of its two training objectives. Here's how it works, step by step:

Step 1: Embedding lookup. Each context word is converted from a one-hot vector to a dense embedding using the embedding matrix W. If our vocabulary has V words and our embedding dimension is d, then W is a [V × d] matrix. Conceptually, we multiply the one-hot vector for word i by W, but that product simply selects row i of W. So in practice this is a table lookup, and no multiplication is needed.

Step 2: Average. All context embeddings are averaged into a single vector. If we have 4 context words, each a 300-dimensional vector, the average is also 300-dimensional. This is the "bag of words" part — order doesn't matter.

Step 3: Projection. The averaged vector is multiplied by a second matrix W′ of shape [d × V], producing a V-dimensional score vector. Each entry is a score for how likely that vocabulary word is to be the center word.

Step 4: Softmax. The scores are passed through a softmax function to produce a probability distribution over the entire vocabulary. The word with the highest probability is the model's prediction.
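Step 1's claim that a one-hot multiplication and a row lookup give the same result can be checked directly. This is a minimal sketch with tiny, arbitrary sizes (V = 6, d = 3) rather than the lesson's 50,000 × 300 matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                       # tiny sizes, chosen only for illustration
W = rng.normal(size=(V, d))       # embedding matrix [V, d]

i = 4                             # index of the word we want to embed
one_hot = np.zeros(V)
one_hot[i] = 1.0

via_matmul = one_hot @ W          # [d], costs V * d multiply-adds
via_lookup = W[i]                 # [d], just reads row i

print(np.allclose(via_matmul, via_lookup))   # True
```

Both routes return the same vector, which is why implementations index into W instead of materializing one-hot vectors.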

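$$p(w_{\text{center}} \mid \text{context}) = \operatorname{softmax}\!\left( W'^{\top} \cdot \frac{1}{|C|} \sum_{c \in C} W[c] \right)$$

Here W[c] is row c of W treated as a d-dimensional column vector, so multiplying by the transpose of the [d × V] matrix W′ yields one score per vocabulary word.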

The loss is negative log-likelihood: we want to maximize the probability of the true center word. Gradients flow back through W′ and W, updating both matrices. After training, we throw away W′ and keep W — the rows of W are our word embeddings.
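The training step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original Word2Vec implementation: it uses the full softmax, plain SGD, a tiny vocabulary (V = 10, d = 4), and a single made-up training example. The function name `cbow_step` and the learning rate are choices made for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # subtract the max for numerical stability
    return e / e.sum()

def cbow_step(W, Wp, context_ids, center_id, lr=0.2):
    """One SGD step on -log p(center | context). Updates W and Wp in place."""
    avg = W[context_ids].mean(axis=0)        # [d] averaged context
    probs = softmax(avg @ Wp)                # [V]
    loss = -np.log(probs[center_id])

    # Gradient of the loss w.r.t. the scores is (probs - one_hot(center)).
    dscores = probs.copy()
    dscores[center_id] -= 1.0
    dWp = np.outer(avg, dscores)             # [d, V]
    davg = Wp @ dscores                      # [d], computed before Wp changes

    Wp -= lr * dWp
    # Each context row gets an equal 1/|C| share; subtract.at handles repeated ids.
    np.subtract.at(W, context_ids, lr * davg / len(context_ids))
    return loss                              # loss measured before the update

rng = np.random.default_rng(0)
V, d = 10, 4                                 # tiny sizes for illustration
W = rng.normal(scale=0.1, size=(V, d))       # embedding matrix [V, d]
Wp = rng.normal(scale=0.1, size=(d, V))      # projection matrix [d, V]

losses = [cbow_step(W, Wp, [1, 2, 4, 5], 3) for _ in range(200)]
print(losses[0], losses[-1])                 # the loss should fall over the steps
```

Repeating the step on the same example drives the loss down, which confirms that gradients reach both matrices. Real training would loop over every window position in a corpus instead.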

CBOW throws away word order — it's a BAG of words. "The cat sat on" and "on sat cat the" produce the same average embedding. This seems like a flaw, but it's actually a feature: by ignoring order, the model is forced to learn meaning, not position. The embeddings capture "what kind of word goes here?" not "what word goes in position 3?"

Concrete Shapes

Let's nail down every tensor shape. Suppose V = 50,000 (vocabulary size), d = 300 (embedding dimension), and the context window has 4 words:

| Object | Shape | What it is |
| --- | --- | --- |
| One-hot input | [V] = [50000] | Each context word as one-hot |
| W (embedding) | [V, d] = [50000, 300] | Embedding matrix (the thing we want) |
| Context embeddings | [4, d] = [4, 300] | 4 looked-up rows of W |
| Average | [d] = [300] | Mean of context embeddings |
| W′ (projection) | [d, V] = [300, 50000] | Maps back to vocabulary space |
| Scores | [V] = [50000] | Raw logits for each word |
| Output | [V] = [50000] | Softmax probabilities |

From-Scratch Code

```python
import numpy as np

# Shapes: V=50000, d=300, window=2 (4 context words)
V, d = 50000, 300
W  = np.random.randn(V, d) * 0.01   # embedding matrix [V, d]
Wp = np.random.randn(d, V) * 0.01   # projection matrix [d, V]

def softmax(x):
    # Subtract the max before exponentiating to avoid overflow
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_forward(context_ids):
    # context_ids: list of 4 integer word indices
    embeds = W[context_ids]            # [4, 300]: lookup, not multiply
    avg    = embeds.mean(axis=0)       # [300]: average context
    scores = avg @ Wp                  # [50000]: project to vocab
    probs  = softmax(scores)           # [50000]: probability distribution
    return probs
```
CBOW Forward Pass

Click "Next Step" to walk through the 4 stages of a CBOW forward pass. Tensor shapes are shown at each stage.

In CBOW, what is the model trying to predict?

Chapter 3: Skip-gram — One Word Describes Its Neighborhood

CBOW asks: "Given the neighborhood, who lives here?" Skip-gram flips the question: "Given who lives here, describe the neighborhood." Instead of predicting the center word from context, we predict each context word from the center word.

Given the center word "sat" and a window of 2, Skip-gram generates four training pairs: (sat → the), (sat → cat), (sat → on), (sat → the). Each pair asks: "Given 'sat,' can you predict 'cat'?" "Given 'sat,' can you predict 'on'?" The model sees only one word at a time and must predict each neighbor independently.

This seemingly small change has a profound consequence. CBOW generates one training example per window position: (context → center). Skip-gram generates 2×window training examples: one for each (center → context) pair. With a window of 5, that's 10 training pairs per position instead of 1.
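The difference in how the two methods slice a sentence into training examples can be sketched as below. The helper names `skipgram_pairs` and `cbow_examples` are illustrative, and the sketch assumes whitespace tokenization and lowercasing. Windows are clipped at sentence boundaries, so edge positions yield fewer than 2 × window pairs.

```python
def window_indices(i, n, window):
    """Indices of the neighbors of position i, clipped to the sentence."""
    return [j for j in range(max(0, i - window), min(n, i + window + 1)) if j != i]

def skipgram_pairs(tokens, window=2):
    """Return one (center, context) pair per neighbor inside the window."""
    return [(center, tokens[j])
            for i, center in enumerate(tokens)
            for j in window_indices(i, len(tokens), window)]

def cbow_examples(tokens, window=2):
    """Return one (context_list, center) example per position."""
    return [([tokens[j] for j in window_indices(i, len(tokens), window)], center)
            for i, center in enumerate(tokens)]

tokens = "The cat sat on the mat".lower().split()

print([p for p in skipgram_pairs(tokens) if p[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
print([e for e in cbow_examples(tokens) if e[1] == "sat"])
# [(['the', 'cat', 'on', 'the'], 'sat')]
print(len(skipgram_pairs(tokens)), len(cbow_examples(tokens)))   # 18 6
```

For this six-word sentence, Skip-gram produces 18 pairs against CBOW's 6 examples. The ratio approaches 2 × window on longer text, where boundary effects matter less.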

$$p(w_{\text{context}} \mid w_{\text{center}}) = \operatorname{softmax}\!\left( W'^{\top} \cdot W[\text{center}] \right)$$

The objective maximizes the probability of every observed (center, context) pair:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$$

Where T is the total number of words in the corpus and m is the window size. This is just the average negative log-probability of predicting each context word.
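As a sanity check on the formula, the objective can be evaluated directly on a tiny corpus. This is a sketch under assumptions: it uses the full softmax (affordable only because V is tiny), skips neighbors that fall outside the corpus, and the function name `skipgram_objective` is invented for this example.

```python
import numpy as np

def skipgram_objective(token_ids, W, Wp, m=2):
    """Sum of -log p(w_{t+j} | w_t) over all in-window pairs, divided by T."""
    T = len(token_ids)
    total = 0.0
    for t, center in enumerate(token_ids):
        scores = W[center] @ Wp                              # [V] logits
        log_probs = scores - np.logaddexp.reduce(scores)     # stable log-softmax
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < T:                    # skip out-of-range neighbors
                total -= log_probs[token_ids[t + j]]
    return total / T

# "the cat sat on the mat" as ids over a 5-word vocabulary (assumed mapping).
token_ids = [0, 1, 2, 3, 0, 4]
V, d = 5, 3
W = np.zeros((V, d))      # all-zero parameters give a uniform prediction
Wp = np.zeros((d, V))

print(skipgram_objective(token_ids, W, Wp))   # equals 3 * log(5) for this setup
```

With all-zero parameters every word gets probability 1/V, so each of the 18 in-window pairs costs log 5 and the objective is 18 · log(5) / 6. Training lowers J by raising the probability of the pairs that actually occur.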

Why Skip-gram works better for rare words: A rare word like "aardvark" might appear only 5 times in the corpus. Under CBOW, it is the prediction target in just 5 training examples (one per occurrence). Under Skip-gram, with window = 5, it is the center word in 50 examples (10 per occurrence). More gradient updates mean a better embedding for "aardvark," which is why Skip-gram tends to outperform CBOW on rare words.

CBOW vs. Skip-gram: Side by Side

| Property | CBOW | Skip-gram |
| --- | --- | --- |
| Input | Context words (many) | Center word (one) |
| Output | Center word (one) | Context words (many) |
| Training pairs per position | 1 | 2 × window |
| Speed | Faster (fewer examples) | Slower (more examples) |
| Rare words | Worse (few updates) | Better (more updates) |
| Frequent words | Better (averaging smooths) | Okay |

In practice, Skip-gram with negative sampling (which we'll cover next chapter) became the default Word2Vec configuration. Most pre-trained Word2Vec embeddings you'll find online use this setup.

Skip-gram vs. CBOW

Click words in the sentence to see training pairs generated by each method. Left: CBOW (many → one). Right: Skip-gram (one → many).

Why does Skip-gram work better for rare words than CBOW?

Chapter 4: Negative Sampling — The Trick That Makes It Possible

There's a fatal flaw in everything we've described. The softmax denominator sums over EVERY word in the vocabulary:

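$$p(w_o \mid w_c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}$$

Here $v_c$ is the center word's embedding (a row of W) and $u_w$ is the output vector for word w (a column of W′).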

V is 50,000 or more. For every single training example, you compute 50,000 dot products, exponentiate them all, and sum them up. Then you compute gradients for all 50,000 outputs. That's not just slow; it's impractical at scale. A corpus of a billion words with a vocabulary of 100,000 means 100 trillion dot products just for the denominators, even with only one training example per token.

Negative sampling (Mikolov et al., 2013) sidesteps this entirely. Instead of asking "what's the probability of the correct word among all V words?", it asks a simpler question: "Can you tell the correct word apart from K random words?"

Here's the reformulation. Given a center word wc and a true context word wo, maximize:

$$J = \log \sigma(u_o^{\top} v_c) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P(w)} \left[ \log \sigma(-u_k^{\top} v_c) \right]$$

The first term says: make the dot product of the true pair ($w_c$, $w_o$) large and positive, so that $\sigma(u_o^{\top} v_c) \to 1$. The second term says: for K randomly sampled "negative" words, make their dot products with $v_c$ large and negative, so that $\sigma(-u_k^{\top} v_c) \to 1$ (equivalently, $\sigma(u_k^{\top} v_c) \to 0$).

Where σ is the sigmoid function: σ(x) = 1 / (1 + exp(−x)). It maps any real number into the open interval (0, 1), so its output can be read as the probability that a pair is real, which suits binary classification.

The core insight: Negative sampling converts a V-class classification problem into K+1 binary classification problems. Instead of asking "which of 50,000 words is correct?" it asks "is this pair real or fake?" K+1 dot products instead of V. With K = 15, that's a 3,000× speedup. The approximation works because most words are irrelevant for any given context — we only need to push apart the ones we sample.
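The arithmetic behind these figures is easy to verify. The short sketch below only recomputes numbers quoted in this chapter (V = 50,000 with K = 15, and a billion-token corpus with a 100,000-word vocabulary); the variable names are illustrative.

```python
V = 50_000          # vocabulary size used in the running example
K = 15              # negative samples per training pair

full_softmax_dots = V            # one dot product per vocabulary word
neg_sampling_dots = K + 1        # one true pair plus K negatives
speedup = full_softmax_dots / neg_sampling_dots

corpus_tokens = 1_000_000_000    # a billion-word corpus
large_vocab = 100_000
denominator_dots = corpus_tokens * large_vocab   # one softmax denominator per token

print(speedup)            # 3125.0
print(denominator_dots)   # 100000000000000 (100 trillion)
```

The speedup counts dot products only; actual wall-clock gains also depend on sampling overhead and memory access patterns.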

How Negatives Are Sampled

Negative words are sampled from a noise distribution $P(w) = \text{count}(w)^{3/4} / Z$, where Z normalizes the probabilities so they sum to 1. The 3/4 exponent, chosen empirically by Mikolov et al., sits between uniform sampling (exponent 0) and frequency-proportional sampling (exponent 1). Pure frequency-proportional sampling would oversample "the" and "is." Uniform sampling would waste time on ultra-rare words. The 3/4 power flattens the distribution, giving rare words a fighting chance while still preferring common ones.
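The effect of the exponent is easiest to see on numbers. This is a minimal sketch with three invented word counts standing in for a frequent, a mid-frequency, and a rare word; `noise_distribution` is a name chosen for this example.

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Normalize count**power into a probability distribution."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

# Hypothetical counts for "the", "coffee", "aardvark".
counts = np.array([1_000_000, 10_000, 100])

unigram  = noise_distribution(counts, power=1.0)    # frequency-proportional
smoothed = noise_distribution(counts, power=0.75)   # the 3/4 power
uniform  = noise_distribution(counts, power=0.0)    # every word equally likely

print(unigram.round(5))
print(smoothed.round(5))
print(uniform.round(5))

# Draw K = 5 negative word indices from the smoothed distribution.
rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=smoothed)
print(negatives)
```

Under the 3/4 power the rare word's sampling probability rises roughly tenfold compared with the raw unigram distribution, while the ordering from frequent to rare is preserved.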

Mikolov found that K = 5–20 works well for small datasets, and K = 2–5 suffices for large ones. More data means less noise in the gradients, so fewer negatives are needed.

From-Scratch Implementation

```python
import numpy as np

def log_sigmoid(x):
    # log σ(x) = -log(1 + exp(-x)); logaddexp avoids overflow for large |x|
    return -np.logaddexp(0.0, -x)

def neg_sampling_loss(center_vec, context_vec, neg_vecs):
    # center_vec:  [d]     embedding of center word
    # context_vec: [d]     embedding of true context word
    # neg_vecs:    [K, d]  embeddings of K negative samples

    # Positive pair: push dot product UP
    pos_loss = -log_sigmoid(context_vec @ center_vec)           # scalar

    # Negative pairs: push dot products DOWN
    neg_loss = -log_sigmoid(-(neg_vecs @ center_vec)).sum()     # sum over K

    # Minimizing this loss maximizes the objective J above
    return pos_loss + neg_loss
```
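The loss only tells us how wrong the model is; training also needs gradients. The sketch below derives them by hand from the negative-sampling objective and verifies one of them numerically. It is self-contained, redefines its own helpers, and uses random vectors as stand-ins for real embeddings; the function names are choices made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_loss(center_vec, context_vec, neg_vecs):
    """Negative-sampling loss: -log σ(u_o·v_c) - Σ_k log σ(-u_k·v_c)."""
    pos = -np.log(sigmoid(context_vec @ center_vec))
    neg = -np.log(sigmoid(-(neg_vecs @ center_vec))).sum()
    return pos + neg

def ns_grads(center_vec, context_vec, neg_vecs):
    """Gradients of ns_loss w.r.t. the center, context, and negative vectors."""
    pos_err = sigmoid(context_vec @ center_vec) - 1.0     # scalar in (-1, 0)
    neg_err = sigmoid(neg_vecs @ center_vec)              # [K], each in (0, 1)
    grad_center = pos_err * context_vec + neg_err @ neg_vecs   # [d]
    grad_context = pos_err * center_vec                   # [d]
    grad_negs = np.outer(neg_err, center_vec)             # [K, d]
    return grad_center, grad_context, grad_negs

rng = np.random.default_rng(0)
v_c, u_o = rng.normal(size=4), rng.normal(size=4)
u_neg = rng.normal(size=(3, 4))                           # K = 3 negatives

# Central finite differences on the center vector as an independent check.
eps = 1e-6
numeric = np.array([
    (ns_loss(v_c + eps * e, u_o, u_neg) - ns_loss(v_c - eps * e, u_o, u_neg)) / (2 * eps)
    for e in np.eye(4)
])
analytic = ns_grads(v_c, u_o, u_neg)[0]
print(np.allclose(analytic, numeric, atol=1e-5))          # True
```

Only K + 2 vectors receive a gradient per training pair (the center, the true context, and the K negatives), which is where the savings over the full softmax come from.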
Negative Sampling Visualizer

The center word (purple) should be pulled close to the true context (green) and pushed away from negative samples (red). Click "Sample Negatives" to draw new random negatives. Adjust K to change the number of negatives.

What does negative sampling replace?

Chapter 5: GloVe — Counting Meets Prediction

By 2014, there were two rival philosophies for building word vectors. The count-based camp said: build a big co-occurrence matrix, reduce its dimensionality (via SVD or similar), and use the compressed vectors. The prediction-based camp (Word2Vec) said: train a neural network to predict context words. Which was better?

Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford answered: the two camps are closer than they look. Their GloVe paper (Global Vectors for Word Representation) argued that prediction-based objectives like Skip-gram can be rewritten in terms of global co-occurrence statistics, so both families ultimately draw on the same counts. They then built a method that fits those statistics directly, combining the efficiency of counting with the quality of prediction.

The Key Insight: Ratios, Not Counts

Consider the words "ice" and "steam." Both co-occur with "water." But the ratio of their co-occurrences with a third word reveals the relationship:

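| Probe word k | $P(k \mid \text{ice})$ | $P(k \mid \text{steam})$ | Ratio $P(k \mid \text{ice}) / P(k \mid \text{steam})$ |
| --- | --- | --- | --- |
| solid | $1.9 \times 10^{-4}$ | $2.2 \times 10^{-5}$ | 8.9 (ice-related) |
| gas | $6.6 \times 10^{-5}$ | $7.8 \times 10^{-4}$ | 0.085 (steam-related) |
| water | $3.0 \times 10^{-3}$ | $2.2 \times 10^{-3}$ | 1.36 (both, neutral) |
| fashion | $1.7 \times 10^{-5}$ | $1.8 \times 10^{-5}$ | 0.96 (neither, neutral) |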

When the ratio is large (>>1), the probe word is ice-related. When it's small (<<1), it's steam-related. When it's ≈1, the probe is neutral. Raw counts can't distinguish these cases: "water" co-occurs often with both "ice" and "steam," so both raw counts are high. But the ratio tells you it's neutral.
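The ratio test can be reproduced from the probabilities in the table above. This is a small sketch; the cutoff of 2 (and its reciprocal) for calling a probe word "related" is an arbitrary threshold chosen for this example, not something GloVe defines. Ratios recomputed from the rounded table entries differ slightly from the published ratio column (for example about 8.6 rather than 8.9 for "solid").

```python
# Co-occurrence probabilities copied from the ice/steam table.
p_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
p_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

def classify(ratio, margin=2.0):
    """Label a probe word by its ratio; `margin` is an illustrative cutoff."""
    if ratio > margin:
        return "ice-related"
    if ratio < 1.0 / margin:
        return "steam-related"
    return "neutral"

ratios = {k: p_ice[k] / p_steam[k] for k in p_ice}
for probe, ratio in ratios.items():
    print(f"{probe:8s} ratio={ratio:6.3f}  {classify(ratio)}")
```

"Water" has the highest raw probabilities in both columns yet lands in the neutral band, which is exactly the distinction raw counts cannot make.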

GloVe's key insight: The RATIO of co-occurrence probabilities encodes meaning. Ratios discriminate; raw counts don't. GloVe's objective asks: find word vectors $w_i$ and $\tilde{w}_j$ such that $w_i^{\top} \tilde{w}_j$ approximates $\log X_{ij}$, the log co-occurrence count. Because a logarithm turns ratios into differences, differences between word vectors then capture those ratios. This is a weighted least-squares problem: no softmax, no sampling, just matrix factorization.

The GloVe Objective

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

This looks simple because it is. We want the dot product of two word vectors (plus bias terms) to equal the log of their co-occurrence count. The $f(X_{ij})$ term is a weighting function that prevents frequent pairs from dominating:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

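The paper's defaults are $x_{\max} = 100$ and $\alpha = 0.75$. The function caps the weight at 1 for pairs that co-occur 100 times or more and ramps up smoothly for less frequent pairs. It also gives pairs that never co-occur ($X_{ij} = 0$) a weight of 0, so the undefined $\log 0$ never enters the sum. Without a cap, extremely frequent pairs such as "the-the" would dominate the entire objective.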
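Both the weighting function and the objective fit in a few lines. This is a minimal sketch, assuming a small dense count matrix X (real implementations iterate over the nonzero entries of a sparse matrix and optimize with a method such as AdaGrad); the function names are illustrative.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha below x_max, and 1 at or above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted squared error between w_i·w~_j + b_i + b~_j and log X_ij."""
    X = np.asarray(X, dtype=float)
    nonzero = X > 0
    log_X = np.log(np.where(nonzero, X, 1.0))     # placeholder for X = 0, masked below
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    sq_err = (pred - log_X) ** 2
    return float((glove_weight(X) * sq_err * nonzero).sum())

print(glove_weight([0, 10, 100, 1000]))           # weights for rare to very frequent pairs

# A count matrix whose logs are exactly b_i + b~_j, so zero vectors fit it perfectly.
X = np.array([[1.0, 10.0], [10.0, 100.0]])
W = np.zeros((2, 3))
W_tilde = np.zeros((2, 3))
b = np.array([0.0, np.log(10.0)])
b_tilde = np.array([0.0, np.log(10.0)])
print(glove_loss(X, W, W_tilde, b, b_tilde))      # 0 up to floating-point error
```

A pair seen 10 times gets a weight of about 0.18, while pairs seen 100 or 1,000 times both get exactly 1, so no pair can count for more than the cap.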

GloVe Weighting Function

Drag the sliders to change $x_{\max}$ and α. Watch how the weighting curve changes: it controls how much influence frequent vs. rare co-occurrences have on the objective.

What problem does GloVe's weighting function $f(X_{ij})$ solve?

Chapter 6: Word Analogies — Arithmetic on Meaning

Take the vector for "king." Subtract "man." Add "woman." The nearest word to the result? "Queen." The vectors learned, without any explicit supervision, that royalty and gender are separate dimensions of meaning. And you can do arithmetic on them.

This discovery — that word analogy tests could be solved by simple vector addition and subtraction — was one of the most surprising results in NLP. Mikolov et al. (2013) showed that trained Word2Vec embeddings consistently captured these linear relationships:

| Analogy | Vector arithmetic | Nearest result |
| --- | --- | --- |
| man : woman :: king : ? | v(king) − v(man) + v(woman) | queen |
| France : Paris :: Japan : ? | v(Paris) − v(France) + v(Japan) | Tokyo |
| big : bigger :: small : ? | v(bigger) − v(big) + v(small) | smaller |
| walk : walked :: swim : ? | v(walked) − v(walk) + v(swim) | swam |

Why Does This Work?

The key insight is that embeddings encode relationships as directions. The vector from "man" to "woman" points in a "gender direction." The vector from "king" to "queen" points in the same direction. So v(king) − v(man) ≈ v(queen) − v(woman), which rearranges to v(king) − v(man) + v(woman) ≈ v(queen).

Geometrically, this means four words related by two consistent relationships form a parallelogram in embedding space. The "man→woman" edge is parallel to the "king→queen" edge. The "man→king" edge (royalty direction) is parallel to the "woman→queen" edge.
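The parallelogram is easiest to see with hand-built vectors. The sketch below is purely illustrative: it assigns made-up 2-D coordinates where one axis stands for "royalty" and the other for "gender." Learned embeddings only approximate this structure, and their directions are not aligned with coordinate axes.

```python
import numpy as np

# Hand-built toy vectors: axis 0 = "royalty", axis 1 = "gender". Not learned.
vec = {
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
}

gender_commoner = vec["woman"] - vec["man"]    # man -> woman edge
gender_royal    = vec["queen"] - vec["king"]   # king -> queen edge
royalty_male    = vec["king"] - vec["man"]     # man -> king edge
royalty_female  = vec["queen"] - vec["woman"]  # woman -> queen edge

print(np.array_equal(gender_commoner, gender_royal))   # True: parallel edges
print(np.array_equal(royalty_male, royalty_female))    # True: parallel edges

query = vec["king"] - vec["man"] + vec["woman"]
print(query, vec["queen"])                             # both are [1. 1.]
```

When opposite edges are equal as vectors, the analogy arithmetic lands exactly on the fourth corner. With real embeddings the edges are only roughly parallel, which is why the answer is taken to be the nearest word rather than an exact match.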

Embeddings encode relationships as directions. The "gender direction" is the same vector whether you compute king→queen, uncle→aunt, boy→girl, or actor→actress. The "country→capital" direction is the same for France→Paris, Japan→Tokyo, and Egypt→Cairo. These directions emerge automatically from co-occurrence patterns — nobody told the model about gender or geography.

The Math: Cosine Similarity

To find the word nearest to a query vector, we use cosine similarity:

cos(a, b) = (a · b) / (||a|| · ||b||)

This measures the angle between two vectors, ignoring magnitude. It ranges from −1 (opposite) through 0 (orthogonal) to +1 (identical direction). For the analogy "man : woman :: king : ?", we compute q = v(king) − v(man) + v(woman), then find $\arg\max_w \cos(q, v(w))$, excluding the three input words.

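```python
import numpy as np

def analogy(a, b, c, embeddings, vocab):
    # a:b :: c:?  →  ? = b - a + c
    # embeddings: [V, d] float array; vocab: dict mapping word -> row index
    query = embeddings[vocab[b]] - embeddings[vocab[a]] + embeddings[vocab[c]]

    # Cosine similarity against every word
    norms = np.linalg.norm(embeddings, axis=1)
    sims  = embeddings @ query / (norms * np.linalg.norm(query) + 1e-10)

    # Exclude input words so they can never be returned
    for w in (a, b, c):
        sims[vocab[w]] = -np.inf

    # Map the best row index back to its word (do not rely on dict order)
    index_to_word = {idx: word for word, idx in vocab.items()}
    return index_to_word[int(np.argmax(sims))]
```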
Analogy Calculator

Select an analogy to see vector arithmetic visualized as a parallelogram. The dashed arrow shows the query vector; the nearest word to its tip is the answer.

Why does v(king) − v(man) + v(woman) ≈ v(queen)?

Chapter 7: Embedding Explorer

Time to get your hands dirty. Below is a 2D projection of a word embedding space with 60+ words. Each dot is a word, colored by category: animals, countries, verbs, professions, food & drink, and adjectives.

Click any word to see its 5 nearest neighbors highlighted with connecting lines. The distances shown are cosine similarities — higher means more similar. Notice how words cluster by meaning: animals near animals, countries near countries.

Then try Analogy Mode. Click three words to perform a − b + c arithmetic. The predicted answer appears as a starred dot. Does the parallelogram intuition hold up?

Use the category toggles to focus on specific groups. Search for a word by name. Drag to pan, scroll to zoom (or pinch on mobile).

Look for "turkey" — it appears near both countries AND birds. This is polysemy: one word, multiple meanings. Static embeddings like Word2Vec and GloVe collapse all meanings into a single vector. The word "bank" (financial vs. river) has the same problem. Contextual embeddings (ELMo, BERT) solve this by giving each occurrence a different vector — but that's Lecture 8.
Embedding Explorer

Click a word to see nearest neighbors. Use "Analogy Mode" to test vector arithmetic.


Chapter 8: Evaluation & Pitfalls

You've built beautiful word vectors. But how do you KNOW they're good? And what happens when they encode human biases? "Man is to computer programmer as woman is to homemaker" — that's a real result from Word2Vec trained on Google News. The vectors didn't invent that bias. They faithfully learned it from the data.

Intrinsic vs. Extrinsic Evaluation

Intrinsic evaluation tests embeddings in isolation. Examples include word analogy tests ("king:queen::man:?"), word similarity benchmarks (SimLex-999, WordSim-353), and clustering quality. These are fast to compute and provide a sanity check. But they have a dangerous limitation: good intrinsic scores don't guarantee good downstream performance.
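A word similarity benchmark typically boils down to a rank correlation between model similarities and human ratings. The sketch below shows the mechanics only: the embeddings and the 0 to 10 human scores are invented for illustration and do not come from SimLex-999 or WordSim-353, and the hand-rolled `spearman` helper assumes there are no tied values.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(x, y):
    """Spearman rank correlation, assuming no tied values."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Hypothetical embeddings and hypothetical human similarity ratings.
emb = {
    "cat":     np.array([0.9, 0.1, 0.0]),
    "kitten":  np.array([0.8, 0.2, 0.1]),
    "dog":     np.array([0.6, 0.5, 0.0]),
    "quantum": np.array([0.0, 0.1, 0.9]),
}
human = {("cat", "kitten"): 9.0, ("cat", "dog"): 6.5, ("cat", "quantum"): 0.5}

model_scores = [cosine(emb[a], emb[b]) for a, b in human]
rho = spearman(model_scores, list(human.values()))
print(model_scores)
print(rho)
```

A correlation near 1 means the model ranks word pairs the way people do. It says nothing about whether the embeddings help a particular downstream system, which is the limitation noted above.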

Extrinsic evaluation tests embeddings inside a real system. Does switching from GloVe to Word2Vec improve your named entity recognizer? Does it improve your sentiment classifier? This is slower and noisier, but it's the only test that matters for production.

The evaluation trap: A team spends weeks optimizing embeddings for analogy accuracy. Their analogy score goes from 75% to 82%. They plug the new embeddings into their translation system. Performance doesn't change. Why? Because the analogies they tested (king:queen, Paris:France) don't represent the kind of semantic knowledge their translator needs. Always evaluate on YOUR task.

Bias in Embeddings

Bolukbasi et al. (2016) demonstrated that Word2Vec embeddings trained on Google News encode systematic gender stereotypes. The embedding space has a "gender direction" (the vector from "he" to "she"), and many occupation words are displaced along this direction in stereotypical ways:

| Word | Closer to "he" | Closer to "she" |
| --- | --- | --- |
| programmer | ✓ | |
| homemaker | | ✓ |
| doctor | ✓ | |
| nurse | | ✓ |
| architect | ✓ | |
| librarian | | ✓ |

The embeddings aren't "wrong" — they accurately reflect the biases in the training data. But when these embeddings are used in hiring algorithms, search rankings, or loan applications, they amplify and perpetuate existing societal biases. A resume screening tool using biased embeddings might rank "he programmed in C++" higher than "she programmed in C++" even though they describe the same skill.

Debiasing Methods

Hard debiasing (Bolukbasi et al., 2016) identifies the "gender direction" via PCA on the difference vectors of gendered word pairs (he/she, man/woman, king/queen). It then removes the component along that direction from words that should be gender-neutral, and adjusts the gendered pairs so that each neutral word ends up equidistant from both members of a pair. Conceptually: if "doctor" is currently displaced toward "he," move it to the midplane so it's equally close to "he" and "she."
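The core projection step can be sketched as follows. This is a simplified illustration, not the full published procedure: it uses a single he/she difference vector in place of PCA over many pairs, the 3-D vectors are invented, and "he" and "she" are placed symmetrically so that the projection alone yields equal distances.

```python
import numpy as np

def neutralize(vec, direction):
    """Remove the component of `vec` that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return vec - (vec @ unit) * unit

# Hypothetical toy embeddings; axis 0 plays the role of the gender direction.
he     = np.array([ 1.0, 0.2, 0.0])
she    = np.array([-1.0, 0.2, 0.0])
doctor = np.array([ 0.4, 0.5, 0.7])   # displaced toward "he"

gender_dir = he - she                  # stand-in for the PCA-derived direction
debiased = neutralize(doctor, gender_dir)

print(debiased)                                    # gender component removed
print(np.linalg.norm(doctor - he), np.linalg.norm(doctor - she))       # unequal
print(np.linalg.norm(debiased - he), np.linalg.norm(debiased - she))   # equal
```

Before the projection "doctor" sits closer to "he"; afterwards it has no component along the gender direction and is equally far from both words. Only the chosen direction is affected, which is why bias along other directions can survive.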

This works for the specific bias dimension you identify, but critics note that other biases may remain in dimensions you didn't think to test. Bias in embeddings is an active research area with no perfect solution.

Bias in Embeddings

The vertical axis represents the "gender direction." Occupations displaced upward are closer to "he," downward closer to "she." Toggle between biased (2013) and debiased views. Drag the slider to see partial debiasing.

Why is evaluating embeddings only on word analogies insufficient?

Chapter 9: Connections

Word vectors are the foundation. Every neural NLP model — from simple sentiment classifiers to GPT-4 — starts by converting words (or subwords) into dense vectors. What we covered in this lesson is the first generation: static embeddings where each word gets one vector regardless of context.

The next generation — contextual embeddings (ELMo, BERT, GPT) — gives each word a different vector depending on its sentence context. "Bank" near "river" gets a different embedding than "bank" near "money." But the training principle is the same: predict words from their neighbors.

One-Hot: Every word is an island. No similarity. No relationships.
Word2Vec / GloVe (2013–14): Dense vectors from co-occurrence. Analogies emerge. One vector per word.
ELMo (2018): Contextual embeddings from bidirectional LSTMs. Different vector per context.
BERT / GPT (2018–19): Transformer-based. Attention replaces RNNs. Pre-train on massive data.
Foundation Models (2020+): Scale up transformers. GPT-3, GPT-4, LLaMA. Emergent abilities at scale.

Deep-Dive Papers

This lesson covered the concepts. For the full mathematical and experimental details, explore these papers in the Veanors section:

| Paper | Key contribution |
| --- | --- |
| Word2Vec (Mikolov 2013) | CBOW and Skip-gram architectures |
| Negative Sampling (Mikolov 2013) | Efficient training via negative samples |
| GloVe (Pennington 2014) | Co-occurrence matrix factorization |
| Tuning, Not Model (Levy 2015) | Hyperparameters matter more than architecture |
| Evaluation Methods (Schnabel 2015) | How to evaluate embeddings properly |
| Why Word2Vec Works (Arora 2016) | Theoretical analysis of Skip-gram |
| Polysemy as Superposition (Arora 2018) | Multiple meanings as vector sums |
| Optimal Dimensions (Yin 2018) | How many dimensions do you need? |

Method Comparison

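| Method | Year | Type | Contextual? | Handles polysemy? |
| --- | --- | --- | --- | --- |
| Word2Vec | 2013 | Prediction | No | No |
| GloVe | 2014 | Count + Prediction | No | No |
| FastText | 2016 | Prediction (subword) | No | No (but handles OOV) |
| ELMo | 2018 | BiLSTM language model | Yes | Yes |
| BERT | 2019 | Transformer MLM | Yes | Yes |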


"You shall know a word by the company it keeps." — J.R. Firth, 1957. Every embedding method since has been a footnote to this insight.