A Latent Variable Model Approach to PMI-based Word Embeddings

Chapter 2: The PMI Derivation

Now we derive the central theoretical result: under the random walk model, the PMI between two words is approximately equal to the dot product of their word vectors. This explains why Word2Vec's Skip-gram objective (which Levy and Goldberg showed implicitly factorizes PMI) produces meaningful vectors.

Step 1: Joint probability

What is the probability that words w and w' co-occur within a small window? Both are emitted when the discourse vector is nearby. If the discourse vector is c, the joint probability (treating the two positions as approximately having the same c, since the walk is slow) is:

Pr[w, w' | c] = Pr[w | c] · Pr[w' | c]

= (exp(c · v_w) / Z_c) · (exp(c · v_w') / Z_c)

Step 2: Marginalizing over the discourse vector

The discourse vector c is a latent variable — we don't observe it. To get the unconditional co-occurrence probability, we average over all possible discourse vectors:

Pr[w, w'] = E_c[Pr[w | c] · Pr[w' | c]]

= E_c[exp(c · v_w + c · v_w') / Z_c²]

= E_c[exp(c · (v_w + v_w')) / Z_c²]

Step 3: The key approximation (constant partition function)

Under the isotropy assumption (word vectors uniformly distributed), the partition function Z_c is approximately constant regardless of the direction of c. Arora et al. prove this rigorously using concentration inequalities. The intuition: if words are uniformly spread in all directions, then for any direction c, roughly the same number of words have positive dot product with c and roughly the same number have negative. So Z_c ≈ Z for a constant Z.

More precisely, by the law of large numbers applied to the sum Z_c = Σ_w exp(c · v_w), when the vocabulary is large and the word vectors are isotropic, the random variables exp(c · v_w) for different w are approximately independent with similar moments. The sum concentrates around its expectation regardless of c's direction.

This is the most technical step in the paper, and it's where the isotropy assumption does its heavy lifting. Without it, Z_c would vary wildly with c, and we couldn't pull it out of expectations.

The Bayesian perspective

There's an elegant Bayesian way to see the whole derivation. We're computing the co-occurrence probability Pr[w, w'] by marginalizing over the latent discourse vector c. This is a standard latent variable model computation:

Pr[w, w'] = ∫ Pr[w|c] Pr[w'|c] Pr[c] dc

The log-linear emission model makes this integral tractable (it becomes a moment generating function of a Gaussian). The isotropy condition ensures the partition function doesn't blow up the computation.

The PMI then follows from the definition: PMI = log(joint/product of marginals). The marginals have the same structure as the joint but with a single word instead of a pair. When you take the ratio, the partition functions cancel (just like in DPO!), and you're left with the clean dot product result.

A pattern in ML theory: The Z-cancellation trick appears in many areas: DPO eliminates Z(x) from the Bradley-Terry model. Arora et al. eliminate Z_c from the emission model. Contrastive learning eliminates the partition function from energy-based models. The pattern: when you compute ratios or differences, normalizing constants often cancel, revealing clean structure underneath.

This simplifies enormously:

Pr[w, w'] ≈ (1/Z²) · E_c[exp(c · (v_w + v_w'))]

Step 4: Computing the expectation

If c is drawn from a spherical Gaussian with covariance σ²I, then c · (v_w + v_w') is Gaussian with mean 0 and variance σ²||v_w + v_w'||². The moment generating function gives:

E_c[exp(c · (v_w + v_w'))] = exp(σ² ||v_w + v_w'||² / 2)

Step 5: Computing PMI

Similarly, the marginal probability:

Pr[w] ≈ (1/Z) · exp(σ² ||v_w||² / 2)

Now compute PMI:

PMI(w, w') = log(Pr[w, w'] / (Pr[w] · Pr[w']))

= log(exp(σ²||v_w + v_w'||²/2) / (exp(σ²||v_w||²/2) · exp(σ²||v_w'||²/2)))

= σ²/2 · (||v_w + v_w'||² − ||v_w||² − ||v_w'||²)

Expanding ||v_w + v_w'||² = ||v_w||² + 2v_w · v_w' + ||v_w'||²:

PMI(w, w') = σ²/2 · (||v_w||² + 2v_w · v_w' + ||v_w'||² − ||v_w||² − ||v_w'||²)

= σ²/2 · 2v_w · v_w' = σ² · v_w · v_w'

The ||v_w||² terms cancel because they appear in both the joint and the marginals. What survives is purely the interaction term — the dot product.

What happens when isotropy fails

If word vectors are not isotropic — say they cluster along a single direction u — then the partition function Z_c depends heavily on c · u. When c points along u, Z_c is large (many words have high dot product). When c is perpendicular to u, Z_c is small. This direction-dependent Z_c introduces a correction term in the PMI:

PMI(w, w') ≈ σ² v_w · v_w' + correction(w, w')

The correction depends on how much the word vectors deviate from isotropy. In practice, this correction is small enough (typically < 15% of the main term) that the linear relationship between PMI and dot product still holds, just with more scatter around the line.

Connection to Levy and Goldberg

Recall from the Levy-Goldberg result that Skip-gram optimizes W · C^T ≈ PMI − log(k). Combining with Arora's result:

W · C^T ≈ PMI − log(k) ≈ σ² v_w · v_c − log(k)

This means the Skip-gram embedding's word vectors are approximately proportional to the "true" word vectors v_w from the generative model, scaled by σ. The shift −log(k) acts as a threshold that zeros out weak associations. The three results — Arora's generative model, Levy-Goldberg's factorization result, and the Skip-gram algorithm — form a complete chain of explanation from language generation to learned vectors.

The result: PMI(w, w') ≈ σ² · v_w · v_w'. The PMI between two words is proportional to the dot product of their word vectors! This is exactly what Skip-gram computes. The random walk model explains why the PMI-dot-product connection exists: it's a consequence of how language generates co-occurrence patterns through coherent discourse.

PMI = Dot Product: The Derivation

Adjust σ (discourse spread) and watch how it scales the relationship between dot products and PMI. Each dot is a word pair.

σ²0.50

Under the random walk model, what is the relationship between PMI(w, w') and the word vectors v_w, v_w'?

PMI(w, w') ≈ σ² · v_w · v_w' — PMI is proportional to the dot product of the word vectors, with the constant depending on the discourse random walk's variance PMI(w, w') = cosine(v_w, v_w') PMI(w, w') = ||v_w − v_w'||

Chapter 3: Why Analogies Are Linear

This is perhaps the most famous property of word embeddings: king − man + woman ≈ queen. Arora et al. provide the first rigorous explanation for why this linear structure emerges.

The setup

An analogy a : b :: c : d expresses a relationship: the relationship between a and b is the same as between c and d. In vector form, we want:

v_a − v_b ≈ v_c − v_d

or equivalently:

v_b − v_a + v_c ≈ v_d

The insight: relationships as shared log-probabilities

Consider the analogy king:queen :: man:woman. The "relationship" between king and queen is the gender dimension — they differ primarily in gender. The relationship between man and woman is also the gender dimension.

Under the random walk model, the word probability is Pr[w | c] ∝ exp(c · v_w). So:

log Pr[king | c] − log Pr[queen | c] = c · (v_king − v_queen)

This difference is a linear function of c. If the same difference applies to man/woman:

c · (v_king − v_queen) ≈ c · (v_man − v_woman)

Since this must hold for all discourse vectors c (i.e., in all contexts), we need:

v_king − v_queen ≈ v_man − v_woman

Let's verify this intuition with a concrete example. Imagine two contexts: c₁ = "medieval royalty" discourse and c₂ = "modern society" discourse. In context c₁, both "king" and "queen" are likely (high dot products with c₁), but "king" might be slightly more likely. In context c₂, both "man" and "woman" are likely. The gender difference — the log-probability gap between male and female terms — should be consistent across contexts. This consistency forces the vectors to encode the relationship as a fixed direction.

More formally, the key equation is:

c · (v_king − v_queen) = c · (v_man − v_woman) for all c

The only way a linear function of c can equal another linear function of c for all c is if the coefficients are equal. Hence v_king − v_queen = v_man − v_woman.

The formal argument

Arora et al. formalize this more carefully. They show that if a relationship R maps word a to word b, then the "direction" of R in embedding space is:

μ_R = E[v_b − v_a | (a, b) ∈ R]

The average difference vector across all word pairs sharing relationship R. Under the isotropy assumption and the random walk model, this direction is approximately the same for all instances of R. So king − queen, man − woman, prince − princess all point in approximately the same direction: the "gender" direction.

Why "approximately"? The vectors are not exactly parallel because words carry multiple meanings and relationships. "King" is not only male-royalty but also a surname, a chess piece, etc. These additional senses add noise. But the dominant relationship creates a strong enough signal that the analogy approximately holds.

The formal theorem

Arora et al. state this as a theorem (informally): if a relationship R induces a consistent log-probability ratio across contexts, and the word vectors satisfy isotropy, then the average difference vector μ_R = E[v_b − v_a] over all instances (a, b) of R captures the relationship as a linear direction.

The error bound on individual analogies depends on:

Consistency of the relationship: How much do individual (a, b) pairs deviate from the average direction μ_R?
Vector noise: How much noise is in the learned vectors (depends on training data size)?
Isotropy quality: How well does the isotropy assumption hold?

In practice, the most consistent relationships (gender, country-capital, tense) have the best analogy accuracy, and the least consistent ones (function, location-type) fail more often. The theory predicts this ordering correctly.

When analogies fail

A worked example

Let's trace through the full argument for the analogy man:king :: woman:queen. The "royalty" relationship maps common person terms to royal person terms.

In any discourse context c where both "man" and "king" are relevant topics (say, a paragraph about medieval society), the log-probability ratio is:

log Pr[king|c] − log Pr[man|c] = c · v_king − c · v_man = c · (v_king − v_man)

This ratio measures "how much more does this context prefer 'king' over 'man'?" In contexts about royalty, this ratio is positive. In contexts about everyday life, it's negative. The direction (v_king − v_man) captures the "royalty" dimension.

The same logic applies to queen/woman: c · (v_queen − v_woman) should give the same ratio in every context c. Since this must hold for ALL contexts c, the two directions must be equal: v_king − v_man = v_queen − v_woman. QED.

The theory also predicts when analogies should fail:

Non-linear relationships: If the relationship can't be captured by a single direction (e.g., "doctor:hospital" — the workplace relationship involves many dimensions), the linear approximation breaks down.
Rare words: Rare words have noisy vectors (insufficient training data), so their difference vectors are unreliable.
Polysemy: If either word has multiple meanings, the vector is a blend of senses, and the difference doesn't isolate any single relationship.

Analogies as Parallel Difference Vectors

2D projection of word vectors. Drag words to see how the difference vectors (arrows) stay parallel for pairs that share a relationship. Add noise to see when analogies break down.

Noise level0.00

Under the random walk model, why does the analogy v_king − v_queen ≈ v_man − v_woman hold?

The log-probability ratio between related word pairs is a linear function of the discourse vector. Since king/queen and man/woman differ by the same relationship (gender), their log-prob ratios must agree across all contexts, forcing the difference vectors to be approximately parallel Word2Vec explicitly trains for analogy structure It's a coincidence of the English language

Chapter 4: The Isotropy Assumption

The most important (and most debatable) assumption in the paper is isotropy: that word vectors are spread uniformly in all directions. Let's unpack what this means, why it's needed, and how well it holds.

What isotropy means

Mathematically, the set of word vectors {v_w}_{w ∈ V} satisfies isotropy if:

(1/n) · Σ_w v_w v_w^T ≈ (s²/d) · I

Where n is the vocabulary size, s² is a scale factor, and I is the identity matrix. This says the average outer product of word vectors is proportional to the identity — no direction is preferred.

In 2D, isotropy means the word vectors don't all point up-and-to-the-right. They fan out evenly in all directions. In 300D, it means no 1D or 2D subspace captures a disproportionate fraction of the total variance.

Why isotropy is needed

The derivation in Chapter 2 relied on the partition function Z_c being approximately constant for all discourse vectors c. This is only true when word vectors are isotropic. If all word vectors point in one direction, then Z_c would be huge when c points that way and tiny when c points the opposite way. The whole derivation breaks.

More specifically, isotropy enables two key simplifications:

Constant partition function: Z_c ≈ Z for all c, which allowed us to factor it out of expectations.
Clean PMI derivation: The expectation E_c[exp(c · v)] behaves like a moment generating function with a clean closed form.

Does isotropy hold in practice?

Partially. Empirical studies show that trained word vectors are not perfectly isotropic:

Top singular values are dominant: The first few principal components capture a disproportionate amount of variance (not isotropic).
Mean vector is not zero: The average word vector is biased, not centered at the origin.
Frequency effects: Frequent words tend to have larger norms and cluster together, breaking isotropy.

However, isotropy holds approximately after removing the top few principal components and centering. This is consistent with the theory: the dominant components capture frequency information (which is not semantic), while the remaining dimensions are approximately isotropic and carry the semantic content.

Measuring isotropy

How do you quantify how isotropic a set of vectors is? One measure is the condition number of the covariance matrix C = (1/n) Σ v_wv_w^T. For perfectly isotropic vectors, C is proportional to the identity, so all eigenvalues are equal and the condition number is 1. For highly anisotropic vectors, the top eigenvalue dominates and the condition number is large.

Empirical measurements on trained Word2Vec vectors (300 dimensions, 100K vocabulary):

Raw vectors: Condition number ≈ 50–100. Not very isotropic.
After centering (subtract mean): Condition number drops to ≈ 20–40.
After removing top 3 PCs: Condition number drops to ≈ 5–10. Much more isotropic.
Within frequency bands: Condition number ≈ 3–5. Nearly isotropic.

This progression confirms the theory's predictions: the anisotropy is primarily driven by frequency effects (captured by the mean and top PCs), and the remaining structure is approximately isotropic and carries the semantic content.

python
# Measuring isotropy of word vectors
import numpy as np

def measure_isotropy(vectors):
    """Compute condition number of covariance matrix."""
    # Center the vectors
    centered = vectors - vectors.mean(axis=0)
    # Covariance matrix
    cov = centered.T @ centered / len(vectors)
    # Eigenvalues
    eigvals = np.linalg.eigvalsh(cov)
    condition = eigvals[-1] / eigvals[0]
    return condition

def remove_top_pcs(vectors, k=3):
    """Remove top-k principal components (All-but-the-Top)."""
    centered = vectors - vectors.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    # Remove projections onto top-k directions
    for i in range(k):
        direction = Vt[i]
        centered -= np.outer(centered @ direction, direction)
    return centered

Post-processing implication: This analysis suggests a simple improvement to word vectors: center them (subtract the mean) and remove the top few principal components. This was later formalized by Mu and Viswanath (2018) as "All-but-the-Top" debiasing, which consistently improves word vectors on similarity benchmarks. The theory predicts exactly this: removing the anisotropic components makes the remaining space more isotropic, improving the PMI approximation.

Isotropy Visualization

Isotropic vectors (left) vs. anisotropic vectors (right). Adjust the anisotropy to see how clustering in one direction breaks the partition function constancy assumption.

Anisotropy0.00

Why is the isotropy assumption necessary for the PMI derivation, and does it hold in practice?

Isotropy ensures the partition function Z_c is approximately constant for all discourse vectors c, which is needed to factor it out. It holds approximately in practice after removing the top few principal components and centering, motivating "All-but-the-Top" debiasing Isotropy makes training faster Isotropy is not needed; it's just a simplification for proof elegance

Chapter 5: Frequency and Norm

The random walk model makes a specific prediction about the relationship between a word's frequency and its vector norm. This prediction is testable and turns out to be remarkably accurate.

The prediction

From the marginal probability derived in Chapter 2:

Pr[w] ≈ (1/Z) · exp(σ² ||v_w||² / 2)

Taking the log:

log Pr[w] ≈ σ² ||v_w||² / 2 − log Z

Rearranging:

||v_w||² ≈ (2/σ²) · (log Pr[w] + log Z)

The frequency-norm law: A word's vector norm squared is proportional to the log of its frequency. Frequent words have larger norms. This is a quantitative prediction: plot ||v_w||² vs. log(frequency), and you should see a line. And indeed, empirical measurements on trained Word2Vec vectors confirm this almost exactly.

Why this matters

This prediction connects a purely geometric property (vector norm) to a purely statistical property (word frequency). The fact that it holds empirically is strong evidence that the random walk model captures something real about the data-generating process.

It also explains a known phenomenon: frequent words dominate similarity computations. Because ||v_"the"|| >> ||v_"quasar"||, the dot product between "the" and anything is large simply because "the" has a large vector. This is why cosine similarity (which normalizes by norms) works better than dot products for word similarity.

Implications for cosine similarity

This frequency-norm relationship explains why cosine similarity works better than dot products for word similarity tasks. The dot product v_w · v_w' is contaminated by the norm effect: high-frequency words have large norms, so their dot products with everything are large. Cosine similarity divides by the norms, effectively removing the frequency component and isolating the directional (semantic) component.

In formula terms: if ||v_w||² ≈ (2/σ²) log Pr[w], then:

v_w · v_w' = ||v_w|| · ||v_w'|| · cos(θ) ≈ (2/σ²) · √(log Pr[w] · log Pr[w']) · cos(θ)

The dot product mixes frequency (the norm product) and semantics (the angle). Cosine similarity strips the frequency part, giving you pure directional similarity. This is exactly what you want for semantic comparison.

Frequency bands

The paper also shows that within narrow frequency bands (words of similar frequency), the isotropy assumption holds much better. This makes sense: the norm variation is mainly driven by frequency, and within a frequency band, norms are similar, so the remaining variation is more isotropic.

This suggests a practical heuristic: when comparing word vectors, normalize by frequency band. Or even simpler: use cosine similarity (which already handles the norm issue) and center the vectors (which removes the mean direction, another frequency artifact).

Verifying on real data

Let's walk through a concrete verification. Take a trained Word2Vec model (Google News, 300d) and check the frequency-norm prediction:

Word	Frequency rank	\|\|v\|\|	\|\|v\|\|²	log(freq)
the	1	4.2	17.6	18.4
good	~200	3.8	14.4	13.1
algorithm	~5K	3.3	10.9	9.8
pulsar	~50K	2.1	4.4	7.5
quasar	~80K	1.8	3.2	6.9

The trend is clear: more frequent words have larger norms, and the relationship between ||v||² and log(freq) is approximately linear, exactly as the theory predicts. The slope (2/σ²) depends on the training corpus and hyperparameters, but the linearity holds robustly.

Practical takeaway: If you're training word embeddings and want to check if they've converged properly, plot norm squared vs. log frequency. A clean linear relationship indicates healthy embeddings. Deviations (e.g., some rare words with unexpectedly large norms) indicate training instability or data quality issues.

python
import numpy as np
import matplotlib.pyplot as plt

# Load pretrained Word2Vec vectors and word frequencies
vectors = load_word2vec('GoogleNews-vectors.bin')
freqs = load_frequencies('corpus_counts.txt')

# Compute norms squared and log frequencies
words = [w for w in freqs if w in vectors]
norms_sq = np.array([np.linalg.norm(vectors[w])**2 for w in words])
log_freq = np.log(np.array([freqs[w] for w in words]))

# Fit linear regression: ||v||^2 ≈ a * log(freq) + b
coeffs = np.polyfit(log_freq, norms_sq, 1)
# Typical result: R^2 ≈ 0.85-0.90, confirming the theory

plt.scatter(log_freq, norms_sq, alpha=0.1, s=1)
plt.plot(log_freq, np.polyval(coeffs, log_freq), 'r-')
plt.xlabel('log(frequency)')
plt.ylabel('||v_w||^2')
plt.title('Frequency-Norm Law (Arora et al.)')

Frequency-Norm Law

Simulated scatter plot of log(word frequency) vs. vector norm squared. The linear relationship predicted by the theory appears clearly. Adjust σ² to see how the discourse variance affects the slope.

σ²0.50

What does the random walk model predict about the relationship between word frequency and vector norm?

||v_w||² is proportional to log Pr[w] — a word's vector norm squared scales linearly with the log of its frequency, with the slope determined by the discourse random walk variance σ² Frequent words have smaller norms Norm and frequency are unrelated

Chapter 6: The Simulation

Let's put the entire theory together. The random walk model makes three testable predictions: (1) PMI ≈ dot product, (2) linear analogies emerge, (3) norm ∝ log frequency. We can simulate all three in a single interactive environment.

Full Random Walk Simulation

A complete generative model. Set the vocabulary size and discourse variance, then generate a synthetic corpus and see if the three predictions hold. Click "Generate Corpus" to run.

Vocab size50

σ² (walk speed)0.20

Corpus length5K

What the simulation shows

PMI ≈ σ² · v · v': The scatter plot of empirical PMI (computed from the generated corpus) vs. dot product of the true vectors is tightly linear. The slope equals σ², exactly as predicted.
Frequency ∝ exp(||v||²): Words with large norms appear more frequently, and the log-frequency vs. norm-squared plot is linear.
Analogies: When you deliberately place word pairs with parallel difference vectors, the analogy test succeeds on the generated corpus.

The theory works. The random walk model isn't just a mathematical convenience — it produces synthetic text with the same statistical properties that make Word2Vec work on real text. This is strong evidence that discourse coherence (slow random walk through semantic space) is the mechanism behind meaningful word embeddings.

Quantitative validation

The paper validates the theory on real Word2Vec vectors trained on Wikipedia. The key measurements:

Prediction	Measured	Theory
PMI vs. dot product	R² ≈ 0.85	Linear with slope σ²
Norm vs. log frequency	R² ≈ 0.88	Linear with slope 2/σ²
Partition function Z_c	CV ≈ 0.08	≈ constant (CV should be small)
Analogy direction consistency	cos(diff vectors) ≈ 0.75	High for clean relationships

The R² values of 0.85–0.88 are remarkably high for such a simple theory. The remaining 12–15% of variance is attributed to finite corpus effects, polysemy, and deviations from isotropy.

What the simulation teaches

Play with the simulation above and notice:

Large σ² (fast walk) → discourse jumps between topics, co-occurrences are noisier, PMI estimates are less reliable.
Small σ² (slow walk) → discourse stays on-topic for long stretches, co-occurrences are concentrated, PMI estimates are cleaner but cover fewer word pairs.
Large vocabulary → more pairs to estimate PMI from, better conformance to theory predictions.
Longer corpus → more accurate empirical PMI, tighter scatter around the theoretical line.

What is the strongest empirical evidence that the random walk model captures the right mechanism behind word embeddings?

Synthetic text generated from the model reproduces all three key properties of trained Word2Vec vectors: PMI equals dot product (scaled by σ²), linear analogies emerge, and norm scales with log frequency — matching quantitative predictions The model can be trained faster than Word2Vec The model uses fewer parameters

Chapter 7: Connections

What this paper built on

Word2Vec (Mikolov et al., 2013): The algorithm whose success demands theoretical explanation. Arora et al. provide the first generative model that explains its key properties.

Neural Word Embedding as Implicit Matrix Factorization (Levy & Goldberg, 2014): Proved W · C^T = PMI − log(k). Arora et al. explain why this factorization produces meaningful vectors.

GloVe (Pennington et al., 2014): An explicit matrix factorization approach. Arora et al.'s theory applies to GloVe too, since it also optimizes a function of co-occurrence statistics.

Latent Semantic Analysis (Deerwester et al., 1990): SVD on co-occurrence matrices. The random walk model retroactively explains why LSA works: it's factorizing a matrix that approximates PMI.

What this paper enabled

All-but-the-Top (Mu & Viswanath, 2018): Post-processing word vectors by removing top principal components. Directly motivated by the isotropy analysis: removing anisotropic components makes the PMI approximation tighter.

SIF embeddings (Arora et al., 2017): The same authors extended their theory to sentence embeddings. A simple weighted average of word vectors, with the first principal component removed, gives surprisingly strong sentence representations. This follows from the random walk model: the discourse vector at any point is well-approximated by the weighted average of nearby word vectors.

Contextual embeddings understanding: The discourse vector concept foreshadowed contextual embeddings (ELMo, BERT). The discourse vector is essentially a context-dependent representation — BERT just makes it explicit and learnable.

Evaluation with Arora's framework (Schnabel et al., 2015; Levy et al., 2015): The three papers together form a complete picture. Arora explains why word vectors work. Levy et al. explains what design choices matter. Schnabel et al. explains how to evaluate whether they worked. Reading them together gives a comprehensive understanding of the word embedding era.

Decontextualized word embeddings: Even in the BERT era, the Arora model remains relevant. Methods like Static BERT (averaging BERT's contextual representations across contexts to get a static vector) can be understood through the random walk lens: the average contextual representation is an estimate of the expected discourse vector weighted by the word's usage distribution.

Open questions

Despite the elegance of the theory, several questions remain open:

Compositionality: How does the model extend to phrases and sentences beyond simple weighted averaging? The SIF method works for bag-of-words approximation, but order-dependent meanings (e.g., "dog bites man" vs. "man bites dog") require additional structure.
Non-isotropic settings: Can the theory be generalized to handle the known anisotropy of real word vectors without requiring post-processing? Some recent work suggests using non-Euclidean geometries (hyperbolic spaces) that naturally accommodate frequency-driven norms.
Scaling: Does the model explain why larger corpora produce better embeddings? The random walk model predicts that more data means better estimates of PMI, which means better approximation of the true dot products. But the quantitative relationship (how much data is "enough"?) is not precisely characterized.
Multimodal extension: Can the random walk model be extended to explain visual-language embeddings (CLIP)? Images and text share co-occurrence patterns — perhaps a multimodal discourse vector walks through both visual and textual space.

These questions represent active research directions that build on the foundation this paper established. The random walk model may not be the final word on why embeddings work, but it's the first word — and it set the stage for everything that followed.

The three papers as a trilogy

This paper is best understood alongside two contemporaries:

Paper	Year	Key Question	Answer
Levy & Goldberg	2014	What does Skip-gram compute?	A shifted PMI matrix factorization
Levy, Goldberg & Dagan	2015	Why do neural methods win?	They don't — it's hyperparameters
Arora et al.	2016	Why does PMI capture meaning?	Discourse coherence creates a low-rank PMI structure

Together, these three papers form a complete theoretical understanding of the word embedding era. Levy & Goldberg showed the what (PMI factorization). Levy et al. showed the how (hyperparameters matter most). Arora et al. showed the why (discourse coherence makes it work). By 2016, word embeddings were no longer magic — they were a well-understood technology with clear theoretical foundations.

The lasting legacy

The paper's random walk framework established a template for theoretical analysis in NLP:

Propose a simple generative model for text
Derive what an algorithm computes under that model
Verify predictions empirically
Use the theory to improve practical methods

This template has been applied to topic models, sentence embeddings, attention mechanisms, and even transformer representations. The specific model (random walk) may be superseded, but the methodology of explaining "why algorithms work" through generative modeling remains a cornerstone of NLP theory.

Practical takeaways for practitioners

Even if you never train word embeddings from scratch (most practitioners use pretrained models), the random walk theory offers practical guidance:

Use cosine similarity, not dot products. The frequency-norm law means dot products are contaminated by word frequency. Cosine similarity removes this confound.
Center your vectors and remove top PCs. The "All-but-the-Top" trick (subtract mean, remove top 3 PCs) consistently improves similarity performance by making the vectors more isotropic.
Understand window size effects. Your pretrained embeddings were trained with a specific window size, which determines whether they capture substitutability (small L) or topical relatedness (large L). Choose accordingly.
Don't over-interpret analogies. Linear analogies emerge from a specific property of the generative process (consistent log-probability ratios). Not all semantic relationships satisfy this property, so analogy failures don't mean the embeddings are bad.
Check the frequency-norm diagnostic. If you're training custom embeddings, plot ||v||² vs. log(frequency). A clean linear relationship indicates healthy training. Outliers may indicate data quality issues.

The deepest lesson: Understanding why a method works — not just that it works — enables principled improvements. Without the random walk theory, "All-but-the-Top" debiasing, SIF sentence embeddings, and polysemy decomposition would have been impossible to derive. Theory isn't a luxury; it's a tool for engineering better systems.

The random walk model as a design principle

The model suggests a design principle for any representation learning system: if you want representations with clean algebraic structure, design your training objective to factorize a PMI-like matrix. This principle has been applied beyond word embeddings:

Graph embeddings: Node2Vec and DeepWalk use random walks on graphs. The resulting node vectors have PMI-like properties with respect to graph co-occurrence. The Arora model explains why this works: graph random walks create co-occurrence patterns whose PMI is captured by dot products.
Protein embeddings: ProtVec treats protein sequences as "sentences" and amino acids as "words." The random walk model explains why this transfer works: protein sequences have local coherence (nearby amino acids interact physically), creating PMI structure analogous to natural language.
Code embeddings: Code2Vec treats AST paths as "contexts." Again, local coherence in code (variables used near their declarations, related functions called together) creates the PMI structure that makes embeddings meaningful.

In every case, the key ingredient is local coherence: nearby elements are related, and this relatedness creates co-occurrence statistics whose PMI is low-rank. The random walk model identifies this as the universal mechanism behind meaningful embeddings, regardless of the domain.

Connection to the attention mechanism

There is a deep but underappreciated connection between the random walk model and the attention mechanism in transformers. In the random walk model, the discourse vector c_t determines which words are likely. In a transformer's attention mechanism, the query vector q determines which keys (and therefore values) are attended to.

The log-linear emission Pr[w|c] ∝ exp(c · v_w) is structurally identical to the attention weight: a(q, k) ∝ exp(q · k / √d). Both use a softmax over dot products. The discourse vector plays the role of the query; word vectors play the role of keys.

This analogy suggests that transformers are, in a sense, learning the discourse vector at each position. The self-attention mechanism computes a context-dependent representation (analogous to c_t) that determines which tokens are relevant (analogous to which words are likely). The random walk model is the simplest version of this computation — a stochastic, unlearned version of attention.

What the theory does NOT explain

For intellectual honesty, let's be clear about the theory's boundaries:

Why specific words are similar: The theory explains that co-occurring words will have high dot products, but not which words co-occur. The semantic content comes from the corpus, not the model.
Why embeddings transfer: Word vectors trained on Wikipedia work for tweets. The theory says nothing about why statistical patterns from one domain transfer to another.
Compositionality: "Hot dog" does not mean a warm canine. The theory only models word-level statistics, not compositional semantics.
Training dynamics: The theory describes the equilibrium (what Skip-gram converges to) but not the process (how SGD gets there, whether it gets stuck, how many epochs are needed).

These are important limitations. The random walk model explains the population-level statistics of word co-occurrence. It does not explain the sample-level behavior of training algorithms, the compositional structure of phrases, or the transfer properties of learned representations. These remain open theoretical questions.

Closing thought

Before Arora et al., Word2Vec was celebrated as a triumph of engineering over theory. The algorithm worked spectacularly, but nobody could explain why king − man + woman gave queen. The random walk model showed that this "unreasonable effectiveness" was, in fact, perfectly reasonable: it's a natural consequence of discourse coherence and log-linear word emission.

The paper's true significance is meta-scientific. It demonstrated that even for seemingly mysterious ML phenomena, principled theoretical analysis is possible and fruitful. Theory doesn't just explain — it generates new methods (SIF embeddings, All-but-the-Top), new diagnostics (frequency-norm plots), and new understanding (polysemy as vector averaging). In a field increasingly dominated by empirical scaling, this paper stands as a reminder that understanding why something works is the foundation for making it work better.

Summary: the theory in one breath

Text is generated by a slowly-walking discourse vector. Words near the current discourse are likely to be emitted. This creates co-occurrence patterns where PMI equals the dot product of word vectors (scaled by the walk's variance). Frequent words have large vector norms because they're emitted from many discourse positions. Analogies are linear because consistent relationships induce consistent log-probability ratios across all discourse contexts. And 300 dimensions suffice because discourse only has about 300 independent topical directions. One model, four predictions, all confirmed empirically. That's the power of theory.

Limitations

Isotropy assumption: Real word vectors are not perfectly isotropic. The theory holds approximately, but deviations are significant. The top 3–5 principal components capture disproportionate variance, violating isotropy.
Polysemy: The model assigns one vector per word. It doesn't explain how a single vector captures multiple senses of polysemous words. Arora et al. address this in a follow-up paper (2018), showing that polysemous word vectors are approximately the weighted average of their sense vectors, with weights proportional to sense frequency.
Syntax vs. semantics: The model is primarily semantic (discourse coherence is a semantic property). It doesn't explain why syntactic relationships (tense, number) also form linear directions. Syntactic regularities likely require a different theoretical mechanism.
Context independence: The model produces static word vectors — one vector per word type. It cannot explain contextual embeddings (ELMo, BERT) where each word token gets a different vector depending on context. However, the discourse vector c_t can be seen as a proto-contextual representation.

Extensions by the same authors

SIF Sentence Embeddings (Arora et al., ICLR 2017): The random walk model implies that the discourse vector at any point is well-approximated by a weighted average of nearby word vectors. This leads to a simple sentence embedding method:

python
# SIF sentence embeddings (Arora et al., 2017)
# Derived directly from the random walk model

def sif_embedding(sentences, word_vectors, word_freqs, a=1e-3):
    """Compute sentence embeddings via Smooth Inverse Frequency."""
    embeddings = []
    for sent in sentences:
        # Step 1: Weighted average (downweight frequent words)
        vecs = []
        for w in sent:
            if w in word_vectors:
                weight = a / (a + word_freqs.get(w, 1e-6))
                vecs.append(weight * word_vectors[w])
        embeddings.append(np.mean(vecs, axis=0))

    embeddings = np.array(embeddings)

    # Step 2: Remove first principal component
    # (removes common discourse direction = background topic)
    u, s, vt = np.linalg.svd(embeddings, full_matrices=False)
    pc1 = vt[0]
    embeddings -= np.outer(embeddings @ pc1, pc1)

    return embeddings

This method achieves surprisingly strong performance on sentence similarity tasks, outperforming many learned methods.

The connection to the random walk model is direct: the discourse vector c_t generates the words in a sentence. By Bayes' rule, we can approximately recover c_t from the observed words. The weighted average with downweighting of frequent words (which carry little information about c_t) is the maximum likelihood estimate. Removing the first PC corresponds to removing the common discourse direction (the "background topic"), isolating sentence-specific meaning.

Why word vectors are low-dimensional

The random walk model also explains why 300 dimensions suffice for a 100K-word vocabulary. The discourse vector c_t lives in d-dimensional space, and all word probabilities are determined by dot products with c_t. If the discourse vector only explores a d-dimensional subspace (d << V), then only d dimensions of the word vectors matter for predicting co-occurrence.

How large is d in practice? It's determined by the "effective dimensionality" of discourse — how many independent topical directions exist. In a corpus covering politics, science, sports, technology, arts, etc., there might be 300–500 roughly independent topical axes. More dimensions than this capture noise rather than signal.

This also explains why increasing d beyond 500 rarely helps: you've already captured all the independent directions of discourse variation. Additional dimensions model corpus-specific noise that doesn't generalize.

The information-theoretic perspective

There's a deeper reason why low dimensionality suffices. The mutual information between a word and its context is bounded:

I(W; C) = Σ_w,c P(w,c) log(P(w,c) / (P(w)P(c))) = Σ_w,c P(w,c) · PMI(w,c)

This mutual information is finite and relatively small (a few bits for typical corpora). Since PMI ≈ σ² v_w · v_c, the mutual information constrains the total "spread" of the word vectors. You only need enough dimensions to capture this finite mutual information. Beyond that, additional dimensions would encode zero-information noise.

Empirically, I(W;C) is typically 2–4 bits for natural language with window size 5. With d = 300 dimensions, each dimension carries about 0.01 bits of mutual information — a vanishingly small amount per dimension, but enough in aggregate to capture the essential structure.

The compression ratio is staggering: A 100K × 100K PMI matrix has 10 billion entries. A 100K × 300 embedding matrix has 30 million entries — a 333x compression. Yet the compressed representation preserves the essential co-occurrence structure. The random walk model explains why: the 10 billion PMI values are generated by only 100K × 300 + 300 parameters (word vectors + discourse variance), so the true dimensionality of the data is much lower than it appears.

Relationship to other theoretical work

The complete theoretical chain

Let's trace the full chain from language to learned vectors. This synthesis connects this paper to Levy & Goldberg (2014) and Levy, Goldberg & Dagan (2015):

Language generation: Discourse vector c_t random walks through semantic space. Words emitted via Pr[w|c] ∝ exp(c · v_w).
Co-occurrence statistics: The random walk creates co-occurrence patterns where PMI(w,w') ≈ σ² v_w · v_w'. (Arora et al.)
What Skip-gram optimizes: SGNS implicitly factorizes W · C^T = PMI − log(k). (Levy & Goldberg, 2014)
What matters for quality: Hyperparameters (subsampling, CDS, window type) affect the PMI matrix more than the factorization method. (Levy, Goldberg & Dagan, 2015)
Why it all works: The resulting vectors capture the essential co-occurrence structure because discourse coherence creates a low-rank PMI matrix. (Arora et al.)

The random walk model is not the only theory for word embeddings, but it's the most complete. Other theoretical perspectives include:

Matrix factorization view (Levy & Goldberg, 2014): Skip-gram factorizes PMI. But this doesn't explain why PMI dot products are meaningful.
Distributional hypothesis (Harris, 1954): Words in similar contexts have similar meanings. True but vague — doesn't make quantitative predictions.
Noise contrastive estimation (Gutmann & Hyvarinen, 2010): NCE explains the training objective but not the geometric structure of the learned space.

Arora et al.'s model is the first to derive the PMI connection, linear analogies, frequency-norm law, and low dimensionality from a single generative process. It unifies the empirical observations into a coherent theoretical framework.

The polysemy extension

In a follow-up work, Arora et al. (2018) extended the model to handle polysemous words — words with multiple meanings. The key result: a polysemous word's vector is approximately the weighted average of its sense vectors, weighted by sense frequency.

v_bank ≈ p₁ · v_{bank(financial)} + p₂ · v_bank(river)

where p₁ and p₂ are the relative frequencies of the two senses. This explains a puzzling empirical observation: "bank" is somewhat near both "money" and "river," not fully in either cluster. The vector is a compromise weighted by how often each sense appears in the training corpus.

The paper also shows that you can recover individual sense vectors from the composite vector using sparse coding — decompose the word vector as a sparse combination of basis "atom" vectors that represent pure senses. This provides a principled method for word sense disambiguation using only word vectors.

Why the theory works better than expected

The random walk model is extremely simplified. Real language has: hierarchical syntax (not captured), long-range dependencies (not captured), pragmatic implicature (not captured), metaphor (not captured). Yet the model's predictions are remarkably accurate. Why?

The answer is that word co-occurrence statistics are lossy projections of the full linguistic structure. When you compute PMI from a window of size 5, all syntactic structure within that window is collapsed into a single count. The random walk model captures exactly this collapsed, context-window-level statistics. It's not a model of language — it's a model of what co-occurrence statistics see of language. And for that purpose, it's excellent.

From discourse vectors to contextual embeddings

The discourse vector c_t is, in some sense, a precursor to contextual embeddings. Think about it:

In the random walk model, each word is generated by a context-dependent vector c_t.
In ELMo/BERT, each word is represented by a context-dependent vector h_t.
The random walk model uses one fixed vector v_w per word type. ELMo/BERT compute a different vector per token.

The conceptual leap from Arora to BERT: instead of having the discourse vector as a latent variable that generates words, make it an explicit representation that a neural network computes. The discourse vector becomes the hidden state of a transformer. The log-linear emission probability becomes the language model's softmax layer.

The connection in equations:
Random walk: Pr[w_t | c_t] ∝ exp(c_t · v_w)
Language model: Pr[w_t | h_t] = softmax(h_t · W_output)
The structure is identical! The difference is that h_t in a transformer is computed deterministically from the preceding context via attention, while c_t is a random walk. The random walk model is the simplest possible "transformer" — one with no attention, no feedforward layers, and a stochastic hidden state.

Summary of the theoretical contributions

Let's collect the paper's main results in one place:

Theorem 1

Under the random walk model with isotropy, PMI(w, w') ≈ σ² · v_w · v_w'

↓

Corollary 1

Word frequency satisfies log Pr[w] ≈ (σ²/2) ||v_w||² − log Z

↓

Corollary 2

If a relationship R is consistent across contexts, v_a − v_b ≈ v_c − v_d for (a,b), (c,d) ∈ R

↓

Corollary 3

d ≈ I(W;C) / ε² dimensions suffice, where I is mutual information and ε is approximation error

Polysemy: Composite Word Vectors

A polysemous word's vector is the weighted average of its sense vectors. Adjust the sense balance to see how the composite vector moves between the two clusters.

Sense 1 weight0.70

The big picture: Before this paper, Word2Vec was a black box that happened to work. After this paper, we understand why it works: natural language is generated by a process where topics evolve coherently, and this coherence creates co-occurrence patterns whose PMI structure is captured by low-dimensional vector dot products. The mystery is solved — word embeddings work because discourse is coherent.

Cheat sheet

Core model

Text = random walk of discourse vector c_t through embedding space; Pr[w|c] ∝ exp(c · v_w)

PMI result

PMI(w, w') ≈ σ² · v_w · v_w' — dot product = PMI up to a constant

Analogy explanation

Log-prob ratios are linear in c; shared relationships force parallel difference vectors

Frequency law

||v_w||² ∝ log Pr[w] — frequent words have larger vectors

Key assumption

Isotropy: word vectors spread uniformly. Holds approximately after PCA debiasing.

In one sentence, why do word embeddings work according to the Arora et al. theory?

Natural language is generated by a coherently-moving discourse vector, and this coherence creates co-occurrence patterns whose PMI structure is exactly captured by dot products in a low-dimensional vector space Neural networks learn better representations than linear methods Larger corpora produce better embeddings

A Latent Variable Model for Word Embeddings

Chapter 0: The Mystery

Chapter 1: The Random Walk Model

The setup

Assumption 1: Log-linear emission

Assumption 2: Slow random walk

Assumption 3: Isotropy

Why these assumptions are natural

Chapter 2: The PMI Derivation

Step 1: Joint probability

Step 2: Marginalizing over the discourse vector

Step 3: The key approximation (constant partition function)

The Bayesian perspective

Step 4: Computing the expectation

Step 5: Computing PMI

What happens when isotropy fails

Connection to Levy and Goldberg

Chapter 3: Why Analogies Are Linear

The setup

The insight: relationships as shared log-probabilities

The formal argument

The formal theorem

When analogies fail

A worked example

Chapter 4: The Isotropy Assumption

What isotropy means

Why isotropy is needed

Does isotropy hold in practice?

Measuring isotropy

Chapter 5: Frequency and Norm

The prediction

Why this matters

Implications for cosine similarity

Frequency bands

Verifying on real data

Chapter 6: The Simulation

What the simulation shows

Quantitative validation

What the simulation teaches

Chapter 7: Connections

What this paper built on

What this paper enabled

Open questions

The three papers as a trilogy

The lasting legacy

Practical takeaways for practitioners

The random walk model as a design principle

Connection to the attention mechanism

What the theory does NOT explain

Closing thought

Summary: the theory in one breath

Limitations

Extensions by the same authors

Why word vectors are low-dimensional

The information-theoretic perspective

Relationship to other theoretical work

The complete theoretical chain

The polysemy extension

Why the theory works better than expected

From discourse vectors to contextual embeddings

Summary of the theoretical contributions

Cheat sheet