A random walk of a "discourse vector" through embedding space explains why Word2Vec works, why analogies are linear, and why word vectors are low-dimensional.
Word2Vec was a spectacular empirical success. Train it on a big corpus, and remarkable structure emerges from the learned vectors:
But why? Why should a simple prediction objective produce vectors with these properties? Why should word relationships be linear? Why should 300 dimensions be enough? Why should the dot product of two word vectors relate to their PMI?
Arora et al. provide the first rigorous theoretical framework that explains all of these properties simultaneously. Their key idea: model text as a random walk of a discourse vector through the embedding space. Words are emitted based on their proximity to the current discourse vector. This single generative assumption explains everything.
Click each mystery to see what the paper explains. All three emerge from a single generative model.
Here is Arora et al.'s generative model for text. It's elegantly simple.
Imagine a discourse vector ct that represents the "topic" or "context" at time step t. This vector lives in the same d-dimensional space as word vectors. At each time step, a word is emitted based on how close it is to the current discourse vector.
The probability of emitting word w at time t, given discourse vector ct, is:
Where vw is the word vector for w. In words: words whose vectors point in the same direction as the discourse vector are more likely to be emitted. If ct points toward the "royalty" cluster, words like "king," "queen," "throne" become likely.
Normalizing, this becomes:
Where Zc = Σw' exp(ct · vw') is the partition function.
The discourse vector moves as a random walk:
Where ηt is a small random perturbation. The key word is slow: the discourse vector changes gradually. In a paragraph about kings, ct stays near the "royalty" region for many time steps before drifting elsewhere.
This captures a fundamental property of natural language: discourse coherence. Consecutive words in a text are topically related because the discourse vector that generates them is nearly the same.
The word vectors vw are spread uniformly in all directions (isotropic) in the embedding space. No direction is systematically preferred. We'll examine this assumption carefully in Chapter 4.
The log-linear emission model is not arbitrary. It's the maximum entropy distribution subject to the constraint that the expected word vector under P[w|c] matches the discourse vector c. Maximum entropy is the least committal distribution — it makes the fewest assumptions beyond the constraint. So the model is saying: "given the topic c, emit words in the most neutral way possible that's consistent with the topic."
The slow random walk is an empirical observation about discourse. Read any paragraph: the topic doesn't teleport from quantum physics to cooking recipes between consecutive sentences. Topics evolve gradually. The random walk formalizes this continuity.
Together, these three assumptions form a remarkably compact model. Two equations (emission probability and walk dynamics) plus one regularity condition (isotropy) suffice to derive all of Word2Vec's key properties.
python # The generative model in code import numpy as np def generate_text(word_vectors, sigma, n_steps): """Generate text via random walk of discourse vector.""" d = word_vectors.shape[1] # embedding dimension V = word_vectors.shape[0] # vocabulary size c = np.zeros(d) # discourse vector starts at origin corpus = [] for t in range(n_steps): # Log-linear emission: Pr[w|c] ∝ exp(c · v_w) logits = word_vectors @ c logits -= logits.max() # numerical stability probs = np.exp(logits) probs /= probs.sum() w = np.random.choice(V, p=probs) corpus.append(w) # Slow random walk: c_{t+1} = c_t + η_t c += sigma * np.random.randn(d) return corpus
Watch a discourse vector (orange dot) random walk through 2D word space. Words near the current position light up — these are the words likely to be emitted. Click "Step" to advance or "Auto" to watch it walk.
Now we derive the central theoretical result: under the random walk model, the PMI between two words is approximately equal to the dot product of their word vectors. This explains why Word2Vec's Skip-gram objective (which Levy and Goldberg showed implicitly factorizes PMI) produces meaningful vectors.
What is the probability that words w and w' co-occur within a small window? Both are emitted when the discourse vector is nearby. If the discourse vector is c, the joint probability (treating the two positions as approximately having the same c, since the walk is slow) is:
The discourse vector c is a latent variable — we don't observe it. To get the unconditional co-occurrence probability, we average over all possible discourse vectors:
Under the isotropy assumption (word vectors uniformly distributed), the partition function Zc is approximately constant regardless of the direction of c. Arora et al. prove this rigorously using concentration inequalities. The intuition: if words are uniformly spread in all directions, then for any direction c, roughly the same number of words have positive dot product with c and roughly the same number have negative. So Zc ≈ Z for a constant Z.
More precisely, by the law of large numbers applied to the sum Zc = Σw exp(c · vw), when the vocabulary is large and the word vectors are isotropic, the random variables exp(c · vw) for different w are approximately independent with similar moments. The sum concentrates around its expectation regardless of c's direction.
This is the most technical step in the paper, and it's where the isotropy assumption does its heavy lifting. Without it, Zc would vary wildly with c, and we couldn't pull it out of expectations.
There's an elegant Bayesian way to see the whole derivation. We're computing the co-occurrence probability Pr[w, w'] by marginalizing over the latent discourse vector c. This is a standard latent variable model computation:
The log-linear emission model makes this integral tractable (it becomes a moment generating function of a Gaussian). The isotropy condition ensures the partition function doesn't blow up the computation.
The PMI then follows from the definition: PMI = log(joint/product of marginals). The marginals have the same structure as the joint but with a single word instead of a pair. When you take the ratio, the partition functions cancel (just like in DPO!), and you're left with the clean dot product result.
This simplifies enormously:
If c is drawn from a spherical Gaussian with covariance σ2I, then c · (vw + vw') is Gaussian with mean 0 and variance σ2||vw + vw'||2. The moment generating function gives:
Similarly, the marginal probability:
Now compute PMI:
Expanding ||vw + vw'||2 = ||vw||2 + 2vw · vw' + ||vw'||2:
The ||vw||2 terms cancel because they appear in both the joint and the marginals. What survives is purely the interaction term — the dot product.
If word vectors are not isotropic — say they cluster along a single direction u — then the partition function Zc depends heavily on c · u. When c points along u, Zc is large (many words have high dot product). When c is perpendicular to u, Zc is small. This direction-dependent Zc introduces a correction term in the PMI:
The correction depends on how much the word vectors deviate from isotropy. In practice, this correction is small enough (typically < 15% of the main term) that the linear relationship between PMI and dot product still holds, just with more scatter around the line.
Recall from the Levy-Goldberg result that Skip-gram optimizes W · CT ≈ PMI − log(k). Combining with Arora's result:
This means the Skip-gram embedding's word vectors are approximately proportional to the "true" word vectors vw from the generative model, scaled by σ. The shift −log(k) acts as a threshold that zeros out weak associations. The three results — Arora's generative model, Levy-Goldberg's factorization result, and the Skip-gram algorithm — form a complete chain of explanation from language generation to learned vectors.
Adjust σ (discourse spread) and watch how it scales the relationship between dot products and PMI. Each dot is a word pair.
This is perhaps the most famous property of word embeddings: king − man + woman ≈ queen. Arora et al. provide the first rigorous explanation for why this linear structure emerges.
An analogy a : b :: c : d expresses a relationship: the relationship between a and b is the same as between c and d. In vector form, we want:
or equivalently:
Consider the analogy king:queen :: man:woman. The "relationship" between king and queen is the gender dimension — they differ primarily in gender. The relationship between man and woman is also the gender dimension.
Under the random walk model, the word probability is Pr[w | c] ∝ exp(c · vw). So:
This difference is a linear function of c. If the same difference applies to man/woman:
Since this must hold for all discourse vectors c (i.e., in all contexts), we need:
Let's verify this intuition with a concrete example. Imagine two contexts: c1 = "medieval royalty" discourse and c2 = "modern society" discourse. In context c1, both "king" and "queen" are likely (high dot products with c1), but "king" might be slightly more likely. In context c2, both "man" and "woman" are likely. The gender difference — the log-probability gap between male and female terms — should be consistent across contexts. This consistency forces the vectors to encode the relationship as a fixed direction.
More formally, the key equation is:
The only way a linear function of c can equal another linear function of c for all c is if the coefficients are equal. Hence vking − vqueen = vman − vwoman.
Arora et al. formalize this more carefully. They show that if a relationship R maps word a to word b, then the "direction" of R in embedding space is:
The average difference vector across all word pairs sharing relationship R. Under the isotropy assumption and the random walk model, this direction is approximately the same for all instances of R. So king − queen, man − woman, prince − princess all point in approximately the same direction: the "gender" direction.
Arora et al. state this as a theorem (informally): if a relationship R induces a consistent log-probability ratio across contexts, and the word vectors satisfy isotropy, then the average difference vector μR = E[vb − va] over all instances (a, b) of R captures the relationship as a linear direction.
The error bound on individual analogies depends on:
In practice, the most consistent relationships (gender, country-capital, tense) have the best analogy accuracy, and the least consistent ones (function, location-type) fail more often. The theory predicts this ordering correctly.
Let's trace through the full argument for the analogy man:king :: woman:queen. The "royalty" relationship maps common person terms to royal person terms.
In any discourse context c where both "man" and "king" are relevant topics (say, a paragraph about medieval society), the log-probability ratio is:
This ratio measures "how much more does this context prefer 'king' over 'man'?" In contexts about royalty, this ratio is positive. In contexts about everyday life, it's negative. The direction (vking − vman) captures the "royalty" dimension.
The same logic applies to queen/woman: c · (vqueen − vwoman) should give the same ratio in every context c. Since this must hold for ALL contexts c, the two directions must be equal: vking − vman = vqueen − vwoman. QED.
The theory also predicts when analogies should fail:
2D projection of word vectors. Drag words to see how the difference vectors (arrows) stay parallel for pairs that share a relationship. Add noise to see when analogies break down.
The most important (and most debatable) assumption in the paper is isotropy: that word vectors are spread uniformly in all directions. Let's unpack what this means, why it's needed, and how well it holds.
Mathematically, the set of word vectors {vw}w ∈ V satisfies isotropy if:
Where n is the vocabulary size, s2 is a scale factor, and I is the identity matrix. This says the average outer product of word vectors is proportional to the identity — no direction is preferred.
In 2D, isotropy means the word vectors don't all point up-and-to-the-right. They fan out evenly in all directions. In 300D, it means no 1D or 2D subspace captures a disproportionate fraction of the total variance.
The derivation in Chapter 2 relied on the partition function Zc being approximately constant for all discourse vectors c. This is only true when word vectors are isotropic. If all word vectors point in one direction, then Zc would be huge when c points that way and tiny when c points the opposite way. The whole derivation breaks.
More specifically, isotropy enables two key simplifications:
Partially. Empirical studies show that trained word vectors are not perfectly isotropic:
However, isotropy holds approximately after removing the top few principal components and centering. This is consistent with the theory: the dominant components capture frequency information (which is not semantic), while the remaining dimensions are approximately isotropic and carry the semantic content.
How do you quantify how isotropic a set of vectors is? One measure is the condition number of the covariance matrix C = (1/n) Σ vwvwT. For perfectly isotropic vectors, C is proportional to the identity, so all eigenvalues are equal and the condition number is 1. For highly anisotropic vectors, the top eigenvalue dominates and the condition number is large.
Empirical measurements on trained Word2Vec vectors (300 dimensions, 100K vocabulary):
This progression confirms the theory's predictions: the anisotropy is primarily driven by frequency effects (captured by the mean and top PCs), and the remaining structure is approximately isotropic and carries the semantic content.
python # Measuring isotropy of word vectors import numpy as np def measure_isotropy(vectors): """Compute condition number of covariance matrix.""" # Center the vectors centered = vectors - vectors.mean(axis=0) # Covariance matrix cov = centered.T @ centered / len(vectors) # Eigenvalues eigvals = np.linalg.eigvalsh(cov) condition = eigvals[-1] / eigvals[0] return condition def remove_top_pcs(vectors, k=3): """Remove top-k principal components (All-but-the-Top).""" centered = vectors - vectors.mean(axis=0) U, S, Vt = np.linalg.svd(centered, full_matrices=False) # Remove projections onto top-k directions for i in range(k): direction = Vt[i] centered -= np.outer(centered @ direction, direction) return centered
Isotropic vectors (left) vs. anisotropic vectors (right). Adjust the anisotropy to see how clustering in one direction breaks the partition function constancy assumption.
The random walk model makes a specific prediction about the relationship between a word's frequency and its vector norm. This prediction is testable and turns out to be remarkably accurate.
From the marginal probability derived in Chapter 2:
Taking the log:
Rearranging:
This prediction connects a purely geometric property (vector norm) to a purely statistical property (word frequency). The fact that it holds empirically is strong evidence that the random walk model captures something real about the data-generating process.
It also explains a known phenomenon: frequent words dominate similarity computations. Because ||v"the"|| >> ||v"quasar"||, the dot product between "the" and anything is large simply because "the" has a large vector. This is why cosine similarity (which normalizes by norms) works better than dot products for word similarity.
This frequency-norm relationship explains why cosine similarity works better than dot products for word similarity tasks. The dot product vw · vw' is contaminated by the norm effect: high-frequency words have large norms, so their dot products with everything are large. Cosine similarity divides by the norms, effectively removing the frequency component and isolating the directional (semantic) component.
In formula terms: if ||vw||2 ≈ (2/σ2) log Pr[w], then:
The dot product mixes frequency (the norm product) and semantics (the angle). Cosine similarity strips the frequency part, giving you pure directional similarity. This is exactly what you want for semantic comparison.
The paper also shows that within narrow frequency bands (words of similar frequency), the isotropy assumption holds much better. This makes sense: the norm variation is mainly driven by frequency, and within a frequency band, norms are similar, so the remaining variation is more isotropic.
This suggests a practical heuristic: when comparing word vectors, normalize by frequency band. Or even simpler: use cosine similarity (which already handles the norm issue) and center the vectors (which removes the mean direction, another frequency artifact).
Let's walk through a concrete verification. Take a trained Word2Vec model (Google News, 300d) and check the frequency-norm prediction:
| Word | Frequency rank | ||v|| | ||v||2 | log(freq) |
|---|---|---|---|---|
| the | 1 | 4.2 | 17.6 | 18.4 |
| good | ~200 | 3.8 | 14.4 | 13.1 |
| algorithm | ~5K | 3.3 | 10.9 | 9.8 |
| pulsar | ~50K | 2.1 | 4.4 | 7.5 |
| quasar | ~80K | 1.8 | 3.2 | 6.9 |
The trend is clear: more frequent words have larger norms, and the relationship between ||v||2 and log(freq) is approximately linear, exactly as the theory predicts. The slope (2/σ2) depends on the training corpus and hyperparameters, but the linearity holds robustly.
python import numpy as np import matplotlib.pyplot as plt # Load pretrained Word2Vec vectors and word frequencies vectors = load_word2vec('GoogleNews-vectors.bin') freqs = load_frequencies('corpus_counts.txt') # Compute norms squared and log frequencies words = [w for w in freqs if w in vectors] norms_sq = np.array([np.linalg.norm(vectors[w])**2 for w in words]) log_freq = np.log(np.array([freqs[w] for w in words])) # Fit linear regression: ||v||^2 ≈ a * log(freq) + b coeffs = np.polyfit(log_freq, norms_sq, 1) # Typical result: R^2 ≈ 0.85-0.90, confirming the theory plt.scatter(log_freq, norms_sq, alpha=0.1, s=1) plt.plot(log_freq, np.polyval(coeffs, log_freq), 'r-') plt.xlabel('log(frequency)') plt.ylabel('||v_w||^2') plt.title('Frequency-Norm Law (Arora et al.)')
Simulated scatter plot of log(word frequency) vs. vector norm squared. The linear relationship predicted by the theory appears clearly. Adjust σ2 to see how the discourse variance affects the slope.
Let's put the entire theory together. The random walk model makes three testable predictions: (1) PMI ≈ dot product, (2) linear analogies emerge, (3) norm ∝ log frequency. We can simulate all three in a single interactive environment.
A complete generative model. Set the vocabulary size and discourse variance, then generate a synthetic corpus and see if the three predictions hold. Click "Generate Corpus" to run.
The paper validates the theory on real Word2Vec vectors trained on Wikipedia. The key measurements:
| Prediction | Measured | Theory |
|---|---|---|
| PMI vs. dot product | R2 ≈ 0.85 | Linear with slope σ2 |
| Norm vs. log frequency | R2 ≈ 0.88 | Linear with slope 2/σ2 |
| Partition function Zc | CV ≈ 0.08 | ≈ constant (CV should be small) |
| Analogy direction consistency | cos(diff vectors) ≈ 0.75 | High for clean relationships |
The R2 values of 0.85–0.88 are remarkably high for such a simple theory. The remaining 12–15% of variance is attributed to finite corpus effects, polysemy, and deviations from isotropy.
Play with the simulation above and notice:
Word2Vec (Mikolov et al., 2013): The algorithm whose success demands theoretical explanation. Arora et al. provide the first generative model that explains its key properties.
Neural Word Embedding as Implicit Matrix Factorization (Levy & Goldberg, 2014): Proved W · CT = PMI − log(k). Arora et al. explain why this factorization produces meaningful vectors.
GloVe (Pennington et al., 2014): An explicit matrix factorization approach. Arora et al.'s theory applies to GloVe too, since it also optimizes a function of co-occurrence statistics.
Latent Semantic Analysis (Deerwester et al., 1990): SVD on co-occurrence matrices. The random walk model retroactively explains why LSA works: it's factorizing a matrix that approximates PMI.
All-but-the-Top (Mu & Viswanath, 2018): Post-processing word vectors by removing top principal components. Directly motivated by the isotropy analysis: removing anisotropic components makes the PMI approximation tighter.
SIF embeddings (Arora et al., 2017): The same authors extended their theory to sentence embeddings. A simple weighted average of word vectors, with the first principal component removed, gives surprisingly strong sentence representations. This follows from the random walk model: the discourse vector at any point is well-approximated by the weighted average of nearby word vectors.
Contextual embeddings understanding: The discourse vector concept foreshadowed contextual embeddings (ELMo, BERT). The discourse vector is essentially a context-dependent representation — BERT just makes it explicit and learnable.
Evaluation with Arora's framework (Schnabel et al., 2015; Levy et al., 2015): The three papers together form a complete picture. Arora explains why word vectors work. Levy et al. explains what design choices matter. Schnabel et al. explains how to evaluate whether they worked. Reading them together gives a comprehensive understanding of the word embedding era.
Decontextualized word embeddings: Even in the BERT era, the Arora model remains relevant. Methods like Static BERT (averaging BERT's contextual representations across contexts to get a static vector) can be understood through the random walk lens: the average contextual representation is an estimate of the expected discourse vector weighted by the word's usage distribution.
Despite the elegance of the theory, several questions remain open:
These questions represent active research directions that build on the foundation this paper established. The random walk model may not be the final word on why embeddings work, but it's the first word — and it set the stage for everything that followed.
This paper is best understood alongside two contemporaries:
| Paper | Year | Key Question | Answer |
|---|---|---|---|
| Levy & Goldberg | 2014 | What does Skip-gram compute? | A shifted PMI matrix factorization |
| Levy, Goldberg & Dagan | 2015 | Why do neural methods win? | They don't — it's hyperparameters |
| Arora et al. | 2016 | Why does PMI capture meaning? | Discourse coherence creates a low-rank PMI structure |
Together, these three papers form a complete theoretical understanding of the word embedding era. Levy & Goldberg showed the what (PMI factorization). Levy et al. showed the how (hyperparameters matter most). Arora et al. showed the why (discourse coherence makes it work). By 2016, word embeddings were no longer magic — they were a well-understood technology with clear theoretical foundations.
The paper's random walk framework established a template for theoretical analysis in NLP:
This template has been applied to topic models, sentence embeddings, attention mechanisms, and even transformer representations. The specific model (random walk) may be superseded, but the methodology of explaining "why algorithms work" through generative modeling remains a cornerstone of NLP theory.
Even if you never train word embeddings from scratch (most practitioners use pretrained models), the random walk theory offers practical guidance:
The model suggests a design principle for any representation learning system: if you want representations with clean algebraic structure, design your training objective to factorize a PMI-like matrix. This principle has been applied beyond word embeddings:
In every case, the key ingredient is local coherence: nearby elements are related, and this relatedness creates co-occurrence statistics whose PMI is low-rank. The random walk model identifies this as the universal mechanism behind meaningful embeddings, regardless of the domain.
There is a deep but underappreciated connection between the random walk model and the attention mechanism in transformers. In the random walk model, the discourse vector ct determines which words are likely. In a transformer's attention mechanism, the query vector q determines which keys (and therefore values) are attended to.
The log-linear emission Pr[w|c] ∝ exp(c · vw) is structurally identical to the attention weight: a(q, k) ∝ exp(q · k / √d). Both use a softmax over dot products. The discourse vector plays the role of the query; word vectors play the role of keys.
This analogy suggests that transformers are, in a sense, learning the discourse vector at each position. The self-attention mechanism computes a context-dependent representation (analogous to ct) that determines which tokens are relevant (analogous to which words are likely). The random walk model is the simplest version of this computation — a stochastic, unlearned version of attention.
For intellectual honesty, let's be clear about the theory's boundaries:
These are important limitations. The random walk model explains the population-level statistics of word co-occurrence. It does not explain the sample-level behavior of training algorithms, the compositional structure of phrases, or the transfer properties of learned representations. These remain open theoretical questions.
Before Arora et al., Word2Vec was celebrated as a triumph of engineering over theory. The algorithm worked spectacularly, but nobody could explain why king − man + woman gave queen. The random walk model showed that this "unreasonable effectiveness" was, in fact, perfectly reasonable: it's a natural consequence of discourse coherence and log-linear word emission.
The paper's true significance is meta-scientific. It demonstrated that even for seemingly mysterious ML phenomena, principled theoretical analysis is possible and fruitful. Theory doesn't just explain — it generates new methods (SIF embeddings, All-but-the-Top), new diagnostics (frequency-norm plots), and new understanding (polysemy as vector averaging). In a field increasingly dominated by empirical scaling, this paper stands as a reminder that understanding why something works is the foundation for making it work better.
Text is generated by a slowly-walking discourse vector. Words near the current discourse are likely to be emitted. This creates co-occurrence patterns where PMI equals the dot product of word vectors (scaled by the walk's variance). Frequent words have large vector norms because they're emitted from many discourse positions. Analogies are linear because consistent relationships induce consistent log-probability ratios across all discourse contexts. And 300 dimensions suffice because discourse only has about 300 independent topical directions. One model, four predictions, all confirmed empirically. That's the power of theory.
SIF Sentence Embeddings (Arora et al., ICLR 2017): The random walk model implies that the discourse vector at any point is well-approximated by a weighted average of nearby word vectors. This leads to a simple sentence embedding method:
python # SIF sentence embeddings (Arora et al., 2017) # Derived directly from the random walk model def sif_embedding(sentences, word_vectors, word_freqs, a=1e-3): """Compute sentence embeddings via Smooth Inverse Frequency.""" embeddings = [] for sent in sentences: # Step 1: Weighted average (downweight frequent words) vecs = [] for w in sent: if w in word_vectors: weight = a / (a + word_freqs.get(w, 1e-6)) vecs.append(weight * word_vectors[w]) embeddings.append(np.mean(vecs, axis=0)) embeddings = np.array(embeddings) # Step 2: Remove first principal component # (removes common discourse direction = background topic) u, s, vt = np.linalg.svd(embeddings, full_matrices=False) pc1 = vt[0] embeddings -= np.outer(embeddings @ pc1, pc1) return embeddings
This method achieves surprisingly strong performance on sentence similarity tasks, outperforming many learned methods.
The connection to the random walk model is direct: the discourse vector ct generates the words in a sentence. By Bayes' rule, we can approximately recover ct from the observed words. The weighted average with downweighting of frequent words (which carry little information about ct) is the maximum likelihood estimate. Removing the first PC corresponds to removing the common discourse direction (the "background topic"), isolating sentence-specific meaning.
The random walk model also explains why 300 dimensions suffice for a 100K-word vocabulary. The discourse vector ct lives in d-dimensional space, and all word probabilities are determined by dot products with ct. If the discourse vector only explores a d-dimensional subspace (d << V), then only d dimensions of the word vectors matter for predicting co-occurrence.
How large is d in practice? It's determined by the "effective dimensionality" of discourse — how many independent topical directions exist. In a corpus covering politics, science, sports, technology, arts, etc., there might be 300–500 roughly independent topical axes. More dimensions than this capture noise rather than signal.
This also explains why increasing d beyond 500 rarely helps: you've already captured all the independent directions of discourse variation. Additional dimensions model corpus-specific noise that doesn't generalize.
There's a deeper reason why low dimensionality suffices. The mutual information between a word and its context is bounded:
This mutual information is finite and relatively small (a few bits for typical corpora). Since PMI ≈ σ2 vw · vc, the mutual information constrains the total "spread" of the word vectors. You only need enough dimensions to capture this finite mutual information. Beyond that, additional dimensions would encode zero-information noise.
Empirically, I(W;C) is typically 2–4 bits for natural language with window size 5. With d = 300 dimensions, each dimension carries about 0.01 bits of mutual information — a vanishingly small amount per dimension, but enough in aggregate to capture the essential structure.
Let's trace the full chain from language to learned vectors. This synthesis connects this paper to Levy & Goldberg (2014) and Levy, Goldberg & Dagan (2015):
The random walk model is not the only theory for word embeddings, but it's the most complete. Other theoretical perspectives include:
Arora et al.'s model is the first to derive the PMI connection, linear analogies, frequency-norm law, and low dimensionality from a single generative process. It unifies the empirical observations into a coherent theoretical framework.
In a follow-up work, Arora et al. (2018) extended the model to handle polysemous words — words with multiple meanings. The key result: a polysemous word's vector is approximately the weighted average of its sense vectors, weighted by sense frequency.
where p1 and p2 are the relative frequencies of the two senses. This explains a puzzling empirical observation: "bank" is somewhat near both "money" and "river," not fully in either cluster. The vector is a compromise weighted by how often each sense appears in the training corpus.
The paper also shows that you can recover individual sense vectors from the composite vector using sparse coding — decompose the word vector as a sparse combination of basis "atom" vectors that represent pure senses. This provides a principled method for word sense disambiguation using only word vectors.
The random walk model is extremely simplified. Real language has: hierarchical syntax (not captured), long-range dependencies (not captured), pragmatic implicature (not captured), metaphor (not captured). Yet the model's predictions are remarkably accurate. Why?
The answer is that word co-occurrence statistics are lossy projections of the full linguistic structure. When you compute PMI from a window of size 5, all syntactic structure within that window is collapsed into a single count. The random walk model captures exactly this collapsed, context-window-level statistics. It's not a model of language — it's a model of what co-occurrence statistics see of language. And for that purpose, it's excellent.
The discourse vector ct is, in some sense, a precursor to contextual embeddings. Think about it:
The conceptual leap from Arora to BERT: instead of having the discourse vector as a latent variable that generates words, make it an explicit representation that a neural network computes. The discourse vector becomes the hidden state of a transformer. The log-linear emission probability becomes the language model's softmax layer.
Let's collect the paper's main results in one place:
A polysemous word's vector is the weighted average of its sense vectors. Adjust the sense balance to see how the composite vector moves between the two clusters.