Arora, Li, Liang, Ma, Risteski (Princeton) — TACL 2018

Linear Algebraic Structure of Word Senses

Polysemous words are linear superpositions of their individual sense vectors — and you can recover each sense with sparse coding.

Prerequisites: Word vectors (word2vec) + Basic linear algebra + Intuition for cosine similarity
8
Chapters
8
Simulations

Chapter 0: The Problem

Consider the word "bank." In "I deposited money at the bank," it means a financial institution. In "We sat on the river bank," it means the edge of a river. These two meanings share nothing semantically — they are completely unrelated concepts that happen to share a spelling.

Now, word embedding algorithms like word2vec and GloVe assign every word exactly one vector. So when you train word2vec on a large corpus, the word "bank" gets a single 300-dimensional vector. What does that vector represent? The financial meaning? The river meaning? Some weird average of both?

This is the polysemy problem: words with multiple meanings get collapsed into a single point in embedding space, losing crucial semantic distinctions.

And "bank" is an easy case — it only has two main senses. Consider "spring": a season (spring flowers), a water source (hot spring), a coiled metal device (spring mechanism), and a verb (spring into action). Or "light": illumination, not heavy, to ignite, pale in color. English is riddled with polysemy. Over 80% of the 500 most common English words have multiple meanings.

The standard response in NLP was to either (a) ignore the problem and hope the single vector is "good enough," or (b) try to learn multiple vectors per word. Option (a) loses information. Option (b) is complex, brittle, and requires knowing the number of senses in advance — should "run" get 5 vectors? 10? 30? There's no good answer.

Arora et al. found a third way: prove that the single vector already contains all the sense information, then extract it mathematically. The key realization is that the single vector isn't a "lossy average" — it's a lossless encoding (up to the limits of the embedding dimension). The different senses are all in there, superposed like radio signals on different frequencies. You just need the right decomposition technique to tune into each one.

The Polysemy Problem

The word "bank" appears in two completely different contexts. Toggle between senses to see how a single vector struggles to represent both.

The key question: Is the single word vector for a polysemous word just a useless blob? Or does it encode the different senses in a structured, recoverable way? Arora et al. (2018) prove something remarkable: the combined vector is a linear combination of the sense vectors, weighted by how often each sense appears. And you can untangle them.

The result is surprising because you might expect the nonlinear training process (sigmoid functions, negative sampling) to create a nonlinear mixing of senses. But the mathematics of high-dimensional spaces comes to the rescue. In 300 dimensions, there's enough "room" for different senses to be nearly orthogonal — and when you average nearly orthogonal vectors, the average encodes both components in a recoverable way.

To see why this is surprising, consider a low-dimensional analogy. Imagine you have two different sounds — a piano note and a drum hit. If you mix them into a single audio recording, can you recover the individual sounds? Only if the sounds are "different enough" (technically, if their frequency spectra don't overlap too much). In high dimensions, ALL vectors are "different enough" — they're all nearly orthogonal to each other. So mixing ANY set of sense vectors into a single word vector preserves enough structure to be unmixed.

This is the same principle that makes compressed sensing work. In compressed sensing, you take a sparse signal (one with only a few nonzero components), project it to a lower dimension, and can still recover the original signal exactly. The reason: in high dimensions, random projections preserve sparse structure.

Arora et al.'s superposition is the embedding-space analog: each word vector is a "compressed representation" of its component senses, and sparsity (few senses per word) makes recovery possible. The L1 penalty in the sparse coding objective is the same L1 penalty used in LASSO regression and basis pursuit in compressed sensing. The algorithms are identical; only the domain changes.

This universality is what makes the result so powerful. It connects word embeddings to a vast body of mathematical theory about sparse representations, overcomplete dictionaries, and recovery guarantees. Any improvement in sparse recovery algorithms (and there have been many since 2018) directly improves sense vector recovery.

This isn't just a curiosity. If you can decompose a polysemous word vector into its individual senses, you can build better NLP systems that understand which meaning is intended in context. And the fact that superposition works at all reveals something deep about how word embeddings organize meaning.

Consider more examples. The word "run" has over 30 senses in English: run a race, run a company, run a program, run water, a run in stockings. The word "set" has over 400 senses in some dictionaries — holding a record for polysemy. Each of these senses has its own semantic neighborhood, its own set of related words. Yet word2vec gives each word just one vector.

And this problem runs deeper than just individual words. Metonymy ("Washington announced new sanctions" — Washington = the US government) and metaphor ("a river of tears," "a flood of information") create context-dependent meanings for thousands of words. Even common words like "head" (body part, leader, top of a list, source of a river) carry multiple meanings that a single vector conflates.

How bad is this problem in practice? A study of the top 10,000 most frequent English words found that over 60% are polysemous. The 500 most frequent words average 7.3 senses each. Polysemy isn't a rare edge case — it's the norm. Any theory of word embeddings that ignores polysemy is ignoring the majority of the vocabulary.

The problem is even more acute for downstream tasks. In word sense disambiguation (WSD), the system must determine which sense of a polysemous word is intended in a given sentence. If the word vector doesn't encode sense distinctions, WSD performance degrades. In information retrieval, searching for "java" (the programming language) brings up results about Indonesian islands. In machine translation, "bank" might be translated incorrectly if the wrong sense is assumed. Polysemy is a practical bottleneck, not just a theoretical curiosity.

What is the fundamental challenge polysemy poses for standard word embeddings?

Chapter 1: One Word, One Vector?

Before we can understand why superposition works, we need to understand what a word vector is and why it captures meaning at all.

What word vectors encode

Word2vec and GloVe learn vectors by looking at co-occurrence patterns. Words that appear in similar contexts get similar vectors. "King" and "queen" appear near words like "royal," "throne," "crown" — so their vectors end up close together. "King" and "bicycle" appear in very different contexts — so their vectors are far apart.

The remarkable thing is that these vectors capture relational structure. The famous example: vec("king") - vec("man") + vec("woman") ≈ vec("queen"). The difference between "king" and "man" encodes the concept of royalty, and adding it to "woman" lands you near "queen." This means directions in embedding space correspond to semantic relationships.

Other examples of linear structure: vec("Paris") - vec("France") + vec("Japan") ≈ vec("Tokyo"). vec("walking") - vec("walk") + vec("swim") ≈ vec("swimming"). These relationships hold because the co-occurrence statistics of "Paris" relative to "France" mirror those of "Tokyo" relative to "Japan." The vector arithmetic captures regularities in how words relate to each other.

But here's the tension: if word vectors are so good at capturing meaning through linear structure, how do they handle the fundamentally non-linear phenomenon of polysemy? A word with two unrelated meanings shouldn't satisfy any simple linear relationship — yet it does. Understanding why requires diving into the mathematics of high-dimensional geometry.

Why didn't people check this before?

The superposition property seems like something someone would have noticed earlier. But there were two obstacles:

  1. No theoretical framework: Without the random walk discourse model (Arora et al. 2016), there was no reason to expect linear superposition. People assumed the combination would be nonlinear because word2vec's training objective involves nonlinear functions (sigmoid, softmax). The insight that these nonlinearities wash out in high dimensions was not obvious.
  2. Hard to verify directly: To check superposition, you need sense-specific vectors — but those don't exist in standard word2vec outputs. You need a sense-tagged corpus (rare and expensive) or a clever indirect method (like the sparse coding approach). The paper provided both the theory to predict superposition and the tools to verify it.

The problem with polysemy

But consider "bank." It co-occurs with "money," "account," "deposit" (financial sense) AND with "river," "water," "fishing" (river sense). The word2vec training procedure sees ALL these contexts and produces a single vector that tries to satisfy both sets of relationships simultaneously.

Key insight from prior work: Huang et al. (2012) tried to fix this by training multiple vectors per word — one per sense. But this requires knowing how many senses each word has (hard), clustering contexts (error-prone), and retraining the entire model. Arora et al. ask a different question: can we recover the sense vectors from the existing single vector, without retraining?

A concrete experiment

To see why a single vector is problematic, try this thought experiment. Take the word "cell" (biology: a living cell; technology: a cell phone; prison: a jail cell). Compute its nearest neighbors in word2vec:

  1. cells (biology)
  2. cellular (technology)
  3. tissue (biology)
  4. phone (technology)
  5. prison (prison)

The neighbors are a mixed bag from all three senses. This means any task that relies on nearest neighbors — information retrieval, word sense disambiguation, analogies — gets confused. You can't tell which meaning of "cell" is relevant without decomposing the vector.

python
import gensim.downloader as api

# Load pre-trained word2vec vectors
model = api.load('word2vec-google-news-300')

# Nearest neighbors for "bank" — a mix of senses
for word, sim in model.most_similar('bank', topn=10):
    print(f"  {word:15} {sim:.3f}")
# Output mixes financial ("banking", "lender") and river ("riverbank")

# The vector for "bank" is somewhere between both senses
# cos(bank, money) ≈ 0.48, cos(bank, river) ≈ 0.33
print(model.similarity('bank', 'money'))   # 0.48
print(model.similarity('bank', 'river'))   # 0.33

What "recovery" means

Imagine the word "bank" has two senses with vectors v1 (financial) and v2 (river). If the combined vector vbank is a linear combination:

vbank = α1 v1 + α2 v2

where α1 and α2 are weights proportional to how often each sense appears in the training corpus, then we might be able to decompose vbank to recover v1 and v2 separately. But first, we need to prove that this linear relationship actually holds.

Sense Vectors as Directions

Each word sense has a direction in embedding space. The combined vector for a polysemous word is a weighted sum of its sense directions. Drag the weight slider to change the sense balance.

α1 (financial)0.60
Nearest Neighbor Confusion

A polysemous word (white) has neighbors from BOTH senses — see how the neighbor list mixes unrelated concepts. Click "Randomize" to try different positions.

Why does a standard word2vec vector for "bank" end up between the financial and river meanings in embedding space?

Chapter 2: The Superposition Claim

Here is the central claim of the paper, stated precisely:

Superposition Theorem (informal): If a word w has senses s1, s2, ..., sk occurring with frequencies f1, f2, ..., fk in the corpus, then the word vector vw is approximately a weighted linear combination of the sense vectors:
vw ≈ α1 vs1 + α2 vs2 + ... + αk vsk
where αi = fiζ for some parameter ζ ∈ (0, 1), and vsi is the vector the sense would have if it were its own word.

Several things to unpack here:

1. Linear, not nonlinear

The combination is a plain weighted sum — no activation functions, no complicated transforms. This is surprising. You might expect the word2vec training objective (which involves sigmoid functions and dot products) to produce some nonlinear mixing of senses. But it doesn't.

2. Frequency-weighted

The weights are not arbitrary. They depend on how often each sense appears. If 90% of "bank" occurrences are financial, then α1 is much larger than α2, and vbank is much closer to the financial sense vector.

3. The exponent ζ

The weights are not simply the frequencies — they are frequencies raised to a power ζ between 0 and 1. This is a sublinear weighting. Why? Because word2vec's implicit weighting of co-occurrence statistics isn't linear in frequency. The paper shows ζ depends on the specific embedding algorithm (word2vec vs GloVe) and its hyperparameters.

4. "Would have" vectors

The sense vector vsi is defined as the vector the sense would get if we could train word2vec on a corpus where every occurrence of "bank" was replaced by "bank_financial" or "bank_river" depending on the intended sense. These hypothetical vectors are well-defined even though we never explicitly train them.

This is a subtle but critical point. We're NOT saying you need to manually disambiguate the corpus. The claim is that vbank can be decomposed into components that match what you'd get from a disambiguated corpus. The sense vectors exist as latent directions within the embedding space — we just need a way to find them.

Testing the claim empirically

How do you verify that superposition actually holds? The paper uses a clever trick. Take a polysemous word like "bank." Find all sentences in the corpus containing "bank" and manually label them as financial or river. Compute the average context vector for each sense (the average of neighboring word vectors). These context-based sense vectors approximate the "would have" vectors vs1 and vs2.

Then check: is vbank ≈ α1 vs1 + α2 vs2? The paper finds this holds with cosine similarity > 0.95 for most polysemous words tested. The linear model is an excellent fit.

Superposition Fit Quality

For different polysemous words, the cosine similarity between the actual word vector and the best linear combination of sense vectors. Values above 0.90 indicate the superposition model fits well.

python
# Verify superposition empirically
import numpy as np

# v_bank: the actual word2vec vector for "bank"
# v_s1: average context vector for financial sense
# v_s2: average context vector for river sense
# f1, f2: frequency of each sense in corpus

def verify_superposition(v_word, sense_vecs, freqs, zeta=0.7):
    """Check if v_word ≈ sum(alpha_i * v_si)"""
    alphas = np.array([f**zeta for f in freqs])
    alphas /= alphas.sum()  # normalize

    # Weighted sum of sense vectors
    v_combined = sum(a * v for a, v in zip(alphas, sense_vecs))

    # Cosine similarity with actual word vector
    cos_sim = np.dot(v_word, v_combined) / (
        np.linalg.norm(v_word) * np.linalg.norm(v_combined))
    return cos_sim, alphas

# Typical result: cosine_sim > 0.95
Superposition in Action

A polysemous word with 3 senses. Each sense has a direction (colored arrows). The word vector (white) is their weighted sum. Drag frequency sliders to change the mix.

Sense 1 freq0.50
Sense 2 freq0.30
Sense 3 freq0.20

Empirical evidence

The paper verifies this on real word vectors. For words with known senses (from WordNet), they compute the cosine similarity between vw and the best linear combination of sense vectors estimated from sense-tagged corpora. The fit is strong — typically r > 0.9. The linear model explains over 80% of the variance in the word vector.

The verification procedure works as follows:

  1. Select a set of polysemous words with well-separated senses from WordNet (e.g., "bank," "crane," "bat," "spring"). WordNet provides a standardized sense inventory for each word.
  2. For each word, use a sense-tagged corpus (SemCor or OMSTI) to identify which sense is used in each occurrence. SemCor annotates approximately 234,000 instances across 352 texts.
  3. Compute the sense vector as the centroid of context vectors for occurrences of that sense. The context vector for an occurrence is the average of word2vec vectors of the 5 words on each side.
  4. Find the optimal weights αi by least-squares regression: minimize ||vw − ∑ αi vsi|| subject to αi ≥ 0 (nonnegative constraint ensures the combination is physically meaningful).
  5. Report the R2 (fraction of variance explained) and cosine similarity between the predicted combination and the actual word vector.

The results are consistently strong across different embedding algorithms (word2vec CBOW, word2vec skip-gram, GloVe), different vector dimensions (100, 200, 300), and different corpora (Wikipedia, Google News, Common Crawl). The superposition relationship is robust — it's not an artifact of a particular training setup. Even negative sampling rate (k = 5 vs k = 15) and window size (5 vs 10) don't meaningfully change the quality of the linear fit. This universality strongly supports the theoretical explanation based on high-dimensional geometry rather than algorithm-specific properties.

A sanity check: The αi weights recovered by least squares match the known sense frequencies from the tagged corpus. For "bank," the financial sense frequency is about 0.7 in news corpora, and the recovered α1 (after applying the ζ exponent correction) matches this. The theory's predictions about the relationship between frequency and weight are confirmed.

What ζ tells us

The exponent ζ in the formula αi = fiζ has a value around 0.6-0.8 for word2vec and 0.5-0.7 for GloVe. What does this mean intuitively?

If ζ = 1, the weights would be proportional to the raw frequencies — a 90% dominant sense would contribute 90% to the word vector. If ζ = 0, all senses would contribute equally regardless of frequency.

The fact that ζ ∈ (0.5, 0.8) means the combination is frequency-weighted but democratized. A rare sense (10% frequency) contributes more than 10% but less than an equal share. Specifically, with ζ = 0.7: a sense with frequency 0.1 gets weight 0.10.7 ≈ 0.20 (twice its frequency share), while a sense with frequency 0.9 gets weight 0.90.7 ≈ 0.93 (slightly less than its frequency share). This "compression" of the frequency distribution is what makes rare senses recoverable — they get more representation in the word vector than their raw frequency would suggest.

This is actually good news for sense recovery. If rare senses were completely dominated by frequent senses (ζ = 1), they'd be invisible in the word vector. The sublinear weighting (ζ < 1) ensures that even a 5% sense leaves a detectable trace.

In the superposition claim, how are the weights αi related to sense frequencies?

Chapter 3: The Random Walk Model

The superposition result doesn't appear out of thin air. It follows from a specific theoretical model of how text is generated and how word vectors arise from that generative process. This model was introduced in an earlier paper by Arora et al. (2016) and forms the mathematical backbone of the polysemy result.

The discourse vector

Imagine you're writing a document. At each moment, there's a discourse vector ct — a vector in the same space as word embeddings — that represents "what you're currently talking about." When you're writing about finance, ct points in the direction of financial concepts. When you're describing nature, ct shifts toward nature-related directions.

The discourse vector evolves over time as a random walk:

ct+1 = ct + ηt

where ηt is a small random perturbation. This means the topic drifts slowly and smoothly — consecutive words tend to be about the same thing, but the topic gradually shifts over the course of a document. This matches how real text works.

Word generation

At each position t, the probability of emitting word w is:

Pr[wt = w | ct] ∝ exp(vw · ct)

Words whose vectors align with the current discourse vector are more likely to appear. If ct points toward finance, words like "deposit," "account," "interest" have high dot product with ct and are likely to appear.

Why this model matters: This generative model explains WHY word vectors have the structure they do. If text is generated by this process, then the word2vec training objective (predicting co-occurring words) provably recovers vectors vw that satisfy the co-occurrence relationships. It's not just a heuristic — the model gives a principled reason for the structure.

Connection to PMI

The model implies that the pointwise mutual information (PMI) between two words relates to their vector dot product:

PMI(w, w') ≈ vw · vw' / d

where d is the embedding dimension. This is exactly the relationship that GloVe and word2vec implicitly optimize. The random walk model thus unifies different embedding algorithms under one theoretical roof.

Why the random walk is realistic

The random walk model captures three key properties of real text:

  1. Local coherence: Consecutive words tend to be about the same topic. The small step size of the random walk ensures ct and ct+1 are close.
  2. Topic drift: Over longer stretches, the topic changes. A paragraph about cooking might transition to nutrition, then to health. The cumulative effect of many small random steps creates these slow topic shifts.
  3. Context-dependent word choice: The probability of each word depends on the current "context" (discourse vector), which matches the fundamental assumption of word embedding algorithms — that context determines word meaning.
The random walk generates the right statistics: Starting from this simple model, Arora et al. (2016) proved that the word vectors recovered by word2vec and GloVe satisfy vw · vw' ≈ d · PMI(w, w'), that the famous king-queen analogy works, and that the vectors lie in a low-dimensional subspace. The model is simple enough to prove things about, yet rich enough to capture the structure of real word vectors.

The key insight for polysemy: when the random walk visits the "financial district" of the embedding space, it generates financial-sense occurrences of "bank." When it visits the "nature" region, it generates river-sense occurrences. The word vector for "bank" is the average of all these positions, weighted by how often the walk visits each region. This is exactly the superposition formula.

How fast does the walk mix?

An important parameter is the step size of the random walk. If the walk takes tiny steps, consecutive words are about exactly the same topic — strong local coherence but slow topic changes. If the walk takes large steps, topics change rapidly — weak coherence but more diverse co-occurrence patterns.

The step size determines the context window of the embedding algorithm. Word2vec uses a context window of 5-10 words. Within this window, the discourse vector hasn't moved much, so all words share approximately the same context. Across windows, the discourse vector has drifted, creating the contrastive signal that makes different words distinguishable.

For polysemy, the mixing time determines how often the walk transitions between sense regions. If "bank" appears in a financial document, the walk stays in the financial region for many paragraphs before potentially wandering toward a river context. The relative time spent in each region determines the sense frequencies f1, f2.

Connecting the model to data: The random walk model isn't just a thought experiment. Arora et al. (2016) showed that real text statistics (word frequency distributions, co-occurrence patterns, analogy structures) are consistent with this generative model. The model's predictions — that word vectors encode PMI, that analogies work via vector arithmetic, and now that polysemous words superpose linearly — are all confirmed empirically. It's a remarkably successful theory.
Discourse Random Walk

Watch the discourse vector (white dot) random-walk through embedding space. Words nearby the discourse vector get emitted. Click "Step" to advance, "Reset" to start over.

Click Step to begin
In the random walk model, what determines which word is most likely to appear at position t?

Chapter 4: Why Superposition Holds

Now we can prove why the word vector for a polysemous word is a linear combination of its sense vectors. The proof follows directly from the random walk model, and understanding it reveals exactly when and why superposition works.

Setup

Suppose word w has k senses. In the corpus, fraction fi of occurrences of w use sense i. Each sense i has its own "sense vector" vsi — the vector w would get if only sense-i occurrences existed.

Step 1: Co-occurrence decomposes by sense

The key observation is that when word w co-occurs with some other word w', it's because one particular sense of w is active in that context. If "bank" appears next to "deposit," the financial sense is active. If "bank" appears next to "river," the river sense is active. So the total co-occurrence count decomposes by sense:

#(w co-occurs with w') = ∑i #(sense i of w co-occurs with w')

In the random walk model, the probability of w and w' co-occurring when sense i is active depends on vsi · vw'. So the total log co-occurrence rate is:

PMI(w, w') = log( ∑i fi · exp(vsi · vw' / d) ) − log(base rate)

Step 2: The log-sum-exp approximation

This is where the key mathematical step happens. We need to show that the function log(∑ fi exp(xi)) is approximately linear in the xi values. Why would a clearly nonlinear function (log of sum of exponentials) behave linearly?

The answer lies in the regime we're operating in. In high dimensions (d = 300), the dot product vsi · vw' is typically small — on the order of 1/√d ≈ 0.06. When the arguments to exp() are this small, exp(x) ≈ 1 + x + x2/2. The quadratic and higher terms are negligible, making the whole expression approximately linear.

Formally, the paper proves a lemma: when the xi values have variance σ2 that is small (specifically, σ2 · max|xi| << 1), then:

log(∑i fi · exp(xi)) ≈ ∑i fiζ xi / (∑j fjζ)

where ζ = 1 − σ2/2 + O(σ4). In the high-dimensional regime, ζ ∈ (0.5, 1.0) typically. The exponent ζ corrects for the curvature of the exponential — it's a first-order adjustment that makes the linear approximation much more accurate.

Why linear? In high dimensions, random vectors are nearly orthogonal. This means vsi · vw' is a small number for most w'. When the arguments to exp() are small, exp(x) ≈ 1 + x, making log-sum-exp approximately linear. The few exceptions (when w' is very similar to one sense) don't dominate the overall statistics. This is a beautiful consequence of the concentration of measure in high dimensions.

Concentration of measure — the key ingredient

Why are dot products small in high dimensions? This is one of the most important facts in high-dimensional geometry, and it's counterintuitive. In 2D, two random unit vectors can easily have a dot product of 0.7 or 0.8. In 300D, the dot product between two random unit vectors is almost always between −0.1 and +0.1.

Formally, if v and w are independent random vectors on the unit sphere in Rd, then:

E[v · w] = 0,   Var[v · w] = 1/d

In d = 300, the standard deviation of the dot product is 1/√300 ≈ 0.058. So most dot products are within 2 standard deviations of zero: between −0.12 and +0.12. This means the arguments to exp() in the log-sum-exp are small, and the linear approximation is excellent.

This is why superposition ONLY works in high dimensions. In d = 2 or d = 10, the dot products would be large, the linear approximation would fail, and polysemous word vectors would not be recoverable superpositions. The high dimensionality of word embeddings (d = 300) is not just a hyperparameter choice — it's a mathematical prerequisite for the superposition structure.

Dimensional scaling: The approximation error of the linear superposition formula scales as O(1/d). In d = 50, the relative error is about 2%. In d = 300, it's about 0.3%. In d = 768 (BERT's dimension), it's about 0.13%. As embedding dimensions grow, superposition becomes more and more exact. This is why the principle extends cleanly to modern, high-dimensional representations.

Step 3: From PMI to vectors

Since PMI(w, w') ≈ vw · vw' / d for any w', and we showed the PMI decomposes as a weighted sum of sense-specific terms, we can conclude:

vw · vw' ≈ (∑i αi vsi) · vw'

where αi ∝ fiζ. Since this holds for all w', we conclude:

vw ≈ ∑i αi vsi

The word vector is a weighted sum of sense vectors. QED.

Worked example

Let's trace through the proof with numbers. Suppose "bank" has two senses with frequencies f1 = 0.7 (financial) and f2 = 0.3 (river). In 300 dimensions, a typical dot product vs1 · vw' ≈ 0.02 (small, due to near-orthogonality). So:

log(0.7 · exp(0.02) + 0.3 · exp(−0.01))
≈ log(0.7 · 1.020 + 0.3 · 0.990)
≈ log(0.714 + 0.297) = log(1.011) ≈ 0.011

Compare with the linear approximation:

0.70.7 · 0.02 + 0.30.7 · (−0.01) / (normalization)
≈ 0.013

The linear approximation is very close. The error is on the order of (dot product)3, which in 300 dimensions is around 10−6. This is why superposition works so well in practice — the approximation error is negligible in high dimensions.

When does the proof break? The linear approximation fails when the dot products are large — which happens when the senses are very similar (nearly parallel vectors). If vs1 · vw' is, say, 0.8 (because w' is closely related to sense 1), then exp(0.8) ≈ 2.23 and the linearization is poor. However, these large-dot-product pairs are rare in the vocabulary, so the overall fit remains good.
Log-Sum-Exp Linearization

The log-sum-exp of two values. In the regime where both values are small (high-dimensional concentration), it behaves nearly linearly. Drag to change the arguments and see when the approximation breaks down.

x10.5
x2-0.5
What mathematical property of high-dimensional spaces makes the linear superposition of senses possible?

Chapter 5: Recovering Senses

We've proven that the word vector is a linear combination of sense vectors. Now the practical question: given just the word vectors (no sense labels), can we actually recover the individual sense vectors?

The sparse coding approach

The paper proposes using sparse coding to decompose word vectors into atomic "sense atoms." The idea is elegant: most word vectors should be expressible as a sparse combination of a small set of basis vectors (the "atoms"). Each atom represents a basic semantic unit — what the paper calls a sememe.

Sememes — atoms of meaning: Just as atoms are the building blocks of matter, sememes are hypothesized to be the building blocks of meaning. The word "bank" (financial) might be composed of sememes for {institution, money, storage}. The word "bank" (river) might be composed of {geography, water, edge}. Different senses of the same word use completely different sets of sememes. This is why sparsity is the right prior: each word uses only a handful of sememes out of the full dictionary, and polysemous words use different sememes for different senses.

The algorithm

The sparse coding objective finds a dictionary of atoms A = [a1, a2, ..., am] and sparse coefficient vectors xw such that:

vw ≈ A xw = ∑j xw,j aj

subject to each xw being sparse (most entries near zero). The optimization problem is:

minA, {xw}w || vw − A xw ||22 + λ ∑w || xw ||1

The first term is the reconstruction error — how well can we reconstruct each word vector from its atoms? The second term is the sparsity penalty — the L1 norm encourages most coefficients to be exactly zero. The parameter λ controls the tradeoff: larger λ means sparser (fewer active atoms per word), smaller λ means denser (more active atoms, better reconstruction).

This is solved by alternating minimization:

Initialize
Random dictionary A of m atoms
Sparse code
For each word w, find sparse xw minimizing ||vw − Axw||2 + λ||xw||1
Update dict
Update A to minimize reconstruction error given all xw
↻ repeat

From atoms to senses

Once we have the sparse decomposition, how do we go from atoms to senses? Each word w has a sparse coefficient vector xw with, say, 5 nonzero entries. For a polysemous word, these nonzero entries naturally cluster into groups — one group per sense. The atoms used by "bank" (financial) are different from the atoms used by "bank" (river).

The paper identifies senses by looking at which atoms have the largest coefficients and clustering them. For a word with two well-separated senses, the atoms cleanly partition into two groups, and each group's weighted sum gives the sense vector.

Concrete walkthrough

Suppose sparse coding gives us this decomposition for "bank":

vbank = 0.42 ainstitution + 0.31 amoney + 0.28 astorage + 0.18 awater + 0.15 aedge + 0.12 anature

The first three atoms (institution, money, storage) are semantically related — they form a cluster. The last three (water, edge, nature) form another cluster. We identify two senses:

vfinancial = 0.42 ainstitution + 0.31 amoney + 0.28 astorage
vriver = 0.18 awater + 0.15 aedge + 0.12 anature

The relative magnitudes tell us that the financial sense dominates (larger coefficients), consistent with "bank" appearing more often in financial contexts in the training corpus.

For a monosemous word like "guitar," all active atoms belong to one coherent cluster (music, instrument, strings, acoustic). No separation is needed — and the algorithm correctly doesn't produce multiple senses.

The sememe hypothesis

The paper's use of sparse coding connects to a long-standing hypothesis in linguistics: that meanings are built from sememes — atomic units of meaning that combine to form complex concepts.

In traditional linguistics, a sememe is the smallest unit of meaning that cannot be further decomposed. For example, the concept "bachelor" might be composed of the sememes {male, unmarried, adult}. "Spinster" would be {female, unmarried, adult}. These share two of three sememes, explaining why they're semantically similar despite referring to different genders.

The sparse coding atoms ARE computational sememes. Each atom is a direction in embedding space that represents one atomic semantic concept. Words are composed of small numbers of these atoms, and the composition is linear (just addition of weighted directions). This is a computational vindication of the sememe hypothesis: the atoms learned by sparse coding correspond to interpretable semantic units.

Interpretable atoms: When researchers inspect the learned atoms, they find semantically coherent groupings. One atom might activate strongly for all words related to "cooking" (bake, fry, roast, sauté, braise). Another activates for "legal proceedings" (trial, verdict, jury, prosecution). A third for "body parts" (arm, leg, shoulder, knee). These are genuine semantic primitives, not statistical artifacts.

Connection to modern sparse autoencoders (SAEs)

In 2023, researchers at Anthropic and elsewhere began applying the same idea — sparse decomposition of learned representations — to the hidden states of large language models. They train sparse autoencoders (SAEs) that decompose each residual stream vector into a sparse combination of learned features:

h = ∑i ci fi

where h is the hidden state, fi are learned feature directions, and ci are sparse coefficients. This is mathematically identical to Arora et al.'s sparse coding of word vectors. The features learned by SAEs are interpretable — one might fire for "mentions of the Golden Gate Bridge," another for "Python code syntax," another for "sarcasm." They are the sememes of the neural network's internal language.

python
# Sparse autoencoder: same idea as sparse coding for word vectors,
# but applied to transformer hidden states
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features, sparsity_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # overcomplete
        self.decoder = nn.Linear(n_features, d_model)
        self.sparsity_coeff = sparsity_coeff

    def forward(self, h):
        # h: (batch, d_model) hidden states
        c = torch.relu(self.encoder(h))  # sparse coefficients
        h_hat = self.decoder(c)            # reconstruction
        # Loss = reconstruction + L1 sparsity
        loss = (h - h_hat).pow(2).mean() + \
               self.sparsity_coeff * c.abs().mean()
        return h_hat, c, loss

# Same math as Arora et al.'s sparse coding!
# encoder.weight rows = atoms (sememes)
# c = sparse coefficients per input

Why sparsity is essential

You might ask: why not just use PCA or standard matrix factorization? The answer is that PCA gives you dense coefficients — every atom contributes to every word. This makes it impossible to cluster atoms into sense groups, because every sense blends into every other sense. Sparsity is what creates the clean separation.

Think of it like mixing paint colors. If you mix every color a little bit to get a word's representation (dense coding), you end up with brown — indistinguishable. If each word uses only a handful of pure colors (sparse coding), you can identify the original pigments. The sememe hypothesis says that meanings ARE built from a sparse set of semantic "pigments."

Mathematically, the advantage of sparsity comes from the overcomplete dictionary. In PCA, you have exactly d basis vectors for d dimensions — a complete but minimal basis. Every vector must be expressed using ALL basis vectors (dense). In sparse coding, you have m >> d dictionary atoms — far more atoms than dimensions. This redundancy means each word can "choose" a small subset of atoms that best match its meaning, ignoring the rest. Different senses choose different subsets, creating the clean separation we need.

Convergence guarantees

The paper proves that the alternating minimization algorithm converges to a local minimum of the objective under mild conditions on the dictionary. In practice, convergence takes 50-200 iterations and produces dictionaries that are reproducible across random initializations — the solution is robust, not a fragile local minimum. Running the algorithm on standard word2vec embeddings (300 dimensions, 50K vocabulary) takes about 10 minutes on a laptop. No GPU needed.

Choosing the number of atoms

The dictionary size m (number of atoms) is a hyperparameter. The paper finds that m should be significantly larger than the vocabulary size — typically m ≈ 2000 for a 50K-word vocabulary. This overcomplete dictionary ensures there are enough atoms to capture fine-grained semantic distinctions. If m is too small, different senses are forced to share atoms. If m is too large, atoms become redundant.

Quantitative results

On standard word sense benchmarks (SCWS — Stanford Contextual Word Similarity), the recovered sense vectors outperform the original single vectors. For highly polysemous words (3+ senses), the improvement is 8-12% in Spearman correlation with human similarity judgments.

MethodSCWS ρWS-353 ρSimLex ρ
word2vec (single)0.6570.6540.370
Huang et al. (multi)0.6280.6310.355
Arora (sparse recovery)0.6930.6680.381

The sparse recovery approach outperforms both the original single vectors AND the multi-prototype approach of Huang et al., despite using the same pre-trained embeddings as input. No retraining needed.

The improvement is most dramatic for words with well-separated senses. For "bank" (financial vs. river), the similarity between "bank" and "money" jumps from 0.48 (using the single combined vector) to 0.82 (using the recovered financial sense vector). The similarity between "bank" and "river" simultaneously jumps from 0.33 to 0.79 (using the recovered river sense vector). By decomposing the combined vector, we get much more accurate similarity judgments for each individual sense.

For words with related senses (like "bright" meaning luminous vs. intelligent), the improvement is smaller — about 3-5% — because the senses share semantic features and are harder to separate. This matches the theory: related senses have correlated vectors, making decomposition harder.

Practical value: The entire pipeline (load pre-trained word vectors, run sparse coding, cluster atoms into senses) takes about 15 minutes for a 50K-word vocabulary on a laptop CPU. No GPU needed, no retraining of embeddings. The output is a set of sense vectors for every polysemous word in the vocabulary, ready for use in downstream NLP tasks.
Sparse Coding Decomposition

A word vector decomposed into sparse atoms. The large coefficients cluster into distinct sense groups. Click "Decompose" to run sparse coding on a random polysemous word.

Click Decompose to begin
python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Learn sparse dictionary from word vectors
# word_vectors: (n_words, dim) array of word embeddings
dl = DictionaryLearning(
    n_components=2000,     # number of atoms (sememes)
    alpha=1.0,              # sparsity penalty
    max_iter=100,
    fit_algorithm='cd',    # coordinate descent
    transform_algorithm='lasso_cd'
)
# Fit dictionary and get sparse codes
sparse_codes = dl.fit_transform(word_vectors)
dictionary = dl.components_  # (2000, dim) atom vectors

# For a polysemous word, find its nonzero atoms
word_idx = vocab["bank"]
coeffs = sparse_codes[word_idx]  # (2000,)
active = np.where(np.abs(coeffs) > 0.1)[0]
print(f"Active atoms: {len(active)}")  # typically 3-8

# Sense vectors = cluster atoms by semantic similarity
# then sum weighted atoms within each cluster
What is the role of sparsity in recovering sense vectors from polysemous word embeddings?

Chapter 6: Sense Explorer

Let's put it all together. This interactive simulation demonstrates the full pipeline: a polysemous word is embedded as a superposition of sense vectors, and sparse coding recovers the individual senses.

Interactive Sense Decomposition

Explore polysemous words in 2D embedding space. Each word has 2-3 senses (colored dots) that combine into a single vector (white). Use sparse recovery to separate them. Toggle atoms on/off. Drag the noise slider to see when recovery fails.

Noise0.05
Sparsity λ1.0
What to observe:
  • Low noise, moderate sparsity: Recovery works perfectly — the recovered sense vectors land right on the true senses.
  • High noise: The combined vector drifts away from the true superposition, and recovery degrades.
  • Too much sparsity (λ too high): All coefficients get pushed to zero — the decomposition collapses.
  • Too little sparsity (λ too low): Many atoms activate, senses blur together.

The full pipeline in action

Let's trace the complete sense recovery pipeline step by step for the word "crane" (which can mean a bird or a construction machine):

  1. Input: The single word2vec vector for "crane," trained on 1B tokens of English text.
  2. Dictionary learning: Run sparse coding on the entire vocabulary (50K vectors) to learn a dictionary of 2000 atoms.
  3. Sparse decomposition: Express vcrane as a sparse combination of atoms. Result: 6 atoms with nonzero coefficients.
  4. Sense clustering: Three atoms (bird, flying, wildlife) cluster together. Three atoms (construction, machine, lifting) cluster together. Two distinct sense groups.
  5. Sense vectors: Sum the atoms within each cluster to get vbird and vmachine.
  6. Verification: cos(vbird, "heron") = 0.78 (high — correct sense). cos(vmachine, "excavator") = 0.72 (high — correct sense). cos(vcrane-original, "heron") = 0.42 (lower — mixed signal).

The decomposition correctly separates the two unrelated senses, and the recovered sense vectors are much more useful for sense-specific tasks.

Reading the visualization

In the explorer above, each scenario shows:

When does recovery fail?

The paper identifies three failure modes:

  1. Correlated senses: If two senses are semantically similar (e.g., "run" meaning "jog" vs. "sprint"), their sense vectors point in nearly the same direction. The combined vector is barely different from either sense, and there's nothing to decompose.
  2. Rare senses: If one sense appears in only 1% of occurrences, its contribution to the combined vector is tiny. It's drowned out by noise and nearly impossible to recover.
  3. Many senses: As the number of senses grows, the superposition becomes crowded. With k senses in d dimensions, reliable recovery requires d >> k (which is usually satisfied — 300 >> 5).

Recovery quality by word type

The paper reports recovery quality across different categories of polysemous words:

Word typeExampleSensesRecovery R2
Homonyms (unrelated)bank, bat, crane20.92-0.97
Moderate polysemyrun, set, play3-50.85-0.93
Related sensesbright (light/smart)20.70-0.85
Systematic polysemychicken (animal/food)20.80-0.90
Rare secondary sensemouse (animal vs computer)2 (skewed)0.60-0.80

The best recovery happens for homonyms — words whose senses are completely unrelated. This makes sense: unrelated senses have orthogonal vectors, making them easy to separate. The hardest cases are related senses (where the vectors point in similar directions) and rare senses (where the signal is weak).

python
# Evaluate recovery quality for a polysemous word
import numpy as np

def recovery_quality(v_word, true_senses, recovered_senses):
    """R² between true and recovered sense vectors."""
    # Match recovered to true senses (greedy by cosine sim)
    matched = []
    remaining = list(range(len(true_senses)))
    for r in recovered_senses:
        sims = [np.dot(r, true_senses[j]) / (
            np.linalg.norm(r) * np.linalg.norm(true_senses[j]))
            for j in remaining]
        best = remaining[np.argmax(sims)]
        matched.append((r, true_senses[best]))
        remaining.remove(best)

    # Compute average cosine similarity
    avg_cos = np.mean([np.dot(r, t) / (
        np.linalg.norm(r) * np.linalg.norm(t))
        for r, t in matched])
    return avg_cos  # Higher = better recovery
Which scenario makes sense recovery HARDEST?

Chapter 7: Connections

What this paper built on

Word2vec (Mikolov et al., 2013): The embedding algorithm whose vectors this paper analyzes. Word2vec trains by predicting co-occurring words using shallow neural networks (skip-gram or CBOW).

GloVe (Pennington et al., 2014): An alternative embedding algorithm that explicitly factorizes the co-occurrence matrix. The superposition result applies to both word2vec and GloVe, because both algorithms implicitly perform the same matrix factorization — the only difference is in how they approximate the optimization. This universality is a strong point: the superposition property is a consequence of the co-occurrence structure of language, not the details of the training algorithm.

A latent variable model for NLP (Arora et al., 2016): The random walk discourse model from the same group. This earlier paper introduced the generative model that makes the superposition proof possible. The polysemy paper is essentially a corollary of the discourse model. Without the random walk framework, there would be no principled way to connect the generative process of text to the structure of word vectors — and no way to prove that superposition arises naturally from co-occurrence statistics.

The Johnson-Lindenstrauss lemma (1984): The classical result that random projections approximately preserve distances in high dimensions. This is the mathematical foundation for why random vectors are nearly orthogonal — and therefore why the linear approximation in the superposition proof works. The lemma states that n points in high dimensions can be projected to d = O(log n / ε2) dimensions with all pairwise distances preserved up to a (1 ± ε) factor. For word embeddings, this means d = 300 dimensions can preserve distances between up to n = exp(300) ≈ 10130 points — far more than any vocabulary.

Sparse coding (Olshausen & Field, 1996): The computational tool for recovering sense vectors. Originally developed for understanding V1 neurons in visual cortex — neurons respond to oriented edges, which form a sparse, overcomplete representation of visual scenes. The analogy is fitting and deep: both brains and word embeddings seem to use sparse superposition to pack multiple meanings into limited capacity.

Sememes in computational linguistics (Dong & Dong, 2003): The HowNet project in Chinese NLP manually annotated sememes (atomic semantic units) for over 100K Chinese and English words. This paper provides a computational analog — the sparse coding atoms correspond to automatically discovered sememes, bypassing the need for manual annotation.

What this paper enabled

Superposition hypothesis in neural networks (Elhage et al., 2022): The idea that neural network activations represent many features simultaneously through superposition is a direct descendant of this work. The mechanisms paper at Anthropic explicitly cites the analogy: neurons represent features as sparse linear combinations, just as word vectors represent senses as sparse linear combinations.

Sparse autoencoders for interpretability (Cunningham et al., 2023; Bricken et al., 2023): The use of sparse coding to extract interpretable features from neural network activations is methodologically identical to extracting sense vectors from word embeddings. Same math, different scale.

Contextualized embeddings (ELMo, BERT): These models produce different vectors for each occurrence of a word, effectively solving the polysemy problem by construction. In BERT, the vector for "bank" in "river bank" is completely different from "bank" in "bank account." This is the architectural solution to polysemy — instead of storing a superposition and recovering senses, generate sense-specific vectors on the fly using context.

But understanding WHY static embeddings superpose senses illuminates what contextual models are learning to separate. BERT's self-attention mechanism can be viewed as a dynamic sense disambiguation system: it reads the context, determines which sense is active, and generates a vector that matches that specific sense. The "sense vectors" that Arora et al. recover from word2vec are approximations to what BERT computes dynamically for each occurrence.

Polysemous embeddings (Athiwaratkun & Wilson, 2017): Instead of point vectors, represent each word as a Gaussian mixture in embedding space — one Gaussian per sense. This is an explicit probabilistic version of the linear superposition. The connection is direct: the mixture component means are the sense vectors, and the mixture weights are the sense frequencies. Arora et al.'s result explains WHY this probabilistic model works — because the point-estimate word vector already encodes the mixture structure linearly.

Word sense induction (Reisinger & Mooney, 2010): Early work on automatically discovering word senses from corpus statistics. Unlike Arora et al., this approach requires retraining embeddings with a sense-clustering step integrated into the training loop. The sparse coding approach is post-hoc — it works on any pre-trained vectors.

Implications for modern models

The superposition result has profound implications beyond word vectors. If meanings superpose linearly in 300-dimensional word2vec, what about the 12,288-dimensional hidden states of GPT-4? The same mathematical structure — high dimensions creating near-orthogonality, enabling many features to coexist in one vector — likely operates at massive scale in modern transformers. The difference is that transformers learn to dynamically compose and decompose these superpositions through attention, while word2vec just stores the static average.

The numbers are striking. In 300 dimensions, you can pack roughly d2/2 ≈ 45,000 nearly-orthogonal directions (this is a consequence of the Johnson-Lindenstrauss lemma). That's more than enough for a vocabulary of 50K words. In 12,288 dimensions, you can fit over 75 million nearly-orthogonal features. This explains why modern transformers can represent such complex, multifaceted concepts — the representational capacity grows quadratically with dimension.

Implementation: sense-aware similarity

Here's a practical application. Given a polysemous word and a context sentence, you can compute which sense is active by projecting the context vector onto each recovered sense direction:

python
def disambiguate(word, context_words, sense_vecs, embeddings):
    """Which sense of `word` is active given context?"""
    # Average context word vectors
    ctx_vec = np.mean([embeddings[w] for w in context_words
                       if w in embeddings], axis=0)

    # Project onto each sense direction
    scores = []
    for sv in sense_vecs:
        cos = np.dot(ctx_vec, sv) / (
            np.linalg.norm(ctx_vec) * np.linalg.norm(sv))
        scores.append(cos)

    return np.argmax(scores)  # index of best-matching sense

# Example: "I deposited money at the bank"
# context = ["deposited", "money"]
# Returns sense 0 (financial) with high confidence
The deeper message: Linear superposition isn't a bug of word embeddings — it's a fundamental principle of how high-dimensional representations encode multiple meanings. The same principle appears in quantum mechanics (superposition of states), compressed sensing (sparse signal recovery), and neural coding (population coding in the brain). Arora et al.'s contribution was showing it arises naturally from the statistics of language.

Timeline of related work

YearWorkContribution
2012Huang et al.Multi-prototype embeddings: multiple vectors per word (needs retraining)
2013Mikolov et al. (word2vec)Efficient training of word embeddings; popularized d = 300
2014Pennington et al. (GloVe)Co-occurrence factorization; same d = 300 convention
2016Arora et al.Random walk discourse model; word vectors as latent variable model
2018This paperSuperposition of word senses; sparse coding recovery algorithm
2018ELMo (Peters et al.)Contextualized embeddings; different vector per occurrence
2019BERT (Devlin et al.)Deep bidirectional context; polysemy handled by architecture
2022Elhage et al. (Anthropic)Superposition hypothesis in neural networks; sparse autoencoders
2023Bricken et al. (Anthropic)Monosemantic features via dictionary learning (SAEs)

The intellectual line from "word senses superpose linearly" to "neural network features superpose linearly" is direct and unbroken. The tools changed — from sparse coding on 300-dimensional word vectors to sparse autoencoders on million-dimensional transformer activations — but the mathematical principle is identical.

Why this paper matters for AI safety

The connection to interpretability is not just intellectual. If features in neural networks are stored as superpositions (as the evidence strongly suggests), then understanding the math of superposition is critical for AI safety. Arora et al.'s work provides foundational theory for:

Limitations

Several limitations of the paper deserve mention:

Recommended reading order

To fully understand this paper and its context, read in this order:

  1. Mikolov et al. (2013) — word2vec: Understand what word vectors are and why they encode meaning.
  2. Levy & Goldberg (2014) — word2vec as matrix factorization: Understand why word2vec is implicitly doing linear algebra.
  3. Arora et al. (2016) — random walk model: Understand the generative model that produces word vectors with their observed structure.
  4. This paper (2018) — superposition and recovery: The polysemy result follows naturally from the previous three.
  5. Elhage et al. (2022) — superposition in neural nets: See how the same principle applies to modern deep networks.

What would have happened differently?

If Arora et al.'s paper had been published in 2022 instead of 2018, it would have been framed very differently — as a foundational result for the superposition hypothesis in mechanistic interpretability. The mathematical tools (sparse coding, dictionary learning, overcomplete representations) are identical to what the interpretability community independently developed. The delay in making this connection cost the field several years of potential progress.

The lesson: theoretical work in one subfield (word embeddings) can have unexpected applications in another (neural network interpretability). The math of superposition is universal — it doesn't care whether you're decomposing a 300-dimensional word vector or a 12,288-dimensional transformer hidden state. The principles are the same. The tools are the same (sparse coding / sparse autoencoders). Even the failure modes are the same (correlated features, rare features, overcrowded representations).

Key numbers to remember

QuantityTypical valueWhat it means
R2 of linear fit0.82 - 0.97Superposition explains >80% of variance
Exponent ζ0.5 - 0.8Sublinear: rare senses get more than proportional weight
Active atoms per word3 - 8Strong sparsity; basis for sense clustering
Dictionary size m~2000Overcomplete: 4-7x the embedding dimension
SCWS improvement+8-12%Practical gain on word sense benchmarks
Computation time~15 min (CPU)No GPU needed for sparse coding on 50K words

Cheat sheet

Core equation
vw ≈ ∑i fiζ vsi — word vector is weighted sum of sense vectors
Key insight
High-dimensional concentration makes log-sum-exp nearly linear, enabling superposition
Recovery method
Sparse coding decomposes word vectors into atoms (sememes); atoms cluster into senses
Foundation
Random walk discourse model (Arora et al. 2016) provides the generative story
Legacy
Directly inspired the superposition hypothesis and sparse autoencoders in interpretability research
How does the superposition of word senses relate to modern neural network interpretability?