Polysemous words are linear superpositions of their individual sense vectors — and you can recover each sense with sparse coding.
Consider the word "bank." In "I deposited money at the bank," it means a financial institution. In "We sat on the river bank," it means the edge of a river. These two meanings share nothing semantically — they are completely unrelated concepts that happen to share a spelling.
Now, word embedding algorithms like word2vec and GloVe assign every word exactly one vector. So when you train word2vec on a large corpus, the word "bank" gets a single 300-dimensional vector. What does that vector represent? The financial meaning? The river meaning? Some weird average of both?
This is the polysemy problem: words with multiple meanings get collapsed into a single point in embedding space, losing crucial semantic distinctions.
And "bank" is an easy case — it only has two main senses. Consider "spring": a season (spring flowers), a water source (hot spring), a coiled metal device (spring mechanism), and a verb (spring into action). Or "light": illumination, not heavy, to ignite, pale in color. English is riddled with polysemy. Over 80% of the 500 most common English words have multiple meanings.
The standard response in NLP was to either (a) ignore the problem and hope the single vector is "good enough," or (b) try to learn multiple vectors per word. Option (a) loses information. Option (b) is complex, brittle, and requires knowing the number of senses in advance — should "run" get 5 vectors? 10? 30? There's no good answer.
Arora et al. found a third way: prove that the single vector already contains all the sense information, then extract it mathematically. The key realization is that the single vector isn't a "lossy average" — it's a lossless encoding (up to the limits of the embedding dimension). The different senses are all in there, superposed like radio signals on different frequencies. You just need the right decomposition technique to tune into each one.
The word "bank" appears in two completely different contexts. Toggle between senses to see how a single vector struggles to represent both.
The result is surprising because you might expect the nonlinear training process (sigmoid functions, negative sampling) to create a nonlinear mixing of senses. But the mathematics of high-dimensional spaces comes to the rescue. In 300 dimensions, there's enough "room" for different senses to be nearly orthogonal — and when you average nearly orthogonal vectors, the average encodes both components in a recoverable way.
To see why this is surprising, consider a low-dimensional analogy. Imagine you have two different sounds — a piano note and a drum hit. If you mix them into a single audio recording, can you recover the individual sounds? Only if the sounds are "different enough" (technically, if their frequency spectra don't overlap too much). In high dimensions, ALL vectors are "different enough" — they're all nearly orthogonal to each other. So mixing ANY set of sense vectors into a single word vector preserves enough structure to be unmixed.
This is the same principle that makes compressed sensing work. In compressed sensing, you take a sparse signal (one with only a few nonzero components), project it to a lower dimension, and can still recover the original signal exactly. The reason: in high dimensions, random projections preserve sparse structure.
Arora et al.'s superposition is the embedding-space analog: each word vector is a "compressed representation" of its component senses, and sparsity (few senses per word) makes recovery possible. The L1 penalty in the sparse coding objective is the same L1 penalty used in LASSO regression and basis pursuit in compressed sensing. The algorithms are identical; only the domain changes.
This universality is what makes the result so powerful. It connects word embeddings to a vast body of mathematical theory about sparse representations, overcomplete dictionaries, and recovery guarantees. Any improvement in sparse recovery algorithms (and there have been many since 2018) directly improves sense vector recovery.
This isn't just a curiosity. If you can decompose a polysemous word vector into its individual senses, you can build better NLP systems that understand which meaning is intended in context. And the fact that superposition works at all reveals something deep about how word embeddings organize meaning.
Consider more examples. The word "run" has over 30 senses in English: run a race, run a company, run a program, run water, a run in stockings. The word "set" has over 400 senses in some dictionaries — holding a record for polysemy. Each of these senses has its own semantic neighborhood, its own set of related words. Yet word2vec gives each word just one vector.
And this problem runs deeper than just individual words. Metonymy ("Washington announced new sanctions" — Washington = the US government) and metaphor ("a river of tears," "a flood of information") create context-dependent meanings for thousands of words. Even common words like "head" (body part, leader, top of a list, source of a river) carry multiple meanings that a single vector conflates.
How bad is this problem in practice? A study of the top 10,000 most frequent English words found that over 60% are polysemous. The 500 most frequent words average 7.3 senses each. Polysemy isn't a rare edge case — it's the norm. Any theory of word embeddings that ignores polysemy is ignoring the majority of the vocabulary.
The problem is even more acute for downstream tasks. In word sense disambiguation (WSD), the system must determine which sense of a polysemous word is intended in a given sentence. If the word vector doesn't encode sense distinctions, WSD performance degrades. In information retrieval, searching for "java" (the programming language) brings up results about Indonesian islands. In machine translation, "bank" might be translated incorrectly if the wrong sense is assumed. Polysemy is a practical bottleneck, not just a theoretical curiosity.
Before we can understand why superposition works, we need to understand what a word vector is and why it captures meaning at all.
Word2vec and GloVe learn vectors by looking at co-occurrence patterns. Words that appear in similar contexts get similar vectors. "King" and "queen" appear near words like "royal," "throne," "crown" — so their vectors end up close together. "King" and "bicycle" appear in very different contexts — so their vectors are far apart.
The remarkable thing is that these vectors capture relational structure. The famous example: vec("king") - vec("man") + vec("woman") ≈ vec("queen"). The difference between "king" and "man" encodes the concept of royalty, and adding it to "woman" lands you near "queen." This means directions in embedding space correspond to semantic relationships.
Other examples of linear structure: vec("Paris") - vec("France") + vec("Japan") ≈ vec("Tokyo"). vec("walking") - vec("walk") + vec("swim") ≈ vec("swimming"). These relationships hold because the co-occurrence statistics of "Paris" relative to "France" mirror those of "Tokyo" relative to "Japan." The vector arithmetic captures regularities in how words relate to each other.
But here's the tension: if word vectors are so good at capturing meaning through linear structure, how do they handle the fundamentally non-linear phenomenon of polysemy? A word with two unrelated meanings shouldn't satisfy any simple linear relationship — yet it does. Understanding why requires diving into the mathematics of high-dimensional geometry.
The superposition property seems like something someone would have noticed earlier. But there were two obstacles:
But consider "bank." It co-occurs with "money," "account," "deposit" (financial sense) AND with "river," "water," "fishing" (river sense). The word2vec training procedure sees ALL these contexts and produces a single vector that tries to satisfy both sets of relationships simultaneously.
To see why a single vector is problematic, try this thought experiment. Take the word "cell" (biology: a living cell; technology: a cell phone; prison: a jail cell). Compute its nearest neighbors in word2vec:
The neighbors are a mixed bag from all three senses. This means any task that relies on nearest neighbors — information retrieval, word sense disambiguation, analogies — gets confused. You can't tell which meaning of "cell" is relevant without decomposing the vector.
python import gensim.downloader as api # Load pre-trained word2vec vectors model = api.load('word2vec-google-news-300') # Nearest neighbors for "bank" — a mix of senses for word, sim in model.most_similar('bank', topn=10): print(f" {word:15} {sim:.3f}") # Output mixes financial ("banking", "lender") and river ("riverbank") # The vector for "bank" is somewhere between both senses # cos(bank, money) ≈ 0.48, cos(bank, river) ≈ 0.33 print(model.similarity('bank', 'money')) # 0.48 print(model.similarity('bank', 'river')) # 0.33
Imagine the word "bank" has two senses with vectors v1 (financial) and v2 (river). If the combined vector vbank is a linear combination:
where α1 and α2 are weights proportional to how often each sense appears in the training corpus, then we might be able to decompose vbank to recover v1 and v2 separately. But first, we need to prove that this linear relationship actually holds.
Each word sense has a direction in embedding space. The combined vector for a polysemous word is a weighted sum of its sense directions. Drag the weight slider to change the sense balance.
A polysemous word (white) has neighbors from BOTH senses — see how the neighbor list mixes unrelated concepts. Click "Randomize" to try different positions.
Here is the central claim of the paper, stated precisely:
Several things to unpack here:
The combination is a plain weighted sum — no activation functions, no complicated transforms. This is surprising. You might expect the word2vec training objective (which involves sigmoid functions and dot products) to produce some nonlinear mixing of senses. But it doesn't.
The weights are not arbitrary. They depend on how often each sense appears. If 90% of "bank" occurrences are financial, then α1 is much larger than α2, and vbank is much closer to the financial sense vector.
The weights are not simply the frequencies — they are frequencies raised to a power ζ between 0 and 1. This is a sublinear weighting. Why? Because word2vec's implicit weighting of co-occurrence statistics isn't linear in frequency. The paper shows ζ depends on the specific embedding algorithm (word2vec vs GloVe) and its hyperparameters.
The sense vector vsi is defined as the vector the sense would get if we could train word2vec on a corpus where every occurrence of "bank" was replaced by "bank_financial" or "bank_river" depending on the intended sense. These hypothetical vectors are well-defined even though we never explicitly train them.
This is a subtle but critical point. We're NOT saying you need to manually disambiguate the corpus. The claim is that vbank can be decomposed into components that match what you'd get from a disambiguated corpus. The sense vectors exist as latent directions within the embedding space — we just need a way to find them.
How do you verify that superposition actually holds? The paper uses a clever trick. Take a polysemous word like "bank." Find all sentences in the corpus containing "bank" and manually label them as financial or river. Compute the average context vector for each sense (the average of neighboring word vectors). These context-based sense vectors approximate the "would have" vectors vs1 and vs2.
Then check: is vbank ≈ α1 vs1 + α2 vs2? The paper finds this holds with cosine similarity > 0.95 for most polysemous words tested. The linear model is an excellent fit.
For different polysemous words, the cosine similarity between the actual word vector and the best linear combination of sense vectors. Values above 0.90 indicate the superposition model fits well.
python # Verify superposition empirically import numpy as np # v_bank: the actual word2vec vector for "bank" # v_s1: average context vector for financial sense # v_s2: average context vector for river sense # f1, f2: frequency of each sense in corpus def verify_superposition(v_word, sense_vecs, freqs, zeta=0.7): """Check if v_word ≈ sum(alpha_i * v_si)""" alphas = np.array([f**zeta for f in freqs]) alphas /= alphas.sum() # normalize # Weighted sum of sense vectors v_combined = sum(a * v for a, v in zip(alphas, sense_vecs)) # Cosine similarity with actual word vector cos_sim = np.dot(v_word, v_combined) / ( np.linalg.norm(v_word) * np.linalg.norm(v_combined)) return cos_sim, alphas # Typical result: cosine_sim > 0.95
A polysemous word with 3 senses. Each sense has a direction (colored arrows). The word vector (white) is their weighted sum. Drag frequency sliders to change the mix.
The paper verifies this on real word vectors. For words with known senses (from WordNet), they compute the cosine similarity between vw and the best linear combination of sense vectors estimated from sense-tagged corpora. The fit is strong — typically r > 0.9. The linear model explains over 80% of the variance in the word vector.
The verification procedure works as follows:
The results are consistently strong across different embedding algorithms (word2vec CBOW, word2vec skip-gram, GloVe), different vector dimensions (100, 200, 300), and different corpora (Wikipedia, Google News, Common Crawl). The superposition relationship is robust — it's not an artifact of a particular training setup. Even negative sampling rate (k = 5 vs k = 15) and window size (5 vs 10) don't meaningfully change the quality of the linear fit. This universality strongly supports the theoretical explanation based on high-dimensional geometry rather than algorithm-specific properties.
The exponent ζ in the formula αi = fiζ has a value around 0.6-0.8 for word2vec and 0.5-0.7 for GloVe. What does this mean intuitively?
If ζ = 1, the weights would be proportional to the raw frequencies — a 90% dominant sense would contribute 90% to the word vector. If ζ = 0, all senses would contribute equally regardless of frequency.
The fact that ζ ∈ (0.5, 0.8) means the combination is frequency-weighted but democratized. A rare sense (10% frequency) contributes more than 10% but less than an equal share. Specifically, with ζ = 0.7: a sense with frequency 0.1 gets weight 0.10.7 ≈ 0.20 (twice its frequency share), while a sense with frequency 0.9 gets weight 0.90.7 ≈ 0.93 (slightly less than its frequency share). This "compression" of the frequency distribution is what makes rare senses recoverable — they get more representation in the word vector than their raw frequency would suggest.
This is actually good news for sense recovery. If rare senses were completely dominated by frequent senses (ζ = 1), they'd be invisible in the word vector. The sublinear weighting (ζ < 1) ensures that even a 5% sense leaves a detectable trace.
The superposition result doesn't appear out of thin air. It follows from a specific theoretical model of how text is generated and how word vectors arise from that generative process. This model was introduced in an earlier paper by Arora et al. (2016) and forms the mathematical backbone of the polysemy result.
Imagine you're writing a document. At each moment, there's a discourse vector ct — a vector in the same space as word embeddings — that represents "what you're currently talking about." When you're writing about finance, ct points in the direction of financial concepts. When you're describing nature, ct shifts toward nature-related directions.
The discourse vector evolves over time as a random walk:
where ηt is a small random perturbation. This means the topic drifts slowly and smoothly — consecutive words tend to be about the same thing, but the topic gradually shifts over the course of a document. This matches how real text works.
At each position t, the probability of emitting word w is:
Words whose vectors align with the current discourse vector are more likely to appear. If ct points toward finance, words like "deposit," "account," "interest" have high dot product with ct and are likely to appear.
The model implies that the pointwise mutual information (PMI) between two words relates to their vector dot product:
where d is the embedding dimension. This is exactly the relationship that GloVe and word2vec implicitly optimize. The random walk model thus unifies different embedding algorithms under one theoretical roof.
The random walk model captures three key properties of real text:
The key insight for polysemy: when the random walk visits the "financial district" of the embedding space, it generates financial-sense occurrences of "bank." When it visits the "nature" region, it generates river-sense occurrences. The word vector for "bank" is the average of all these positions, weighted by how often the walk visits each region. This is exactly the superposition formula.
An important parameter is the step size of the random walk. If the walk takes tiny steps, consecutive words are about exactly the same topic — strong local coherence but slow topic changes. If the walk takes large steps, topics change rapidly — weak coherence but more diverse co-occurrence patterns.
The step size determines the context window of the embedding algorithm. Word2vec uses a context window of 5-10 words. Within this window, the discourse vector hasn't moved much, so all words share approximately the same context. Across windows, the discourse vector has drifted, creating the contrastive signal that makes different words distinguishable.
For polysemy, the mixing time determines how often the walk transitions between sense regions. If "bank" appears in a financial document, the walk stays in the financial region for many paragraphs before potentially wandering toward a river context. The relative time spent in each region determines the sense frequencies f1, f2.
Watch the discourse vector (white dot) random-walk through embedding space. Words nearby the discourse vector get emitted. Click "Step" to advance, "Reset" to start over.
Now we can prove why the word vector for a polysemous word is a linear combination of its sense vectors. The proof follows directly from the random walk model, and understanding it reveals exactly when and why superposition works.
Suppose word w has k senses. In the corpus, fraction fi of occurrences of w use sense i. Each sense i has its own "sense vector" vsi — the vector w would get if only sense-i occurrences existed.
The key observation is that when word w co-occurs with some other word w', it's because one particular sense of w is active in that context. If "bank" appears next to "deposit," the financial sense is active. If "bank" appears next to "river," the river sense is active. So the total co-occurrence count decomposes by sense:
In the random walk model, the probability of w and w' co-occurring when sense i is active depends on vsi · vw'. So the total log co-occurrence rate is:
This is where the key mathematical step happens. We need to show that the function log(∑ fi exp(xi)) is approximately linear in the xi values. Why would a clearly nonlinear function (log of sum of exponentials) behave linearly?
The answer lies in the regime we're operating in. In high dimensions (d = 300), the dot product vsi · vw' is typically small — on the order of 1/√d ≈ 0.06. When the arguments to exp() are this small, exp(x) ≈ 1 + x + x2/2. The quadratic and higher terms are negligible, making the whole expression approximately linear.
Formally, the paper proves a lemma: when the xi values have variance σ2 that is small (specifically, σ2 · max|xi| << 1), then:
where ζ = 1 − σ2/2 + O(σ4). In the high-dimensional regime, ζ ∈ (0.5, 1.0) typically. The exponent ζ corrects for the curvature of the exponential — it's a first-order adjustment that makes the linear approximation much more accurate.
Why are dot products small in high dimensions? This is one of the most important facts in high-dimensional geometry, and it's counterintuitive. In 2D, two random unit vectors can easily have a dot product of 0.7 or 0.8. In 300D, the dot product between two random unit vectors is almost always between −0.1 and +0.1.
Formally, if v and w are independent random vectors on the unit sphere in Rd, then:
In d = 300, the standard deviation of the dot product is 1/√300 ≈ 0.058. So most dot products are within 2 standard deviations of zero: between −0.12 and +0.12. This means the arguments to exp() in the log-sum-exp are small, and the linear approximation is excellent.
This is why superposition ONLY works in high dimensions. In d = 2 or d = 10, the dot products would be large, the linear approximation would fail, and polysemous word vectors would not be recoverable superpositions. The high dimensionality of word embeddings (d = 300) is not just a hyperparameter choice — it's a mathematical prerequisite for the superposition structure.
Since PMI(w, w') ≈ vw · vw' / d for any w', and we showed the PMI decomposes as a weighted sum of sense-specific terms, we can conclude:
where αi ∝ fiζ. Since this holds for all w', we conclude:
The word vector is a weighted sum of sense vectors. QED.
Let's trace through the proof with numbers. Suppose "bank" has two senses with frequencies f1 = 0.7 (financial) and f2 = 0.3 (river). In 300 dimensions, a typical dot product vs1 · vw' ≈ 0.02 (small, due to near-orthogonality). So:
Compare with the linear approximation:
The linear approximation is very close. The error is on the order of (dot product)3, which in 300 dimensions is around 10−6. This is why superposition works so well in practice — the approximation error is negligible in high dimensions.
The log-sum-exp of two values. In the regime where both values are small (high-dimensional concentration), it behaves nearly linearly. Drag to change the arguments and see when the approximation breaks down.
We've proven that the word vector is a linear combination of sense vectors. Now the practical question: given just the word vectors (no sense labels), can we actually recover the individual sense vectors?
The paper proposes using sparse coding to decompose word vectors into atomic "sense atoms." The idea is elegant: most word vectors should be expressible as a sparse combination of a small set of basis vectors (the "atoms"). Each atom represents a basic semantic unit — what the paper calls a sememe.
The sparse coding objective finds a dictionary of atoms A = [a1, a2, ..., am] and sparse coefficient vectors xw such that:
subject to each xw being sparse (most entries near zero). The optimization problem is:
The first term is the reconstruction error — how well can we reconstruct each word vector from its atoms? The second term is the sparsity penalty — the L1 norm encourages most coefficients to be exactly zero. The parameter λ controls the tradeoff: larger λ means sparser (fewer active atoms per word), smaller λ means denser (more active atoms, better reconstruction).
This is solved by alternating minimization:
Once we have the sparse decomposition, how do we go from atoms to senses? Each word w has a sparse coefficient vector xw with, say, 5 nonzero entries. For a polysemous word, these nonzero entries naturally cluster into groups — one group per sense. The atoms used by "bank" (financial) are different from the atoms used by "bank" (river).
The paper identifies senses by looking at which atoms have the largest coefficients and clustering them. For a word with two well-separated senses, the atoms cleanly partition into two groups, and each group's weighted sum gives the sense vector.
Suppose sparse coding gives us this decomposition for "bank":
The first three atoms (institution, money, storage) are semantically related — they form a cluster. The last three (water, edge, nature) form another cluster. We identify two senses:
The relative magnitudes tell us that the financial sense dominates (larger coefficients), consistent with "bank" appearing more often in financial contexts in the training corpus.
For a monosemous word like "guitar," all active atoms belong to one coherent cluster (music, instrument, strings, acoustic). No separation is needed — and the algorithm correctly doesn't produce multiple senses.
The paper's use of sparse coding connects to a long-standing hypothesis in linguistics: that meanings are built from sememes — atomic units of meaning that combine to form complex concepts.
In traditional linguistics, a sememe is the smallest unit of meaning that cannot be further decomposed. For example, the concept "bachelor" might be composed of the sememes {male, unmarried, adult}. "Spinster" would be {female, unmarried, adult}. These share two of three sememes, explaining why they're semantically similar despite referring to different genders.
The sparse coding atoms ARE computational sememes. Each atom is a direction in embedding space that represents one atomic semantic concept. Words are composed of small numbers of these atoms, and the composition is linear (just addition of weighted directions). This is a computational vindication of the sememe hypothesis: the atoms learned by sparse coding correspond to interpretable semantic units.
In 2023, researchers at Anthropic and elsewhere began applying the same idea — sparse decomposition of learned representations — to the hidden states of large language models. They train sparse autoencoders (SAEs) that decompose each residual stream vector into a sparse combination of learned features:
where h is the hidden state, fi are learned feature directions, and ci are sparse coefficients. This is mathematically identical to Arora et al.'s sparse coding of word vectors. The features learned by SAEs are interpretable — one might fire for "mentions of the Golden Gate Bridge," another for "Python code syntax," another for "sarcasm." They are the sememes of the neural network's internal language.
python # Sparse autoencoder: same idea as sparse coding for word vectors, # but applied to transformer hidden states import torch import torch.nn as nn class SparseAutoencoder(nn.Module): def __init__(self, d_model, n_features, sparsity_coeff=1e-3): super().__init__() self.encoder = nn.Linear(d_model, n_features) # overcomplete self.decoder = nn.Linear(n_features, d_model) self.sparsity_coeff = sparsity_coeff def forward(self, h): # h: (batch, d_model) hidden states c = torch.relu(self.encoder(h)) # sparse coefficients h_hat = self.decoder(c) # reconstruction # Loss = reconstruction + L1 sparsity loss = (h - h_hat).pow(2).mean() + \ self.sparsity_coeff * c.abs().mean() return h_hat, c, loss # Same math as Arora et al.'s sparse coding! # encoder.weight rows = atoms (sememes) # c = sparse coefficients per input
You might ask: why not just use PCA or standard matrix factorization? The answer is that PCA gives you dense coefficients — every atom contributes to every word. This makes it impossible to cluster atoms into sense groups, because every sense blends into every other sense. Sparsity is what creates the clean separation.
Think of it like mixing paint colors. If you mix every color a little bit to get a word's representation (dense coding), you end up with brown — indistinguishable. If each word uses only a handful of pure colors (sparse coding), you can identify the original pigments. The sememe hypothesis says that meanings ARE built from a sparse set of semantic "pigments."
Mathematically, the advantage of sparsity comes from the overcomplete dictionary. In PCA, you have exactly d basis vectors for d dimensions — a complete but minimal basis. Every vector must be expressed using ALL basis vectors (dense). In sparse coding, you have m >> d dictionary atoms — far more atoms than dimensions. This redundancy means each word can "choose" a small subset of atoms that best match its meaning, ignoring the rest. Different senses choose different subsets, creating the clean separation we need.
The paper proves that the alternating minimization algorithm converges to a local minimum of the objective under mild conditions on the dictionary. In practice, convergence takes 50-200 iterations and produces dictionaries that are reproducible across random initializations — the solution is robust, not a fragile local minimum. Running the algorithm on standard word2vec embeddings (300 dimensions, 50K vocabulary) takes about 10 minutes on a laptop. No GPU needed.
The dictionary size m (number of atoms) is a hyperparameter. The paper finds that m should be significantly larger than the vocabulary size — typically m ≈ 2000 for a 50K-word vocabulary. This overcomplete dictionary ensures there are enough atoms to capture fine-grained semantic distinctions. If m is too small, different senses are forced to share atoms. If m is too large, atoms become redundant.
On standard word sense benchmarks (SCWS — Stanford Contextual Word Similarity), the recovered sense vectors outperform the original single vectors. For highly polysemous words (3+ senses), the improvement is 8-12% in Spearman correlation with human similarity judgments.
| Method | SCWS ρ | WS-353 ρ | SimLex ρ |
|---|---|---|---|
| word2vec (single) | 0.657 | 0.654 | 0.370 |
| Huang et al. (multi) | 0.628 | 0.631 | 0.355 |
| Arora (sparse recovery) | 0.693 | 0.668 | 0.381 |
The sparse recovery approach outperforms both the original single vectors AND the multi-prototype approach of Huang et al., despite using the same pre-trained embeddings as input. No retraining needed.
The improvement is most dramatic for words with well-separated senses. For "bank" (financial vs. river), the similarity between "bank" and "money" jumps from 0.48 (using the single combined vector) to 0.82 (using the recovered financial sense vector). The similarity between "bank" and "river" simultaneously jumps from 0.33 to 0.79 (using the recovered river sense vector). By decomposing the combined vector, we get much more accurate similarity judgments for each individual sense.
For words with related senses (like "bright" meaning luminous vs. intelligent), the improvement is smaller — about 3-5% — because the senses share semantic features and are harder to separate. This matches the theory: related senses have correlated vectors, making decomposition harder.
A word vector decomposed into sparse atoms. The large coefficients cluster into distinct sense groups. Click "Decompose" to run sparse coding on a random polysemous word.
python import numpy as np from sklearn.decomposition import DictionaryLearning # Learn sparse dictionary from word vectors # word_vectors: (n_words, dim) array of word embeddings dl = DictionaryLearning( n_components=2000, # number of atoms (sememes) alpha=1.0, # sparsity penalty max_iter=100, fit_algorithm='cd', # coordinate descent transform_algorithm='lasso_cd' ) # Fit dictionary and get sparse codes sparse_codes = dl.fit_transform(word_vectors) dictionary = dl.components_ # (2000, dim) atom vectors # For a polysemous word, find its nonzero atoms word_idx = vocab["bank"] coeffs = sparse_codes[word_idx] # (2000,) active = np.where(np.abs(coeffs) > 0.1)[0] print(f"Active atoms: {len(active)}") # typically 3-8 # Sense vectors = cluster atoms by semantic similarity # then sum weighted atoms within each cluster
Let's put it all together. This interactive simulation demonstrates the full pipeline: a polysemous word is embedded as a superposition of sense vectors, and sparse coding recovers the individual senses.
Explore polysemous words in 2D embedding space. Each word has 2-3 senses (colored dots) that combine into a single vector (white). Use sparse recovery to separate them. Toggle atoms on/off. Drag the noise slider to see when recovery fails.
Let's trace the complete sense recovery pipeline step by step for the word "crane" (which can mean a bird or a construction machine):
The decomposition correctly separates the two unrelated senses, and the recovered sense vectors are much more useful for sense-specific tasks.
In the explorer above, each scenario shows:
The paper identifies three failure modes:
The paper reports recovery quality across different categories of polysemous words:
| Word type | Example | Senses | Recovery R2 |
|---|---|---|---|
| Homonyms (unrelated) | bank, bat, crane | 2 | 0.92-0.97 |
| Moderate polysemy | run, set, play | 3-5 | 0.85-0.93 |
| Related senses | bright (light/smart) | 2 | 0.70-0.85 |
| Systematic polysemy | chicken (animal/food) | 2 | 0.80-0.90 |
| Rare secondary sense | mouse (animal vs computer) | 2 (skewed) | 0.60-0.80 |
The best recovery happens for homonyms — words whose senses are completely unrelated. This makes sense: unrelated senses have orthogonal vectors, making them easy to separate. The hardest cases are related senses (where the vectors point in similar directions) and rare senses (where the signal is weak).
python # Evaluate recovery quality for a polysemous word import numpy as np def recovery_quality(v_word, true_senses, recovered_senses): """R² between true and recovered sense vectors.""" # Match recovered to true senses (greedy by cosine sim) matched = [] remaining = list(range(len(true_senses))) for r in recovered_senses: sims = [np.dot(r, true_senses[j]) / ( np.linalg.norm(r) * np.linalg.norm(true_senses[j])) for j in remaining] best = remaining[np.argmax(sims)] matched.append((r, true_senses[best])) remaining.remove(best) # Compute average cosine similarity avg_cos = np.mean([np.dot(r, t) / ( np.linalg.norm(r) * np.linalg.norm(t)) for r, t in matched]) return avg_cos # Higher = better recovery
Word2vec (Mikolov et al., 2013): The embedding algorithm whose vectors this paper analyzes. Word2vec trains by predicting co-occurring words using shallow neural networks (skip-gram or CBOW).
GloVe (Pennington et al., 2014): An alternative embedding algorithm that explicitly factorizes the co-occurrence matrix. The superposition result applies to both word2vec and GloVe, because both algorithms implicitly perform the same matrix factorization — the only difference is in how they approximate the optimization. This universality is a strong point: the superposition property is a consequence of the co-occurrence structure of language, not the details of the training algorithm.
A latent variable model for NLP (Arora et al., 2016): The random walk discourse model from the same group. This earlier paper introduced the generative model that makes the superposition proof possible. The polysemy paper is essentially a corollary of the discourse model. Without the random walk framework, there would be no principled way to connect the generative process of text to the structure of word vectors — and no way to prove that superposition arises naturally from co-occurrence statistics.
The Johnson-Lindenstrauss lemma (1984): The classical result that random projections approximately preserve distances in high dimensions. This is the mathematical foundation for why random vectors are nearly orthogonal — and therefore why the linear approximation in the superposition proof works. The lemma states that n points in high dimensions can be projected to d = O(log n / ε2) dimensions with all pairwise distances preserved up to a (1 ± ε) factor. For word embeddings, this means d = 300 dimensions can preserve distances between up to n = exp(300) ≈ 10130 points — far more than any vocabulary.
Sparse coding (Olshausen & Field, 1996): The computational tool for recovering sense vectors. Originally developed for understanding V1 neurons in visual cortex — neurons respond to oriented edges, which form a sparse, overcomplete representation of visual scenes. The analogy is fitting and deep: both brains and word embeddings seem to use sparse superposition to pack multiple meanings into limited capacity.
Sememes in computational linguistics (Dong & Dong, 2003): The HowNet project in Chinese NLP manually annotated sememes (atomic semantic units) for over 100K Chinese and English words. This paper provides a computational analog — the sparse coding atoms correspond to automatically discovered sememes, bypassing the need for manual annotation.
Superposition hypothesis in neural networks (Elhage et al., 2022): The idea that neural network activations represent many features simultaneously through superposition is a direct descendant of this work. The mechanisms paper at Anthropic explicitly cites the analogy: neurons represent features as sparse linear combinations, just as word vectors represent senses as sparse linear combinations.
Sparse autoencoders for interpretability (Cunningham et al., 2023; Bricken et al., 2023): The use of sparse coding to extract interpretable features from neural network activations is methodologically identical to extracting sense vectors from word embeddings. Same math, different scale.
Contextualized embeddings (ELMo, BERT): These models produce different vectors for each occurrence of a word, effectively solving the polysemy problem by construction. In BERT, the vector for "bank" in "river bank" is completely different from "bank" in "bank account." This is the architectural solution to polysemy — instead of storing a superposition and recovering senses, generate sense-specific vectors on the fly using context.
But understanding WHY static embeddings superpose senses illuminates what contextual models are learning to separate. BERT's self-attention mechanism can be viewed as a dynamic sense disambiguation system: it reads the context, determines which sense is active, and generates a vector that matches that specific sense. The "sense vectors" that Arora et al. recover from word2vec are approximations to what BERT computes dynamically for each occurrence.
Polysemous embeddings (Athiwaratkun & Wilson, 2017): Instead of point vectors, represent each word as a Gaussian mixture in embedding space — one Gaussian per sense. This is an explicit probabilistic version of the linear superposition. The connection is direct: the mixture component means are the sense vectors, and the mixture weights are the sense frequencies. Arora et al.'s result explains WHY this probabilistic model works — because the point-estimate word vector already encodes the mixture structure linearly.
Word sense induction (Reisinger & Mooney, 2010): Early work on automatically discovering word senses from corpus statistics. Unlike Arora et al., this approach requires retraining embeddings with a sense-clustering step integrated into the training loop. The sparse coding approach is post-hoc — it works on any pre-trained vectors.
The superposition result has profound implications beyond word vectors. If meanings superpose linearly in 300-dimensional word2vec, what about the 12,288-dimensional hidden states of GPT-4? The same mathematical structure — high dimensions creating near-orthogonality, enabling many features to coexist in one vector — likely operates at massive scale in modern transformers. The difference is that transformers learn to dynamically compose and decompose these superpositions through attention, while word2vec just stores the static average.
The numbers are striking. In 300 dimensions, you can pack roughly d2/2 ≈ 45,000 nearly-orthogonal directions (this is a consequence of the Johnson-Lindenstrauss lemma). That's more than enough for a vocabulary of 50K words. In 12,288 dimensions, you can fit over 75 million nearly-orthogonal features. This explains why modern transformers can represent such complex, multifaceted concepts — the representational capacity grows quadratically with dimension.
Here's a practical application. Given a polysemous word and a context sentence, you can compute which sense is active by projecting the context vector onto each recovered sense direction:
python def disambiguate(word, context_words, sense_vecs, embeddings): """Which sense of `word` is active given context?""" # Average context word vectors ctx_vec = np.mean([embeddings[w] for w in context_words if w in embeddings], axis=0) # Project onto each sense direction scores = [] for sv in sense_vecs: cos = np.dot(ctx_vec, sv) / ( np.linalg.norm(ctx_vec) * np.linalg.norm(sv)) scores.append(cos) return np.argmax(scores) # index of best-matching sense # Example: "I deposited money at the bank" # context = ["deposited", "money"] # Returns sense 0 (financial) with high confidence
| Year | Work | Contribution |
|---|---|---|
| 2012 | Huang et al. | Multi-prototype embeddings: multiple vectors per word (needs retraining) |
| 2013 | Mikolov et al. (word2vec) | Efficient training of word embeddings; popularized d = 300 |
| 2014 | Pennington et al. (GloVe) | Co-occurrence factorization; same d = 300 convention |
| 2016 | Arora et al. | Random walk discourse model; word vectors as latent variable model |
| 2018 | This paper | Superposition of word senses; sparse coding recovery algorithm |
| 2018 | ELMo (Peters et al.) | Contextualized embeddings; different vector per occurrence |
| 2019 | BERT (Devlin et al.) | Deep bidirectional context; polysemy handled by architecture |
| 2022 | Elhage et al. (Anthropic) | Superposition hypothesis in neural networks; sparse autoencoders |
| 2023 | Bricken et al. (Anthropic) | Monosemantic features via dictionary learning (SAEs) |
The intellectual line from "word senses superpose linearly" to "neural network features superpose linearly" is direct and unbroken. The tools changed — from sparse coding on 300-dimensional word vectors to sparse autoencoders on million-dimensional transformer activations — but the mathematical principle is identical.
The connection to interpretability is not just intellectual. If features in neural networks are stored as superpositions (as the evidence strongly suggests), then understanding the math of superposition is critical for AI safety. Arora et al.'s work provides foundational theory for:
Several limitations of the paper deserve mention:
To fully understand this paper and its context, read in this order:
If Arora et al.'s paper had been published in 2022 instead of 2018, it would have been framed very differently — as a foundational result for the superposition hypothesis in mechanistic interpretability. The mathematical tools (sparse coding, dictionary learning, overcomplete representations) are identical to what the interpretability community independently developed. The delay in making this connection cost the field several years of potential progress.
The lesson: theoretical work in one subfield (word embeddings) can have unexpected applications in another (neural network interpretability). The math of superposition is universal — it doesn't care whether you're decomposing a 300-dimensional word vector or a 12,288-dimensional transformer hidden state. The principles are the same. The tools are the same (sparse coding / sparse autoencoders). Even the failure modes are the same (correlated features, rare features, overcrowded representations).
| Quantity | Typical value | What it means |
|---|---|---|
| R2 of linear fit | 0.82 - 0.97 | Superposition explains >80% of variance |
| Exponent ζ | 0.5 - 0.8 | Sublinear: rare senses get more than proportional weight |
| Active atoms per word | 3 - 8 | Strong sparsity; basis for sense clustering |
| Dictionary size m | ~2000 | Overcomplete: 4-7x the embedding dimension |
| SCWS improvement | +8-12% | Practical gain on word sense benchmarks |
| Computation time | ~15 min (CPU) | No GPU needed for sparse coding on 50K words |