Why 300? Not 50 or 5000? There's a principled bias-variance tradeoff — and an optimal formula for embedding dimension.
You're training word embeddings. The first thing you need to decide is: how many dimensions? Word2vec defaults to 300. GloVe uses 300. BERT uses 768. GPT-3 uses 12,288. Why these numbers?
The standard answer is: "we tried a few values and 300 worked best on our benchmark." This is not satisfying. There should be a principled way to choose the embedding dimension. Too few dimensions and you can't capture the complexity of language — the embeddings are too compressed to distinguish fine-grained semantic differences. Too many dimensions and you overfit to noise in the training data — the embeddings memorize spurious patterns that don't generalize.
Yin and Shen (2018) answer yes. They show that choosing the embedding dimension is fundamentally a bias-variance tradeoff — the same tradeoff that governs model complexity in all of statistics. They introduce a unified metric called PIP loss (Pairwise Inner Product loss) and derive the optimal dimension as a function of corpus statistics.
Drag the dimension slider to see how embedding quality changes. Too few dimensions = high bias (underfitting). Too many = high variance (overfitting). The sweet spot is in between.
The paper's insight transforms embedding dimension selection from a hyperparameter search into a calculation. Instead of training embeddings at d = 50, 100, 200, 300, 500 and comparing downstream performance, you can estimate the optimal d directly from the spectrum of the co-occurrence matrix.
To appreciate why this matters, consider the cost. Training word2vec on a 10-billion-token corpus at d = 300 takes roughly 12 GPU-hours. To sweep over 10 dimension values, that's 120 GPU-hours. For each dimension, you need to evaluate on multiple downstream tasks to find the peak. With Yin and Shen's formula, you compute the singular values of the co-occurrence matrix (a one-time cost) and read off the optimal d directly. The total compute is roughly equivalent to training at a single dimension.
But the contribution goes beyond efficiency. The formula tells you why 300 works well for English Wikipedia. It explains why you need more dimensions for larger corpora. And it predicts that a specialized medical corpus (with a narrow, concentrated vocabulary) needs fewer dimensions than a broad web corpus. These are genuine scientific insights, not just engineering tricks.
As a reference point, here are the embedding dimensions used by major models, and the approximate training corpus size:
| Model | Year | Dimension | Corpus size | Chosen how? |
|---|---|---|---|---|
| Word2vec | 2013 | 300 | 1.6B tokens | Empirical sweep |
| GloVe | 2014 | 300 | 6B tokens | Matched word2vec |
| ELMo | 2018 | 1,024 | 5.5B tokens | Empirical sweep |
| BERT-base | 2018 | 768 | 3.3B tokens | Architecture search |
| GPT-3 | 2020 | 12,288 | 300B tokens | Scaling laws |
Notice the trend: as corpora grow, dimensions grow. But is the relationship principled? That's what this paper answers — at least for the word2vec/GloVe era.
The question also has real engineering consequences. Each dimension costs memory. At d = 300 with a vocabulary of 400K words, the embedding table is 400K × 300 × 4 bytes = 480 MB. At d = 3000, it would be 4.8 GB — too large for many deployment scenarios. Knowing the optimal d prevents wasting memory on dimensions that carry only noise.
And there's a subtler cost: each unnecessary dimension is a degree of freedom that can overfit. If you train a classifier on top of word embeddings, noisy embedding dimensions inject noise into the classifier's input. The classifier then needs more training data to learn to ignore the noise. This cascading effect means over-dimensioned embeddings hurt downstream performance in ways that aren't obvious from the embedding quality alone.
There's also a curse of dimensionality angle. In high-dimensional spaces, distances between points become increasingly uniform — everything is roughly the same distance from everything else. If d is much larger than the true rank r* of the semantic space, the extra noise dimensions inflate all distances equally, washing out the genuine distance structure. K-nearest-neighbor algorithms become unreliable because the nearest neighbor in noisy high-dimensional space might not be the semantically nearest neighbor.
This was actually observed in practice before the paper provided the theoretical explanation. Practitioners found that PCA-reduced 200-dimensional GloVe vectors often outperformed the original 300-dimensional vectors on similarity tasks. The reduction removed noisy dimensions, improving the signal-to-noise ratio. Yin and Shen's framework explains exactly when and why this post-hoc dimensionality reduction helps.
To understand why embedding dimension matters, we need to look at what word embeddings are actually doing under the hood. The key insight: all embedding methods are performing a low-rank matrix factorization.
Start with the raw data: a matrix M of size |V| × |V|, where V is the vocabulary. Entry Mij counts how often word i appears near word j in the training corpus. For GloVe, M is the log co-occurrence matrix. For word2vec, M is related to the PMI (pointwise mutual information) matrix:
Word embeddings of dimension d are equivalent to taking the rank-d truncated SVD of M. The SVD decomposes M into:
where σ1 ≥ σ2 ≥ ... ≥ σ|V| are the singular values in decreasing order. The rank-d approximation keeps only the top d terms:
The word vectors are the rows of UdΣd1/2 — each word gets a d-dimensional vector that captures the d most important "directions" in the co-occurrence structure.
This equivalence between word2vec and SVD was proven by Levy and Goldberg (2014) in a landmark paper. Word2vec's skip-gram with negative sampling (SGNS) is implicitly factorizing the PMI matrix shifted by log(k), where k is the number of negative samples. GloVe is explicitly factorizing the log co-occurrence matrix. Both reduce to the same linear algebra problem: find the best rank-d approximation to a matrix.
This equivalence is powerful because SVD is one of the most thoroughly understood operations in mathematics. Decades of theory about optimal low-rank approximation, perturbation analysis, and noise behavior in truncated SVD become directly applicable to understanding word embeddings. Yin and Shen's paper exploits exactly this connection.
Let's build intuition for the spectrum. The first singular value, σ1, captures the most dominant pattern in the co-occurrence data. This is usually the distinction between content words and function words (the, is, a). The next few singular values capture broad semantic categories: animate vs. inanimate, concrete vs. abstract, positive vs. negative sentiment.
As you go deeper in the spectrum, the distinctions get finer. Singular value #50 might distinguish between types of animals. Singular value #200 might distinguish between breeds of dogs. Singular value #500 might just be capturing noise from the finite corpus — random co-occurrence patterns that don't generalize.
The challenge is: where in this progression does signal end and noise begin? That's exactly what the paper's theory answers.
A concrete way to think about this: singular value #1 separates "the/a/is" from content words — a distinction you need for almost any NLP task. Singular value #300 might distinguish "labrador" from "retriever" — useful for fine-grained tasks but not essential for sentiment analysis. Singular value #5000 might capture the fact that "mitochondria" and "ribosome" co-occur slightly more in a specific textbook chapter — a corpus artifact that won't generalize to new text. The question is where to draw the line.
The spectrum of M is the sequence of singular values σ1, σ2, ..., σ|V|. For real co-occurrence matrices, the spectrum follows a characteristic pattern: a few very large singular values, then a long tail of progressively smaller ones. The question is: where in this tail should we cut?
Empirically, the spectrum follows a power law: σi ≈ C · i−α for some constants C and α. For the Wikipedia co-occurrence matrix, α ≈ 0.8-1.0. This means the 100th singular value is roughly 100−0.9 ≈ 0.013 times the first singular value. By the 1000th, it's 0.001 times the first. The spectrum decays rapidly, concentrating most of the "information" in the top few hundred dimensions.
Why a power law? This is connected to Zipf's law — the observation that word frequencies in natural language follow a power law (the k-th most frequent word has frequency proportional to 1/ks). Since co-occurrence statistics are derived from word frequencies, and products/sums of power-law quantities produce power-law singular values, the power-law spectrum is a natural consequence of the power-law structure of language itself.
The decay rate α varies across corpora and captures the "concentration" of semantic structure. A technical corpus about a narrow domain (e.g., cardiology papers) has large α — most variation is along a few axes (disease types, treatments, patient demographics). A general web corpus has small α — semantic variation is spread across many different topics. The formula d* depends on α, which is why different corpora need different embedding dimensions.
To use the formula, you need to estimate α from your corpus. Here's how:
In practice, the power law holds remarkably well for indices between about 10 and k/2. The very top singular values can deviate (they're dominated by ultra-frequent words like "the" and "is"), and the tail is affected by noise. The middle range gives a clean estimate of α.
For Wikipedia, a typical plot shows log(σ100) − log(σ10) = α × (log(100) − log(10)) = α × log(10) ≈ 2.3α. If the 100th singular value is 1/8 of the 10th, then log(1/8) = −2.08, giving α ≈ 2.08/2.30 ≈ 0.90. This matches the paper's estimates.
Interestingly, α tends to be higher for specialized corpora (medical: α ≈ 1.1, legal: α ≈ 1.05) and lower for diverse corpora (web crawl: α ≈ 0.7, social media: α ≈ 0.65). This reflects the semantic concentration of the domain: specialized texts have most of their variation along a few semantic axes, while general texts spread variation more evenly across many directions.
This has a practical implication: domain-specific embeddings should use lower dimensions than general-purpose ones, even beyond what corpus size alone would suggest. A medical NLP system trained on 100M tokens of clinical notes might do best at d = 100-150, while a general-purpose system trained on the same 100M tokens from Wikipedia might need d = 200-250.
The paper's noise analysis connects to random matrix theory (RMT). The noise matrix E has singular values that follow the Marchenko-Pastur distribution, which predicts a maximum noise singular value of σE · (1 + √(|V|/n)). Any observed singular value above this threshold is likely signal; below it is likely noise. Counting the number of singular values above the Marchenko-Pastur threshold gives an alternative estimate of d* ≈ 250-350 for Wikipedia, consistent with both the formula and empirical observations.
But "most" isn't "all." If the true rank is r* = 500, then singular values #301-#500 carry genuine semantic information. Including them improves the embedding. Excluding them loses signal (bias). Including singular values #501+ adds noise (variance). The optimal d is the one that balances these two errors.
The spectrum of a co-occurrence matrix. Large singular values capture signal; small ones capture noise. Drag the cutoff to see how many you keep (= your embedding dimension).
python import numpy as np from scipy.sparse.linalg import svds # M: (|V|, |V|) PMI matrix (shifted positive) # Compute top-d singular values and vectors d = 300 U, S, Vt = svds(M, k=d) # Sort in decreasing order idx = np.argsort(-S) S = S[idx]; U = U[:, idx]; Vt = Vt[idx, :] # Word embeddings: rows of U * sqrt(S) embeddings = U * np.sqrt(S)[np.newaxis, :] # (|V|, d) # Inspect the spectrum print(f"Top-10 singular values: {S[:10]}") print(f"Singular value at d=300: {S[299]:.4f}") print(f"Fraction of variance captured: {np.sum(S[:d]**2)/np.sum(S**2):.4f}")
Real co-occurrence spectra follow a power law: σi ≈ C · i−α. Adjust α and C to see how different spectra concentrate information. Fast decay (high α) = fewer dimensions needed.
Now we formalize the intuition from Chapter 0. The error in our embedding comes from two sources that pull in opposite directions.
The observed co-occurrence matrix M is not the "true" co-occurrence structure of the language. It's a noisy estimate from a finite corpus. We can decompose it:
where M* is the true signal matrix (what we'd get from an infinite corpus) and E is the noise matrix (sampling error from finite data). The true signal matrix M* has some rank r* — the number of genuine "linguistic components" or latent semantic dimensions in the language.
Think of it this way. If you flip a coin 10 times, you might get 7 heads and 3 tails. The "true" probability is 0.5, but your estimate is 0.7. The error (0.2) is noise from insufficient data. The same thing happens with co-occurrence counts. If "doctor" and "nurse" co-occur 47 times in your corpus, the "true" co-occurrence rate might be 50 per million words, and the difference is sampling noise. Every entry in M has this kind of noise.
The noise matrix E has an important property: its singular values are roughly equal. This is because the noise in each entry is independent — it's like adding white noise to an image, which distributes energy equally across all frequencies. If M* has rank r* = 200 and your vocabulary is 50,000, then M has 50,000 singular values, but only the first 200 come primarily from signal. The remaining 49,800 come from noise.
The challenge: somewhere around singular value #200, signal and noise are mixed. The 180th singular value might be 80% signal and 20% noise. The 220th might be 40% signal and 60% noise. You can't cleanly separate them — you can only find the optimal cutoff where keeping one more noise-contaminated dimension hurts more than it helps.
This is precisely analogous to a classic problem in statistics and signal processing. In astronomy, it's deciding how many principal components to keep when analyzing galaxy spectra. In genomics, it's choosing how many gene expression factors are real vs. noise. In audio processing, it's determining how many Fourier coefficients to keep when compressing music. The math is identical in all these cases — only the domain differs.
The classical solution in statistics is the scree plot — plot the singular values in decreasing order and look for an "elbow" where the curve flattens. Below the elbow, you're in the noise floor. Above it, you're in the signal regime. But the scree plot is subjective — different people see the elbow at different places. Yin and Shen's formula replaces this subjective judgment with a principled calculation.
When we keep only d singular values, we throw away the remaining |V| − d. If d < r* (our embedding dimension is smaller than the true rank), we lose signal. The bias is the information about genuine linguistic structure that we discard:
This is the sum of squared true singular values that we threw away. Bias decreases as d increases — more dimensions means less lost signal.
When d > r*, we're including singular values that come entirely from sampling noise. Even when d ≤ r*, the estimated singular values are contaminated by noise. The variance measures how much our embedding fluctuates due to random sampling:
where n is the corpus size and σnoise is the noise level. Variance increases linearly with d — each additional dimension adds another degree of freedom that can overfit to noise.
If you've studied machine learning, this should feel familiar. A model with too few parameters underfits (high bias). A model with too many parameters overfits (high variance). The optimal complexity is in between. For word embeddings, the "model complexity" is the dimension d, and the "parameters" are the d × |V| entries of the embedding matrix.
With d = 50 and |V| = 50,000, the embedding has 2.5 million parameters — not enough to capture all the semantic structure in a billion-word corpus. With d = 5,000 and |V| = 50,000, the embedding has 250 million parameters — more than enough to memorize noise in the co-occurrence counts. The sweet spot (d ≈ 300, giving 15 million parameters) is where the model has enough capacity to capture signal but not enough to memorize noise.
Let's push the telescope analogy further, because it's load-bearing. You're photographing a distant nebula:
The optimal pixel size depends on conditions. On a clear night with a good telescope (low noise, lots of data), you can use more pixels. On a foggy night (high noise), you need fewer, larger pixels. Yin and Shen's formula tells you exactly how many pixels to use, given the conditions.
For readers with a statistics background, the bias-variance tradeoff for embedding dimension is a special case of the general model selection problem. In regression, you choose between a simple model (linear, 2 parameters) and a complex model (polynomial degree 10, 11 parameters). The simple model has high bias but low variance; the complex model has low bias but high variance. The optimal complexity minimizes the expected prediction error.
For embeddings, "model complexity" = dimension d, "parameters" = the d × |V| entries of the embedding matrix, "data" = the co-occurrence counts, and "prediction error" = PIP loss. The formula d* = (C2n/σE2)1/(2α) is the embedding analog of the AIC/BIC model selection criteria in statistics. It balances goodness of fit (bias) against parsimony (variance) to find the model (dimension) that will generalize best to new text.
This perspective also connects to degrees of freedom in statistics. Each embedding dimension uses |V| degrees of freedom (one per word). The total degrees of freedom in a d-dimensional embedding is d × |V|. When this exceeds the effective number of independent observations in the corpus, the model is overfit. The formula implicitly encodes this constraint.
Consider a concrete example: d = 300, |V| = 50,000. Total parameters: 15 million. For a 1-billion-token corpus with a window size of 5, the effective number of co-occurrence observations is roughly 50,000 × 50,000 = 2.5 billion, but most are zero (sparse matrix), giving maybe 500 million nonzero entries. With 15 million parameters and 500 million observations, we have a healthy ratio of 33 observations per parameter. At d = 3000, we'd have 150 million parameters — only 3.3 observations per parameter, which is dangerously close to overfitting. The formula captures exactly when this ratio becomes problematic.
There's also an information-theoretic way to think about the optimal dimension. Each co-occurrence count in the matrix carries some information about the language. The total information in the corpus is finite — roughly proportional to n log(|V|) bits. The embedding stores d × |V| × 32 bits of information (assuming 32-bit floats). When the embedding's information capacity exceeds the information in the corpus, the extra capacity must be filled with noise. The optimal d is where the embedding's capacity matches the information in the data.
This is the minimum description length (MDL) principle applied to embeddings: the best model is the one that most efficiently encodes the data. Too few dimensions = the encoding is lossy (can't represent the data). Too many = the encoding wastes bits on noise. The sweet spot is the most compact encoding that captures all the genuine signal.
Interestingly, the U-shaped performance curve that Yin and Shen observe is closely related to the double descent phenomenon discovered later (Belkin et al., 2019). In classical statistics, increasing model complexity always eventually hurts (the right side of the U-curve). But in over-parameterized deep learning, performance can improve again after the interpolation threshold — creating a "double descent" curve with a second improvement phase at very high dimensions.
For word embeddings, Yin and Shen only observe the classical U-shape — no second descent. This makes sense: word embeddings are a simple linear model, and double descent requires the highly over-parameterized regime where the model can memorize the training data perfectly. At d > |V| (more dimensions than vocabulary), word embeddings are over-parameterized, but the SVD automatically regularizes by sorting singular values. Understanding when the bias-variance U-curve gives way to double descent is an active area of research.
To fully understand this paper and its foundations, read in this order:
The total error (in terms of PIP loss, which we'll define properly in the next chapter) is:
The optimal d minimizes this sum. As d increases from 0, bias drops rapidly (we're capturing important signal) while variance grows slowly. At some point, the variance increase from adding one more dimension exceeds the bias decrease — that's the optimal dimension.
Suppose the true singular values are σi* = 10/i (a power law with α = 1, C = 10), the noise per dimension is σnoise = 0.5, and the corpus gives us n = 10,000 effective observations. At d = 20:
Bias dominates — we should use more dimensions. At d = 200:
Still bias-dominated, but the gap is closing. At d = 1000:
Now variance is approaching bias. The minimum of bias + variance occurs around d* ≈ 400 for these parameters. This matches the formula: d* = (C2n/σE2)1/(2α) = (100 · 10000/0.25)0.5 = (4,000,000)0.5 ≈ 2000. (In practice the effective n is much smaller than the raw corpus size due to correlations, which brings d* down to the hundreds.)
Drag the dimension slider to see the bias, variance, and total error at that specific d. Watch how the dominant error source shifts from bias (at low d) to variance (at high d).
Watch how bias (decreasing) and variance (increasing) change with embedding dimension. The total error (white) has a minimum at the optimal d. Drag the noise slider to see how noise affects the optimum.
Before Yin and Shen's paper, there was no good way to measure embedding quality that was both theoretically grounded and applicable to any downstream task. Different evaluation metrics (word similarity, analogy, NER accuracy) often disagreed about which embedding was best. The paper introduces PIP loss as a unified solution.
PIP stands for Pairwise Inner Product. The idea is simple but powerful: what matters about word embeddings is the geometry — specifically, the inner products (and therefore cosine similarities) between word vectors. All downstream tasks ultimately depend on how words relate to each other, and inner products capture that.
Consider what you actually DO with word vectors. You compute similarity (cosine, which is a normalized inner product). You solve analogies (vector arithmetic, which is addition and subtraction of vectors — both preservable by inner products). You cluster them (k-means, which depends on distances, which are inner products). You feed them to classifiers (linear classifiers are literally inner products with learned weights). Every use of word vectors goes through inner products.
This observation leads to a key insight: two sets of embeddings that produce the same inner products are functionally identical, even if the individual vectors look different. The inner product structure is all that matters.
Here's a concrete example. Train word2vec twice with different random seeds. The resulting vectors will be completely different — "king" might point northeast in one run and southwest in another. But the inner products are nearly identical: cos(king, queen) ≈ 0.65 in both runs. The absolute positions are arbitrary; the relative geometry is what carries semantic information. PIP loss captures exactly this relative geometry.
where W is the |V| × d embedding matrix (rows are word vectors). The PIP matrix is |V| × |V|, with entry (i,j) being the inner product of the embeddings for words i and j. It captures the complete relational geometry of the embeddings.
The PIP loss between two sets of embeddings W and W' is:
This measures how much the pairwise inner product structure differs between two embedding spaces. It's invariant to rotations (applying an orthogonal transformation to W doesn't change WWT), which is important because the absolute orientation of the embedding space is arbitrary — only relative positions matter.
For the "true" embeddings (from infinite data), the PIP matrix equals the true signal matrix M*. So we can write the PIP loss between our estimated embeddings and the truth as:
This is exactly the Frobenius norm error of the rank-d approximation to M*, which decomposes cleanly into bias and variance as we showed in Chapter 2.
A natural question: why should PIP loss — a purely geometric metric — predict performance on tasks like NER or sentiment analysis? The paper argues this theoretically: any linear function of word vectors (which includes most classic NLP features) depends only on the Gram matrix WWT. Two embedding matrices with the same PIP matrix produce identical linear classifiers, identical cosine similarities, and identical analogy solutions. PIP loss captures exactly the task-relevant information.
For nonlinear models (like neural networks), PIP loss is an approximation — but in practice, the authors show it correlates strongly with nonlinear downstream performance too. The intuition is that the geometric structure captured by inner products is the primary information that all models, linear or not, rely on.
python import numpy as np def pip_loss(W1, W2): """Compute PIP loss between two embedding matrices.""" # W1, W2: (|V|, d) embedding matrices G1 = W1 @ W1.T # (|V|, |V|) PIP matrix G2 = W2 @ W2.T n = W1.shape[0] return np.sum((G1 - G2)**2) / (n * n) # Verify rotation invariance d = 50 W = np.random.randn(100, d) * 0.1 Q, _ = np.linalg.qr(np.random.randn(d, d)) # random rotation W_rotated = W @ Q print(pip_loss(W, W_rotated)) # ≈ 0.0 (machine precision) # Compare with perturbed embeddings W_noisy = W + np.random.randn(*W.shape) * 0.01 print(pip_loss(W, W_noisy)) # > 0 (perturbation changes geometry)
Two sets of word embeddings produce PIP matrices. The difference between them (PIP loss) captures how much the relational structure disagrees. Drag to add a rotation — PIP loss should stay zero.
The paper proves that the expected PIP loss can be exactly decomposed:
where Lbias decreases with d (capturing more signal) and Lvar increases with d (fitting more noise). This is the formal justification for the bias-variance tradeoff we discussed.
Now we derive the main result: the formula for the optimal embedding dimension d*.
The paper models the singular values of the true signal matrix as following a power law (which is empirically well-supported):
where α > 0 is the spectral decay rate and C is a constant. Typical values for real co-occurrence matrices are α ∈ [0.5, 1.5]. A large α means the spectrum decays rapidly — most information is concentrated in the top few singular values. A small α means a slow decay — information is spread across many dimensions.
The bias from truncating at dimension d is the sum of squared discarded singular values:
For large d, this sum is well-approximated by an integral (a standard technique for summing monotone series):
When α > 1/2, we have 1 − 2α < 0, so the term at infinity vanishes:
This is valid when α > 1/2, which is the typical regime for real co-occurrence matrices. Notice that bias is a decreasing function of d (since the exponent 1 − 2α is negative). Each additional dimension reduces bias, but with diminishing returns — the (d+1)-th singular value is smaller than the d-th.
The variance from including d dimensions, each contaminated by noise level σE, is:
where n is effectively the number of independent observations (related to corpus size). This is an increasing linear function of d. Why linear? Because each dimension contributes independently to the variance — the noise in dimension 1 is independent of the noise in dimension 2 (by the orthogonality of singular vectors). So total noise = sum of per-dimension noise = d times the noise per dimension.
The noise per dimension, σE2/n, decreases with corpus size. This is the law of large numbers: more observations → better estimates → less noise. A corpus with 1 billion tokens has 10x less noise per co-occurrence entry than a corpus with 100 million tokens.
Putting the two together, we can visualize the tradeoff as a tug-of-war. Increasing d by 1 gives us:
The optimal d is where these two are equal: the marginal benefit of the next dimension equals the marginal cost. Beyond this point, every additional dimension hurts more than it helps.
The total error is:
To find the minimum, take the derivative with respect to d and set it to zero:
Simplifying (note that (1−2α)/(2α−1) = −1):
Solving for d:
The formula has a beautiful interpretation. It says: keep adding dimensions until the marginal signal from the next singular value drops below the noise floor. The "noise floor" is set by σE2/n — less data means a higher noise floor, which means you should use fewer dimensions.
For a typical English corpus with 1 billion tokens and a vocabulary of 100K words, the formula gives d* in the range 200-400 — remarkably close to the empirically popular choice of 300!
The formula reveals four independent "knobs" that control the optimal dimension:
python import numpy as np def optimal_dimension(singular_values, corpus_tokens, vocab_size): """Estimate d* from the singular value spectrum.""" # Estimate alpha by fitting power law to spectrum log_idx = np.log(np.arange(1, len(singular_values)+1)) log_sv = np.log(singular_values + 1e-10) # Linear regression: log(sigma_i) = log(C) - alpha * log(i) alpha, log_C = -np.polyfit(log_idx, log_sv, 1) C = np.exp(log_C) # Estimate noise from tail of spectrum tail = singular_values[int(len(singular_values)*0.8):] sigma_E = np.std(tail) # Effective n (corpus tokens / vocab gives avg freq) n_eff = corpus_tokens / vocab_size # Formula: d* = (C^2 * n / sigma_E^2) ^ (1 / 2*alpha) d_star = (C**2 * n_eff / sigma_E**2) ** (1 / (2 * alpha)) return int(np.round(d_star)), alpha, C, sigma_E # Example: Wikipedia corpus # d_star ≈ 275, alpha ≈ 0.9, C ≈ 12.3, sigma_E ≈ 0.4
Adjust the spectral decay, noise level, and corpus size to see how the optimal dimension changes. The formula predicts where bias and variance cross.
The theory is elegant, but does it match reality? The paper validates the optimal dimension formula on real embedding benchmarks.
So far, we've built a complete theory: co-occurrence matrices have a spectral structure (Chapter 1), truncating them creates a bias-variance tradeoff (Chapter 2), PIP loss quantifies this tradeoff (Chapter 3), and the optimal cutoff has a closed-form formula (Chapter 4). Now we need to check: does reality agree with the theory?
The authors train GloVe and word2vec embeddings on the Wikipedia corpus at dimensions d = 25, 50, 100, 200, 300, 400, 500, 800, 1000. They evaluate on:
On every benchmark, performance first improves as d increases, reaches a peak, then degrades. This is exactly what the bias-variance theory predicts. The peak occurs at different d values for different tasks, but it's always in the range 200-400 for the Wikipedia corpus.
This was not previously understood. Many NLP practitioners assumed "more dimensions = better" and used the largest d their memory could support. The paper shows this is wrong — there is a real, measurable performance decrease at high dimensions. For the word similarity task, d = 1000 performs 15% worse than d = 300. That's not noise — it's systematic overfitting to co-occurrence noise.
The U-shape is more pronounced for smaller corpora. On the 10M-token subset, the peak is sharper and occurs at a lower d (around 100). On the 1B-token corpus, the curve is flatter near the peak — d = 200 and d = 400 perform similarly, both near the optimum. This makes sense: with more data, the noise floor is lower, so more dimensions carry genuine signal, and the transition from "useful dimension" to "noise dimension" is more gradual.
The PIP loss between the learned embeddings and the "ideal" embeddings (estimated using a held-out portion of the corpus) tracks task performance remarkably well. The d that minimizes PIP loss is within 10-20% of the d that maximizes downstream accuracy, even though PIP loss knows nothing about the downstream task.
This is a remarkable result. PIP loss is computed purely from the embedding geometry — no labels, no task-specific evaluation. Yet it predicts which dimension will perform best on word similarity, analogy, NER, and chunking. The correlation between PIP loss rank and downstream performance rank is above 0.95 across all tasks tested.
The authors estimate α from the corpus's singular value spectrum by fitting a line in log-log space. They estimate σE from the tail of the spectrum (the last 10% of singular values, assumed to be pure noise). With these estimates, the formula d* = (C2n/σE2)1/(2α) gives d* ≈ 275 for the Wikipedia corpus, very close to the empirically optimal d = 300.
Across three different corpora (Wikipedia 1B tokens, Google News 100B tokens, and a small 10M token subset), the formula's predictions are consistently within 10-15% of the empirically measured optima. This accuracy is especially impressive because the empirical optima are themselves uncertain — the performance curves are quite flat near the peak, making the exact optimum hard to pinpoint within ±50 dimensions.
Performance across dimensions for different benchmarks on the Wikipedia corpus. Toggle between metrics to see the U-shaped curve.
| Dimension d | SimLex ρ | Analogy % | NER F1 |
|---|---|---|---|
| 50 | 0.31 | 52.1 | 87.2 |
| 100 | 0.36 | 63.8 | 89.1 |
| 200 | 0.39 | 72.4 | 90.3 |
| 300 | 0.41 | 75.1 | 90.8 |
| 500 | 0.39 | 74.3 | 90.2 |
| 1000 | 0.35 | 70.8 | 89.4 |
The authors also train on corpora of different sizes (10M, 100M, 1B tokens) and domains. The optimal d shifts as the theory predicts:
This explains a persistent puzzle in NLP: why do some papers report that d = 100 works best while others find d = 300 optimal? The answer is corpus size. Papers using small corpora (academic datasets, domain-specific text) genuinely do better with lower dimensions. Papers using large web crawls or Wikipedia need higher dimensions. There was never a contradiction — just different points on the same bias-variance curve.
The authors propose using PIP loss as a model selection criterion. Instead of evaluating on downstream tasks (which requires labeled data and task-specific pipelines), compute PIP loss between the embeddings trained at dimension d and a reference set trained at very high dimension on a held-out corpus. The d that minimizes PIP loss is the optimal choice. This is faster and requires no labeled data.
The degradation at high dimensions is subtle and worth understanding. When d = 1000 on a Wikipedia corpus, the embeddings are fitting patterns like "in the training corpus, the word 'photosynthesis' appeared 3 times near 'basketball' by random chance." These spurious co-occurrences become features in the embedding. When you evaluate on a different dataset (the test set), these noise features don't generalize — they actively hurt, because the noise in the test set is different from the noise in the training set.
You can detect this overfitting empirically. Train two sets of embeddings on non-overlapping halves of the same corpus at the same dimension d. Compute the PIP loss between them. At low d, the PIP loss is small — both halves agree on the broad semantic structure. At high d, the PIP loss increases — each half is fitting its own noise, and the noise in the two halves is different. The d at which PIP loss starts rising sharply is approximately d*. This is a clean, practical diagnostic that requires no downstream evaluation.
This is classical overfitting, but it looks different from what you might expect. The model isn't memorizing training examples — it's memorizing statistical noise in the co-occurrence matrix. The embeddings look reasonable (words still cluster by topic), but their fine-grained structure includes random artifacts that degrade downstream performance by 2-5%.
python # Demonstrate the overfitting effect import numpy as np def simulate_dimension_sweep(true_rank=200, vocab=5000, corpus_tokens=100_000_000, noise_scale=0.5): """Simulate how performance varies with embedding dimension.""" # True signal: rank-r matrix with power-law spectrum sigmas = np.array([10 * i**(-0.8) for i in range(1, true_rank+1)]) results = [] for d in [25, 50, 100, 200, 300, 500, 800, 1000]: # Bias: sum of squared sigmas we discard bias = sum(s**2 for s in sigmas[d:]) if d < true_rank else 0 # Variance: proportional to d * noise^2 / n variance = d * noise_scale**2 * vocab / corpus_tokens total = bias + variance results.append((d, bias, variance, total)) return results # [(d, bias, var, total), ...]
Let's put the full theory into practice. This interactive simulation lets you generate a synthetic co-occurrence matrix with a known true rank, add noise, and watch how different embedding dimensions perform. You can see the bias-variance tradeoff in action and verify that the optimal dimension formula works.
Generate a synthetic language with a known true dimensionality. Add noise to simulate finite sampling. Then sweep through dimensions and watch bias, variance, and total error evolve. The predicted optimal d (formula) is shown as a vertical line. Click "Generate" to create a new scenario.
Play with the four sliders above and watch how the spectrum (top) and bias-variance curves (bottom) respond. Key observations:
| Corpus size | Recommended d | Rationale |
|---|---|---|
| 1M tokens | 50-100 | Small corpus, high noise — can't trust many dimensions |
| 100M tokens | 150-250 | Moderate data — signal emerges in hundreds of dimensions |
| 1B tokens | 250-400 | Large corpus — noise floor is low, 300+ dimensions carry signal |
| 10B+ tokens | 400-1000 | Massive data — can support higher dimensionality |
Here is a complete pipeline to estimate the optimal embedding dimension from a co-occurrence matrix, without training at multiple dimensions:
python import numpy as np from scipy.sparse.linalg import svds from scipy.optimize import minimize_scalar def find_optimal_d(cooc_matrix, corpus_size, max_d=1000): """Estimate optimal embedding dim from co-occurrence matrix.""" # Step 1: Compute top singular values k = min(max_d, cooc_matrix.shape[0] - 1) _, S, _ = svds(cooc_matrix, k=k) S = np.sort(S)[::-1] # descending order # Step 2: Fit power law: log(sigma_i) = log(C) - alpha*log(i) log_i = np.log(np.arange(1, len(S)+1)) log_s = np.log(S + 1e-10) # Use middle range (avoid edge effects) mid = slice(10, len(S)//2) alpha, log_C = -np.polyfit(log_i[mid], log_s[mid], 1) C = np.exp(log_C) # Step 3: Estimate noise from spectrum tail tail = S[int(len(S)*0.9):] sigma_E = np.mean(tail) # Step 4: Apply formula n_eff = corpus_size / cooc_matrix.shape[0] d_star = int((C**2 * n_eff / sigma_E**2) ** (1/(2*alpha))) print(f"Fitted: alpha={alpha:.2f}, C={C:.2f}, sigma_E={sigma_E:.4f}") print(f"Optimal dimension: d* = {d_star}") return d_star
Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014): The embedding algorithms whose dimension this paper analyzes. Both implicitly perform low-rank matrix factorization, which is why the spectral perspective applies.
Levy & Goldberg (2014): Showed that word2vec's skip-gram with negative sampling is implicitly factorizing the shifted PMI matrix: Mij = PMI(wi, wj) − log(k), where k is the number of negative samples. This connection is the essential bridge that lets Yin & Shen apply matrix perturbation theory to word embeddings. Without it, there would be no matrix to analyze and no spectrum to decompose.
Matrix perturbation theory (Weyl, Davis-Kahan, etc.): The mathematical tools for analyzing how eigenvalues and eigenvectors change when a matrix is perturbed by noise. The bias-variance decomposition follows directly from these classical results. Weyl's theorem bounds how much singular values can change under perturbation. The Davis-Kahan theorem bounds how much singular vectors rotate. Together, they give the variance term in the PIP loss decomposition.
Arora et al. (2016): The random walk discourse model that provides a generative story for word embeddings. This model predicts that the co-occurrence matrix should have a power-law spectrum (which it does), validating one of Yin & Shen's key assumptions.
Efficient embedding training: Instead of training at multiple dimensions and comparing, practitioners can estimate d* from the corpus spectrum. This saves significant compute for large-scale embedding projects.
Adaptive dimensionality: The formula shows that different corpora need different dimensions. A medical corpus (specialized vocabulary, fast spectral decay) needs fewer dimensions than a general web corpus. This insight has been widely adopted in domain-specific NLP, where practitioners now routinely use lower-dimensional embeddings for specialized domains rather than blindly using d = 300.
Understanding scaling laws: The power-law spectrum and optimal-d formula are early examples of scaling analysis in NLP. The insight that more data supports more parameters (dimensions) foreshadows the neural scaling laws of Kaplan et al. (2020).
Embedding compression: The theory suggests that many deployed embeddings are over-dimensioned for their corpus size. For mobile and edge applications, knowing that d = 100 is near-optimal for a small domain corpus (instead of the default d = 300) can reduce memory by 3x with minimal quality loss.
Cross-lingual insight: Different languages have different semantic structures (and therefore different spectral properties). Yin and Shen's framework predicts that morphologically rich languages (like Turkish or Finnish, where a single word can carry many meanings through inflection) might need different dimensions than analytic languages (like Chinese). This prediction has been partially confirmed by subsequent work on multilingual embeddings.
Post-training compression: The theory provides a principled basis for reducing embedding dimensions after training. Instead of retraining at lower d, you can simply truncate the SVD of the learned embeddings. The paper shows that PCA reduction to the theoretical d* preserves most task performance while reducing memory by 50-70%. This is especially valuable for deploying embeddings on resource-constrained devices like phones, where every megabyte of model size matters.
Domain adaptation: When fine-tuning pre-trained embeddings on a small domain corpus, the formula suggests reducing dimensions. The small corpus has higher noise and lower effective n, pushing d* downward. Practitioners who fine-tune 300-dimensional embeddings on a 1M-token domain corpus are likely overfitting to noise in dimensions 100+. The formula recommends training at d = 50-80 for such scenarios.
Despite these limitations, the paper established a foundational principle: embedding dimension is a bias-variance tradeoff with a mathematically optimal solution. Even in the era of large transformers, this insight guides architecture design — the hidden dimension of BERT/GPT is chosen to balance representational capacity against the risk of overfitting, exactly as Yin and Shen described for word embeddings.
An exciting open question is whether a similar formula exists for transformers. The challenge is that transformers don't just factorize a co-occurrence matrix — they learn hierarchical, nonlinear representations across many layers. Each layer has its own "effective dimensionality," and the optimal dimension might vary by layer (indeed, recent work on layer-wise pruning confirms this).
However, the core insight — that there's a finite amount of "semantic signal" in a corpus, and each additional dimension needs data to justify — almost certainly generalizes. The neural scaling laws (loss ∝ N−α for model size N) are the deep learning analog of Yin and Shen's bias-variance curve. Both describe the same phenomenon: diminishing returns from adding capacity beyond what data can support.
Modern transformers use much larger embedding dimensions: BERT uses 768, GPT-3 uses 12,288. Are these optimal? The formula suggests they might be — these models train on far more data (hundreds of billions of tokens), which supports much higher d*. But the formula assumes linear embeddings, while transformers use highly nonlinear representations. Extending the bias-variance framework to deep models remains an open problem.
However, the intuition transfers perfectly. The neural scaling laws of Kaplan et al. (2020) show that transformer performance follows power laws in model size, data size, and compute — precisely the kind of relationship Yin and Shen found for embedding dimensions. The optimal model size increases sublinearly with data size, just as d* increases sublinearly with corpus size. The bias-variance logic is the same; only the parameterization differs.
| Model | Dimension | Training tokens | d/log(n) ratio |
|---|---|---|---|
| Word2vec | 300 | 109 | 33 |
| BERT-base | 768 | 3 × 109 | 81 |
| GPT-2 | 1,024 | 8 × 109 | 103 |
| GPT-3 | 12,288 | 3 × 1011 | 1,069 |
The d/log(n) ratio grows as models scale, consistent with the theory that more data supports higher dimensionality — though modern models use far more parameters than the formula predicts, suggesting that nonlinear representations can extract more signal per dimension than linear embeddings.
An interesting observation: GPT-3's dimension (12,288) is roughly (300) × (300B/1B)1/2 ≈ 300 × 17 ≈ 5,200. The actual dimension is about 2.4x larger, suggesting either that nonlinear models benefit from extra capacity or that the scaling exponent is different for transformer architectures. Answering this precisely would require extending Yin and Shen's framework to deep models — an important open problem.
| Quantity | Value | Implication |
|---|---|---|
| Optimal d for Wikipedia (1B tokens) | ~275 | Close to the empirical standard d = 300 |
| Spectral decay α | 0.7 - 1.2 | Determines how fast information concentrates in top dims |
| Formula | d* = (C2n/σE2)1/2α | Derivable from corpus spectrum alone |
| PIP loss vs performance correlation | ρ > 0.95 | PIP loss predicts downstream quality without labels |
| Degradation at d = 1000 vs optimal | -10 to -15% | Over-dimensioning is not harmless |
| Formula accuracy | ±10-15% | Close enough for practical dimension selection |