The best of both worlds — combine the global statistical information of count-based methods like LSA with the local context-window power of Word2Vec, by training on the co-occurrence matrix with a clever weighted least-squares objective.
By 2014, there were two fundamentally different approaches to learning word vectors, and they seemed unrelated.
Latent Semantic Analysis (LSA) and related methods build a large word-context co-occurrence matrix, then reduce its dimensionality using SVD (Singular Value Decomposition). They capture global statistics — how often every word appears with every other word across the entire corpus.
Word2Vec trains a neural network to predict context words from a center word (or vice versa) using a sliding window. It captures local patterns — which words appear near each other in a window of 5-10 words.
The Stanford NLP group asked: why do these approaches give different results? Can we get the best of both? The answer was GloVe (Global Vectors) — a model that trains on the global co-occurrence matrix but uses a log-bilinear objective that produces the same kind of linear structure as Word2Vec.
Count-based methods look at the whole matrix at once; predictive methods slide a window. GloVe combines both.
GloVe starts by building a word-word co-occurrence matrix X from the entire corpus. Entry $X_{ij}$ counts how many times word j appears in the context of word i.
Slide a symmetric window of size c around each word. For each (center, context) pair, increment $X_{\text{center},\text{context}}$. But with a twist: GloVe uses harmonic weighting, so context words that are farther from the center contribute less. A word at distance d contributes 1/d to the count.
After processing the entire corpus, X is a V × V matrix. It is sparse (most word pairs never co-occur) and symmetric ($X_{ij} = X_{ji}$ for symmetric windows).
The total co-occurrence count for word i (sum of row i):
$$X_i = \sum_k X_{ik}$$
The probability that word j appears in the context of word i:
$$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}$$
For a real corpus:
| Corpus | Vocab V | Matrix size V² | Nonzero entries | Density |
|---|---|---|---|---|
| Wikipedia 6B | 400K | 160 billion | ~1 billion | 0.6% |
| Common Crawl 42B | 1.9M | 3.6 trillion | ~10 billion | ~0.3% |
The matrix is extremely sparse. GloVe only processes nonzero entries, which is why it scales to enormous vocabularies. The co-occurrence matrix is stored as a sparse data structure (list of (i, j, X_ij) triples), not as a dense V × V array.
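The density column is simply the number of nonzero entries divided by V². A quick check of that arithmetic, using only the rounded figures from the table (the dictionary layout is an illustrative choice):

```python
# Density = nonzero entries / V^2, using the approximate figures from the table.
corpora = {
    "Wikipedia 6B": {"vocab": 400_000, "nonzero": 1e9},
    "Common Crawl 42B": {"vocab": 1_900_000, "nonzero": 1e10},
}

density = {}
for name, stats in corpora.items():
    total_cells = stats["vocab"] ** 2
    density[name] = stats["nonzero"] / total_cells
    print(f"{name}: {total_cells:.2e} cells, density {density[name]:.2%}")
```

Both corpora fill well under 1% of the possible cells, which is why iterating over nonzero triples is so much cheaper than touching a dense array.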
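For the sentence "the cat sat on the mat" with c = 2 and harmonic weighting, each cell accumulates contributions from all co-occurrence events once the whole sentence has been scanned. For example, "sat" sees "cat" and "on" at distance 1 (1.0 each) and both occurrences of "the" at distance 2 (0.5 each), so the ("sat", "the") cell ends at 1.0. The harmonic weighting 1/d naturally down-weights distant context words.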
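The counts for this one sentence can be reproduced with a few lines. This is a minimal sketch that assumes whitespace tokenization and keys the matrix by word pairs rather than vocabulary indices; the function name `cooccurrence` is illustrative.

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Symmetric window with 1/d harmonic weighting, for a single sentence."""
    X = defaultdict(float)
    for i, center in enumerate(tokens):
        for d in range(1, window + 1):
            for j in (i - d, i + d):
                if 0 <= j < len(tokens):
                    X[(center, tokens[j])] += 1.0 / d
    return dict(X)

X = cooccurrence("the cat sat on the mat".split(), window=2)
for (center, context), value in sorted(X.items()):
    print(f"X[{center}, {context}] = {value:.1f}")
```

Because every pair is visited once from each side, the resulting table is symmetric, and pairs farther apart than the window (such as "cat" and "mat") never appear.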
The matrix is built up as a window slides across a sentence. Brighter cells = higher co-occurrence counts.
```python
import numpy as np
from collections import defaultdict

def build_cooccurrence(corpus, vocab, window=10):
    """
    corpus: list of list of int (tokenized, vocab-indexed sentences)
    vocab: dict mapping word -> index
    window: context window size
    Returns: sparse co-occurrence matrix (dict of dicts)
    """
    V = len(vocab)
    cooccur = defaultdict(lambda: defaultdict(float))
    for sentence in corpus:
        for i, center in enumerate(sentence):
            for d in range(1, window + 1):
                weight = 1.0 / d  # harmonic weighting
                for offset in [-d, d]:
                    j = i + offset
                    if 0 <= j < len(sentence):
                        context = sentence[j]
                        cooccur[center][context] += weight
    return cooccur  # cooccur[i][j] = X_ij

# For a corpus of 6B tokens with V=400K words:
# - X has ~400K x 400K = 160B entries
# - But only ~1B are nonzero (0.6% density)
# - Stored as sparse matrix: ~10 GB
```
Both methods use a sliding context window, but handle the counts differently:
| Property | Word2Vec (Skip-gram) | GloVe |
|---|---|---|
| Counting | Implicit (each pair is a training example) | Explicit (build X matrix first) |
| Distance weighting | Implicit (the effective window size is sampled per center word, so nearer words are used more often) | 1/d harmonic weighting |
| Frequency weighting | Subsampling frequent words + noise dist f^0.75 | f(X_ij) = min((X/x_max)^0.75, 1) |
| Memory | O(V · d) for embeddings only | O(nonzero entries) for X + O(V · d) for vectors |
| Passes over data | Stream corpus once (or few times) | One pass to build X, then iterate on X |
This chapter contains GloVe's most important contribution: the insight that word meaning is encoded not in raw co-occurrence probabilities, but in their ratios.
Consider two target words: "ice" and "steam." We want to learn vectors that capture their relationship. Let's look at their co-occurrence probabilities with various probe words k:
| Probe word k | P(k \| ice) | P(k \| steam) | P(k \| ice) / P(k \| steam) |
|---|---|---|---|
| solid | 1.9 × 10⁻⁴ | 2.2 × 10⁻⁵ | 8.9 (large: "solid" is much more ice-like) |
| gas | 6.6 × 10⁻⁵ | 7.8 × 10⁻⁴ | 0.085 (small: "gas" is much more steam-like) |
| water | 3.0 × 10⁻³ | 2.2 × 10⁻³ | 1.36 (near 1: "water" is related to both) |
| fashion | 1.7 × 10⁻⁵ | 1.8 × 10⁻⁵ | 0.96 (near 1: "fashion" is unrelated to both) |
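The raw probabilities P(k | ice) and P(k | steam) are hard to interpret in isolation: they're tiny numbers that depend on the overall frequency of each word. But the ratio tells a clear story. It is much larger than 1 for probes related to ice but not steam ("solid"), much smaller than 1 for probes related to steam but not ice ("gas"), and close to 1 for probes related to both ("water") or to neither ("fashion").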
The ratio insight generalizes beyond ice/steam. Consider "cat" vs. "dog":
| Probe word k | P(k \| cat) | P(k \| dog) | Ratio | Interpretation |
|---|---|---|---|---|
| purr | 5.2 × 10⁻⁵ | 1.3 × 10⁻⁶ | 40 | Strongly cat-like |
| bark | 2.1 × 10⁻⁶ | 8.7 × 10⁻⁵ | 0.024 | Strongly dog-like |
| pet | 4.1 × 10⁻⁴ | 3.8 × 10⁻⁴ | 1.08 | Equally related to both |
| algorithm | 1.0 × 10⁻⁶ | 0.9 × 10⁻⁶ | 1.11 | Unrelated to both |
The ratio, read together with the size of the probabilities, separates four cases: (1) cat-specific words (ratio ≫ 1), (2) dog-specific words (ratio ≪ 1), (3) shared pet words (ratio ≈ 1, large probabilities), and (4) irrelevant words (ratio ≈ 1, tiny probabilities). The ratio alone is enough to pick out the discriminative words in cases (1) and (2), which raw probabilities cannot do.
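The four cases can be reproduced directly from the cat/dog table. The sketch below copies the table's probabilities; the two cut-offs (`ratio_cut`, `prob_cut`) are hypothetical values picked only so that these four rows land in the expected buckets.

```python
# Probabilities copied from the cat/dog table: probe -> (P(k|cat), P(k|dog)).
probs = {
    "purr": (5.2e-5, 1.3e-6),
    "bark": (2.1e-6, 8.7e-5),
    "pet": (4.1e-4, 3.8e-4),
    "algorithm": (1.0e-6, 0.9e-6),
}

def classify(p_cat, p_dog, ratio_cut=5.0, prob_cut=1e-4):
    """Hypothetical thresholds, chosen only to separate these four rows."""
    ratio = p_cat / p_dog
    if ratio > ratio_cut:
        return ratio, "cat-specific"
    if ratio < 1.0 / ratio_cut:
        return ratio, "dog-specific"
    # Ratio near 1: only the magnitude distinguishes shared from irrelevant.
    label = "shared" if min(p_cat, p_dog) > prob_cut else "irrelevant"
    return ratio, label

labels = {}
for probe, (p_cat, p_dog) in probs.items():
    ratio, labels[probe] = classify(p_cat, p_dog)
    print(f"{probe:10s} ratio = {ratio:7.3f} -> {labels[probe]}")
```

Note that the ratio by itself cannot tell "pet" from "algorithm"; the second test on the probability magnitude does that.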
If we tried to use raw P(k | ice) instead of ratios, we'd face a problem: the probabilities depend on the overall frequency of the target word. A very common word like "the" has large P(k | "the") for almost every k — not because "the" is semantically related to everything, but because it appears in so many contexts. The ratio P(k | ice) / P(k | steam) cancels out this frequency effect, isolating the semantic signal.
GloVe's design requirement: the word vectors should encode the co-occurrence probability ratios. Specifically, we want a function F such that:
$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$
where $w_i$, $w_j$ are target word vectors and $\tilde{w}_k$ is a context word vector. The function F takes three vectors and should produce the probability ratio. The next chapter derives what F must be.
The ratio P(k|ice)/P(k|steam) reveals which probe words discriminate between "ice" and "steam." Ratios far from 1 are discriminative; ratios near 1 are uninformative.
This is the mathematical heart of GloVe. We start from the ratio requirement and derive the training objective step by step.
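We want $F(w_i, w_j, \tilde{w}_k) = P_{ik}/P_{jk}$. Since the ratio captures how word i differs from word j with respect to context k, it should depend on the difference $w_i - w_j$:
$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$
F takes two vectors ($w_i - w_j$ and $\tilde{w}_k$) and produces a scalar (the ratio). The simplest way: use the dot product.
$$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$
The right side is a ratio: $P_{ik}/P_{jk}$. The left side has a difference: $(w_i - w_j)^\top \tilde{w}_k = w_i^\top \tilde{w}_k - w_j^\top \tilde{w}_k$. For F to convert a difference (additive) into a ratio (multiplicative), F must be an exponential:
$$F(x) = \exp(x)$$
So:
$$\exp\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{\exp(w_i^\top \tilde{w}_k)}{\exp(w_j^\top \tilde{w}_k)}$$
This gives us:
$$\frac{\exp(w_i^\top \tilde{w}_k)}{\exp(w_j^\top \tilde{w}_k)} = \frac{P_{ik}}{P_{jk}}$$
For the ratio to work out term-by-term (any constant λ > 0 cancels in the ratio):
$$\exp(w_i^\top \tilde{w}_k) = \lambda P_{ik} = \lambda \frac{X_{ik}}{X_i}$$
Taking the logarithm:
$$w_i^\top \tilde{w}_k = \log X_{ik} - \log X_i + \log \lambda$$
The term $\log X_i$ depends only on word i, not on context k. Absorb it (and $\log \lambda$) into a bias term $b_i$. For symmetry, add a context bias $\tilde{b}_k$:
$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$
Train by minimizing the squared error between the model's prediction and the actual log co-occurrence:
$$J = \sum_{i,j:\, X_{ij} > 0} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where $f(X_{ij})$ is a weighting function (Chapter 4). The sum is over all non-zero entries of X, typically around 1 billion entries for a large corpus.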
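Suppose word i = "ice" and word j = "cold" with $X_{ij} = 50$, and that the current parameters give a prediction below the target.
Prediction: $\hat{y} = w_{\text{ice}}^\top \tilde{w}_{\text{cold}} + b_{\text{ice}} + \tilde{b}_{\text{cold}}$
Target: $\log 50 \approx 3.912$
Error: $e = \hat{y} - 3.912$, which is negative in this scenario
Weight: $f(50) = (50/100)^{0.75} \approx 0.595$
Weighted loss: $f(50)\, e^2 \approx 0.595\, e^2$
Gradient for $w_{\text{ice}}$: $2 f(50)\, e\, \tilde{w}_{\text{cold}}$
Because the error is negative, the gradient is a negative multiple of $\tilde{w}_{\text{cold}}$, so a gradient-descent step pushes $w_{\text{ice}}$ in the direction of $\tilde{w}_{\text{cold}}$, increasing their dot product and bringing the prediction closer to log(50) ≈ 3.912. This is exactly what we want: "ice" and "cold" co-occur frequently, so their dot product should be large.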
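To make the single-entry update concrete, here is a minimal numeric sketch. Only $X_{ij} = 50$, $x_{max} = 100$ and α = 0.75 come from the text; the 3-dimensional vectors, the biases, and the learning rate are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical 3-d parameters; only X_ij = 50 comes from the text.
w_ice = np.array([0.5, -0.2, 0.8])
w_cold_ctx = np.array([0.6, 0.1, 0.9])
b_ice, b_cold_ctx = 0.3, 0.2
x_ij, x_max, alpha = 50.0, 100.0, 0.75

prediction = w_ice @ w_cold_ctx + b_ice + b_cold_ctx
target = np.log(x_ij)
error = prediction - target
weight = min((x_ij / x_max) ** alpha, 1.0)
loss = weight * error ** 2
grad_w_ice = 2.0 * weight * error * w_cold_ctx

# One plain gradient-descent step on w_ice (learning rate is an assumption).
lr = 0.05
w_ice_new = w_ice - lr * grad_w_ice
new_prediction = w_ice_new @ w_cold_ctx + b_ice + b_cold_ctx

print(f"prediction={prediction:.3f} target={target:.3f} error={error:.3f}")
print(f"weight={weight:.3f} loss={loss:.3f} new_prediction={new_prediction:.3f}")
```

With these made-up values the prediction starts below the target, the gradient points against the context vector, and one step moves the prediction toward log(50).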
Six steps from the ratio requirement to the final objective. The key: exponential converts additive (vector difference) to multiplicative (probability ratio).
```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, cooccur, f_weights):
    """
    W: (V, d) word vectors
    W_ctx: (V, d) context vectors
    b: (V,) word biases
    b_ctx: (V,) context biases
    cooccur: list of (i, j, X_ij) for the nonzero entries
    f_weights: list of f(X_ij), the precomputed weights
    Returns: total loss
    """
    total_loss = 0.0
    for (i, j, x_ij), f_x in zip(cooccur, f_weights):
        # Model prediction
        prediction = W[i] @ W_ctx[j] + b[i] + b_ctx[j]
        # Target
        target = np.log(x_ij)
        # Weighted squared error
        total_loss += f_x * (prediction - target) ** 2
    return total_loss
```
Let's pause and appreciate the elegance of this derivation. We started with a vague requirement ("vectors should capture probability ratios") and ended with a concrete equation. The key mathematical insight is the homomorphism property: we needed a function F that maps addition (in the argument) to multiplication (in the output), that is, F(a + b) = F(a) F(b). The only continuous, non-trivial functions with this property are exponentials of the form exp(cx), and the scale c can be absorbed into the vectors. Once you insist on F = exp, the entire model follows.
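The homomorphism property is easy to check numerically. The sketch below uses three arbitrary random vectors (the dimension and seed are assumptions with no special meaning) and confirms that exponentiating a difference of dot products equals a ratio of exponentials.

```python
import numpy as np

rng = np.random.default_rng(0)
w_i, w_j, w_k = rng.normal(size=(3, 5))  # three arbitrary 5-d vectors

# Additive in the argument ...
lhs = np.exp((w_i - w_j) @ w_k)
# ... multiplicative (a ratio) in the output.
rhs = np.exp(w_i @ w_k) / np.exp(w_j @ w_k)

print(lhs, rhs)
```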
This is why GloVe produces vectors with linear analogy structure. The exponential function converts additive relationships in the vector space into multiplicative relationships in probability space. Semantic relationships are multiplicative (ratios), so they must be additive in the log-space of word vectors. King − man + woman = queen works because "royalty" and "gender" are independent multiplicative factors in the co-occurrence statistics.
One subtlety in the derivation: the co-occurrence matrix X is symmetric ($X_{ij} = X_{ji}$ for symmetric windows), but the model equation treats $w_i$ and $\tilde{w}_j$ differently. GloVe resolves this by noting that the model should work equally well with roles swapped. This means the word vectors and context vectors should be interchangeable, which is why summing W + W̃ as the final representation makes sense. It explicitly restores the symmetry that the parameterization breaks.
The bias terms $b_i$ and $\tilde{b}_j$ absorb word-frequency information. They were introduced to stand in for the $\log X_i$ term, so after training they tend to track how frequent each word is.
This separates frequency information from semantic information. The word vector $w_i$ captures what a word means; the bias $b_i$ captures how often it appears. Without biases, the word vectors would need to encode both, and the frequency signal would contaminate the semantic structure.
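The raw objective $J = \sum_{i,j} (w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$ has a problem: it treats all co-occurrence counts equally. But a pair that co-occurs 10,000 times should matter more than a pair that co-occurs twice, and a pair that co-occurs 1,000,000 times shouldn't dominate the objective.
The solution is the weighting function $f(X_{ij})$:
$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
The paper sets $x_{\max} = 100$ and α = 3/4.
Three design requirements:
1. f(0) = 0, with f vanishing fast enough that $f(x) \log^2 x$ stays finite as x → 0, so zero entries contribute nothing.
2. f is non-decreasing, so rare co-occurrences are not overweighted.
3. f is relatively small for large x, so frequent co-occurrences are not overweighted.
The exponent α controls how rapidly the weight grows with co-occurrence count: α = 1 gives a linear ramp up to $x_{\max}$, while α < 1 makes the ramp concave, giving rare pairs relatively more weight. The paper reports that α = 3/4 gives a modest improvement over the linear version.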
The weighting function caps the influence of very frequent co-occurrences.
```python
import numpy as np

def f_weight(x, x_max=100, alpha=0.75):
    """GloVe weighting function."""
    if x >= x_max:
        return 1.0
    return (x / x_max) ** alpha

# Examples:
# f(1)     = (1/100)^0.75  = 0.032  (rare pair: very low weight)
# f(10)    = (10/100)^0.75 = 0.178  (moderate: meaningful weight)
# f(50)    = (50/100)^0.75 = 0.595  (common: substantial weight)
# f(100)   = 1.0                    (capped: maximum weight)
# f(10000) = 1.0                    (still capped)
```
Consider four word pairs with different co-occurrence counts:
| Word pair | $X_{ij}$ | $f(X_{ij})$ | $\log X_{ij}$ | Contribution to loss |
|---|---|---|---|---|
| "quantum" + "mechanics" | 3 | (3/100)^0.75 = 0.072 | 1.099 | Tiny weight: rare pair, may be noisy |
| "cat" + "animal" | 25 | (25/100)^0.75 = 0.354 | 3.219 | Moderate weight: reliable signal |
| "the" + "is" | 500 | 1.0 (capped) | 6.215 | Maximum weight but not overwhelming |
| "the" + "the" | 50,000 | 1.0 (capped) | 10.82 | Same weight as "the+is" despite 100x more frequent |
If pairs were weighted in proportion to their raw counts, the ("the", "the") pair would outweigh ("quantum", "mechanics") by roughly 16,700 to 1 (50,000/3). The weighting function compresses this range to approximately 14:1 (1.0/0.072). This is still a large range, so frequent pairs matter more, but it is manageable.
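The weights in the table and the resulting compression of the dynamic range can be recomputed in a few lines. The pair labels are just dictionary keys for the four rows above; the function mirrors the weighting formula with the paper's $x_{max} = 100$ and α = 0.75.

```python
def f_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: (x / x_max)^alpha below the cap, 1 at or above it."""
    return min((x / x_max) ** alpha, 1.0)

counts = {"quantum+mechanics": 3, "cat+animal": 25, "the+is": 500, "the+the": 50_000}
weights = {pair: f_weight(x) for pair, x in counts.items()}

# Range between the most and least frequent pair, before and after weighting.
raw_range = counts["the+the"] / counts["quantum+mechanics"]
weighted_range = weights["the+the"] / weights["quantum+mechanics"]

for pair, w in weights.items():
    print(f"{pair:20s} X = {counts[pair]:6d}  f(X) = {w:.3f}")
print(f"raw range {raw_range:.0f}:1, weighted range {weighted_range:.1f}:1")
```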
GloVe's training is simpler than Word2Vec's. There is no neural network, no backpropagation through hidden layers, no softmax. It is a weighted least-squares regression optimized by gradient descent.
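For each word i in the vocabulary, the model stores a word vector $w_i \in \mathbb{R}^d$, a context vector $\tilde{w}_i \in \mathbb{R}^d$, a word bias $b_i$, and a context bias $\tilde{b}_i$.
Total parameters: 2V(d + 1). For V = 400,000 and d = 300: about 240 million parameters.
For a single (i, j) entry, write the error as $e_{ij} = w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}$. The gradients of that entry's loss $f(X_{ij})\, e_{ij}^2$ are:
$$\frac{\partial J}{\partial w_i} = 2 f(X_{ij})\, e_{ij}\, \tilde{w}_j, \qquad \frac{\partial J}{\partial \tilde{w}_j} = 2 f(X_{ij})\, e_{ij}\, w_i, \qquad \frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial \tilde{b}_j} = 2 f(X_{ij})\, e_{ij}$$
These are simple: weighted error times the other vector (for word/context vectors) or weighted error times 1 (for biases). No chain rule through nonlinearities. The constant factor 2 is commonly folded into the learning rate.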
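A finite-difference check is a cheap way to confirm the "weighted error times the other vector" form of the gradient. This sketch assumes the per-entry loss is f(X_ij) times the squared error, keeps the factor 2 explicit, and uses random 4-dimensional vectors and arbitrary bias values purely for illustration.

```python
import numpy as np

def entry_loss(w_i, w_j_ctx, b_i, b_j_ctx, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error for one nonzero entry X_ij."""
    weight = min((x_ij / x_max) ** alpha, 1.0)
    error = w_i @ w_j_ctx + b_i + b_j_ctx - np.log(x_ij)
    return weight * error ** 2, weight, error

rng = np.random.default_rng(0)
w_i, w_j_ctx = rng.normal(size=(2, 4))
b_i, b_j_ctx, x_ij = 0.1, -0.2, 30.0

loss, weight, error = entry_loss(w_i, w_j_ctx, b_i, b_j_ctx, x_ij)
analytic_grad = 2.0 * weight * error * w_j_ctx

# Central differences, one coordinate of w_i at a time.
eps = 1e-6
numeric_grad = np.zeros_like(w_i)
for n in range(len(w_i)):
    step = np.zeros_like(w_i)
    step[n] = eps
    plus, _, _ = entry_loss(w_i + step, w_j_ctx, b_i, b_j_ctx, x_ij)
    minus, _, _ = entry_loss(w_i - step, w_j_ctx, b_i, b_j_ctx, x_ij)
    numeric_grad[n] = (plus - minus) / (2 * eps)

print(analytic_grad)
print(numeric_grad)
```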
GloVe uses two sets of vectors: word vectors W and context vectors W̃. The objective treats them symmetrically (since X is symmetric for symmetric windows). The paper reports that using the sum W + W̃ typically gives a small boost over using either alone. Intuitively, averaging two independent estimates of the same quantity reduces variance.
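GloVe uses AdaGrad (Adaptive Gradient) rather than standard SGD. AdaGrad maintains a per-parameter sum of squared gradients and divides the learning rate by the square root of this sum:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t}}\, g_t$$
where $G_t = \sum_{\tau=1}^{t} g_\tau^2$ is the accumulated squared gradient, $g_t$ is the current gradient, and η is the base learning rate (a small constant is usually added to the denominator for numerical stability).
AdaGrad suits GloVe because the updates are sparse: each (i, j) entry touches only two vectors and two biases, so the parameters of rare words are updated infrequently. Their accumulated $G_t$ stays small, which leaves them a larger effective learning rate, while frequently updated parameters are damped automatically.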
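The per-parameter behaviour of AdaGrad can be seen on a toy problem. In this sketch, one coordinate receives a gradient on every step and the other only on every tenth step; the constant gradients, the step count, and the helper name `adagrad_step` are all illustrative assumptions rather than anything specific to GloVe.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.05, eps=1e-8):
    """One AdaGrad update: scale each coordinate by 1/sqrt(sum of squared grads)."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# Coordinate 0 gets a gradient every step, coordinate 1 only every 10th step.
theta = np.zeros(2)
accum = np.zeros(2)
for t in range(100):
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, accum = adagrad_step(theta, grad, accum)

effective_lr = 0.05 / np.sqrt(accum)
print("accumulators:", accum)
print("effective learning rates:", effective_lr)
```

The rarely updated coordinate ends with the smaller accumulator and therefore the larger effective step size, which is the property that helps rare words.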
GloVe's training is embarrassingly parallel. Each (i, j, Xij) entry produces an independent gradient. Multiple threads can process different entries simultaneously with minimal synchronization (just atomic updates to shared vectors). The original GloVe implementation achieved near-linear speedup with up to 32 threads.
Word2Vec also parallelizes well (using Hogwild-style asynchronous SGD on the text stream), but GloVe's independence structure is cleaner. This contributed to GloVe's slightly faster wall-clock training times on identical hardware.
A complete GloVe training run on 6B tokens:
| Stage | Time | Output |
|---|---|---|
| 1. Vocabulary construction | ~30 min | 400K words above min-count threshold |
| 2. Corpus scanning | ~2 hours | ~1B nonzero (i, j, X_ij) entries |
| 3. Shuffle entries | ~20 min | Randomized training order |
| 4. Train 50 iterations | ~3 hours (8 threads) | 400K × 300 word vectors |
| 5. Combine W + W_ctx | Seconds | Final vectors |
| Total | ~6 hours | glove.6B.300d.txt (1.04 GB) |
The co-occurrence matrix construction (stage 2) is the most memory-intensive step, requiring ~10 GB of RAM to store the sparse matrix. The training itself (stage 4) is CPU-intensive but memory-light — it only needs the vectors, biases, and AdaGrad accumulators in memory.
GloVe converges smoothly because the objective is a weighted least-squares problem — convex in each variable when others are fixed (biconvex overall). The loss curve is typically monotonically decreasing with occasional plateaus. Early iterations reduce loss rapidly; later iterations fine-tune the geometry. The paper found that 50-100 iterations suffice for most corpora, with diminishing returns beyond 100.
```python
# Monitoring GloVe training convergence
import numpy as np

def evaluate_glove(vectors, analogy_test):
    """Evaluate word vectors on analogy task during training."""
    correct = 0
    total = 0
    for a, b, c, expected in analogy_test:
        if any(w not in vectors for w in [a, b, c, expected]):
            continue
        target = vectors[b] - vectors[a] + vectors[c]
        target /= np.linalg.norm(target) + 1e-10
        best_word, best_sim = None, -1
        for word, vec in vectors.items():
            if word in {a, b, c}:
                continue  # exclude the query words themselves
            sim = np.dot(target, vec / (np.linalg.norm(vec) + 1e-10))
            if sim > best_sim:
                best_sim, best_word = sim, word
        if best_word == expected:
            correct += 1
        total += 1
    return correct / total if total > 0 else 0

# Typical convergence: accuracy vs. iteration
# Iter 1:   ~30% accuracy (random-ish vectors)
# Iter 10:  ~55% accuracy (structure emerging)
# Iter 25:  ~67% accuracy (most structure captured)
# Iter 50:  ~71% accuracy (fine-tuning)
# Iter 100: ~72% accuracy (diminishing returns)
```
On a tiny toy corpus, gradient descent steps steadily reduce the weighted least-squares loss.
```python
import numpy as np

def train_glove(cooccur, V, d=50, epochs=100, lr=0.05, x_max=100, alpha=0.75):
    """
    cooccur: list of (i, j, X_ij) for the nonzero entries
    Returns: word vectors W + W_ctx (sum of both)
    """
    # Initialize
    W = (np.random.rand(V, d) - 0.5) / d
    W_ctx = (np.random.rand(V, d) - 0.5) / d
    b = np.zeros(V)
    b_ctx = np.zeros(V)

    # AdaGrad accumulators
    W_sum = np.ones((V, d))  # init to 1 to avoid division by zero
    W_ctx_sum = np.ones((V, d))
    b_sum = np.ones(V)
    b_ctx_sum = np.ones(V)

    for epoch in range(epochs):
        total_loss = 0.0
        np.random.shuffle(cooccur)
        for i, j, x_ij in cooccur:
            # Weighting
            fw = min((x_ij / x_max) ** alpha, 1.0)
            # Error
            diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x_ij)
            loss = fw * diff ** 2
            total_loss += loss

            # Gradients (the constant factor 2 is folded into lr)
            grad_common = fw * diff
            grad_w = grad_common * W_ctx[j]
            grad_wc = grad_common * W[i]

            # AdaGrad update
            W_sum[i] += grad_w ** 2
            W[i] -= lr * grad_w / np.sqrt(W_sum[i])
            W_ctx_sum[j] += grad_wc ** 2
            W_ctx[j] -= lr * grad_wc / np.sqrt(W_ctx_sum[j])
            b_sum[i] += grad_common ** 2
            b[i] -= lr * grad_common / np.sqrt(b_sum[i])
            b_ctx_sum[j] += grad_common ** 2
            b_ctx[j] -= lr * grad_common / np.sqrt(b_ctx_sum[j])

    return W + W_ctx  # sum of word and context vectors
```
```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class GloVeDataset(Dataset):
    def __init__(self, cooccur_data, x_max=100, alpha=0.75):
        """cooccur_data: list of (i, j, X_ij) tuples"""
        self.data = cooccur_data
        self.x_max = x_max
        self.alpha = alpha

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        i, j, x = self.data[idx]
        # Compute weight f(X_ij)
        weight = (x / self.x_max) ** self.alpha if x < self.x_max else 1.0
        return (torch.tensor(i), torch.tensor(j),
                torch.tensor(x, dtype=torch.float32),
                torch.tensor(weight, dtype=torch.float32))

class GloVe(nn.Module):
    def __init__(self, V, d=300):
        super().__init__()
        self.W = nn.Embedding(V, d)      # word vectors
        self.W_ctx = nn.Embedding(V, d)  # context vectors
        self.b = nn.Embedding(V, 1)      # word biases
        self.b_ctx = nn.Embedding(V, 1)  # context biases
        # Initialize
        for param in self.parameters():
            nn.init.uniform_(param, -0.5 / d, 0.5 / d)

    def forward(self, i, j, x, weight):
        """
        i, j: (batch,) word and context indices
        x: (batch,) co-occurrence counts
        weight: (batch,) f(X_ij) weights
        """
        w_i = self.W(i)                   # (batch, d)
        w_j = self.W_ctx(j)               # (batch, d)
        b_i = self.b(i).squeeze(-1)       # (batch,), safe even for batch size 1
        b_j = self.b_ctx(j).squeeze(-1)
        # Prediction: w_i . w_j + b_i + b_j
        pred = (w_i * w_j).sum(dim=1) + b_i + b_j
        # Target: log(X_ij)
        target = torch.log(x)
        # Weighted least squares loss
        loss = weight * (pred - target) ** 2
        return loss.mean()

    def get_vectors(self):
        """Return W + W_ctx as final word vectors."""
        return (self.W.weight + self.W_ctx.weight).detach()

# Training
# cooccur_data is assumed to be the list of (i, j, X_ij) tuples built earlier;
# the batch size is an arbitrary choice.
dataloader = DataLoader(GloVeDataset(cooccur_data), batch_size=512, shuffle=True)
model = GloVe(V=400000, d=300)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

for epoch in range(100):
    for i, j, x, w in dataloader:
        loss = model(i, j, x, w)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

vectors = model.get_vectors()  # (400000, 300)
```
GloVe's paper made a provocative claim: Word2Vec (Skip-gram with negative sampling) is implicitly factorizing a co-occurrence matrix. The two methods are more similar than they appear.
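Levy and Goldberg (2014) showed that Skip-gram with negative sampling, at the optimum of its objective, satisfies:
$$w_i^\top \tilde{w}_j = \mathrm{PMI}(i, j) - \log k$$
where k is the number of negative samples and PMI is the pointwise mutual information:
$$\mathrm{PMI}(i, j) = \log \frac{P(i, j)}{P(i)\, P(j)} = \log \frac{X_{ij}\, |X|}{X_i\, X_j}, \qquad |X| = \sum_{i,j} X_{ij}$$
And GloVe's model says:
$$w_i^\top \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}$$
These are related! If GloVe's biases absorb the $\log X_i$ and $\log X_j$ terms:
$$b_i = \log X_i, \qquad \tilde{b}_j = \log X_j$$
Then:
$$w_i^\top \tilde{w}_j = \log X_{ij} - \log X_i - \log X_j = \mathrm{PMI}(i, j) - \log |X|$$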
Both models learn dot products that approximate PMI (or log co-occurrence). The difference is in how they optimize:
| Property | Word2Vec (Skip-gram NEG) | GloVe |
|---|---|---|
| Training data | Raw text (online, streaming) | Co-occurrence matrix (precomputed) |
| Objective | Binary classification (pos vs neg) | Weighted least squares on log X |
| Implicit target | PMI − log k | log X_ij (with biases absorbing marginals) |
| Zero entries | Handled via negative sampling | Skipped (f(0) = 0) |
| Weighting | Sampling frequency (noise dist) | Explicit f(X_ij) function |
| Optimizer | SGD on streaming text | AdaGrad on shuffled matrix entries |
Despite the mathematical similarity, some practical differences remain:
1. How they handle zero co-occurrences: This is perhaps the most important distinction. GloVe skips all entries where Xij = 0 (since log(0) is undefined). Word2Vec handles zeros implicitly through negative sampling — every negative sample is drawn from the noise distribution, which naturally represents the "background" of non-co-occurring words. GloVe never explicitly learns that two words don't co-occur; Word2Vec does, through its negative samples.
2. Online vs. batch: Word2Vec processes the corpus as a stream. It can train on corpora that don't fit in memory. GloVe must first compute the full co-occurrence matrix, which requires a pass over the corpus and ~10 GB of storage for large vocabularies. For very large, streaming datasets, Word2Vec is more practical.
3. Weighting scheme: GloVe's weighting function f(Xij) explicitly caps the influence of very frequent pairs. Word2Vec's analog is the subsampling of frequent words and the noise distribution exponent (3/4 power). Both achieve similar effects but through different mechanisms.
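Consider word i = "cat" and word j = "fluffy" in a corpus with co-occurrence count $X_{ij}$, row totals $X_i$ and $X_j$, and total count $|X| = \sum_{i,j} X_{ij}$.
GloVe's target: $\log X_{ij}$, with the biases absorbing the marginal terms.
Skip-gram NEG's implicit target (Levy & Goldberg): $\mathrm{PMI}(i, j) - \log k = \log \frac{X_{ij}\, |X|}{X_i\, X_j} - \log k$, where k is the number of negative samples.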
Both methods produce dot products in the same ballpark (~4-6 for this moderately co-occurring pair). The difference is absorbed by GloVe's biases and the constant shifts.
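The two targets can be compared numerically. Every count in this sketch is hypothetical (invented for a moderately co-occurring pair, not measured on any corpus), as is the choice of k = 5 negative samples.

```python
import math

# Hypothetical counts for ("cat", "fluffy"); none of these come from a real corpus.
x_ij = 200.0      # co-occurrence count X_ij
x_i = 50_000.0    # row total X_i for "cat"
x_j = 8_000.0     # row total X_j for "fluffy"
x_total = 1e9     # sum of all entries of X
k = 5             # number of negative samples

glove_target = math.log(x_ij)
pmi = math.log(x_ij * x_total / (x_i * x_j))
sgns_target = pmi - math.log(k)

# If GloVe's biases equal log X_i and log X_j, its dot product is PMI - log|X|.
glove_dot_if_biases_absorb = glove_target - math.log(x_i) - math.log(x_j)

print(f"GloVe target log X_ij     = {glove_target:.3f}")
print(f"SGNS target PMI - log k   = {sgns_target:.3f}")
print(f"PMI - log|X|              = {pmi - math.log(x_total):.3f}")
print(f"log X_ij - log X_i - log X_j = {glove_dot_if_biases_absorb:.3f}")
```

The last two printed values coincide, which is the algebraic identity behind the claim that both models factorize a shifted PMI-like matrix.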
Both methods produce vectors in the same region of vector space. The scatter plot shows GloVe vs. Skip-gram dot products for word pairs — they correlate strongly.
GloVe was evaluated on three tasks: word analogies, word similarity, and named entity recognition (NER).
Using the standard analogy test set (semantic + syntactic, ~19K questions):
| Model | Dim | Training Data | Accuracy % |
|---|---|---|---|
| SVD (on X) | 300 | 6B | 36.7 |
| SVD (on log X) | 300 | 6B | 54.6 |
| Word2Vec (SG) | 1000 | 6B | 65.6 |
| Word2Vec (CBOW) | 1000 | 6B | 63.6 |
| GloVe | 300 | 6B | 71.7 |
| GloVe | 300 | 42B | 75.0 |
GloVe at 300 dimensions on 6B words outperforms Word2Vec on the same data by 6 points. On 42B words (Common Crawl), it reaches 75% — a large improvement from the 65.6% of Word2Vec on 6B.
Spearman correlation between model's cosine similarities and human similarity judgments on standard datasets (WordSim-353, etc.). GloVe achieved 0.769 on WordSim-353, competitive with the best Word2Vec models.
Using word vectors as features for a CRF-based NER system on CoNLL-2003:
| Features | F1 Score |
|---|---|
| Discrete features only | 88.4 |
| + SVD vectors (d=50) | 89.3 |
| + Word2Vec (d=50) | 90.1 |
| + GloVe (d=50) | 90.5 |
| + GloVe (d=300) | 91.2 |
GloVe vectors as additional features improved NER F1 from 88.4 to 91.2 — a substantial gain from unsupervised pretraining.
The paper studied how performance scales with corpus size, vector dimension, window size, and training time:
Comparison across methods and training data sizes. GloVe consistently outperforms Word2Vec on the same data.
The paper's experiments on Wikipedia+Gigaword 6B showed a clear pattern:
| Dimension d | Analogy accuracy % | Training time (relative) |
|---|---|---|
| 50 | 54.3 | 0.2x |
| 100 | 64.0 | 0.4x |
| 200 | 69.1 | 0.7x |
| 300 | 71.7 | 1.0x |
| 400 | 72.3 | 1.3x |
| 500 | 72.5 | 1.7x |
Going from 50 to 300 dimensions adds 17 points. Going from 300 to 500 adds only 0.8 points. The "sweet spot" is d = 300 — which is why most pre-trained word vectors use this dimension.
An interesting finding: window size affects what the vectors learn. Small windows favor syntactic similarity, while larger windows favor semantic (topical) similarity.
This makes sense: a word's immediate neighbors are syntactically constrained (adjectives before nouns, determiners before adjectives), while distant words in the same sentence are thematically related. GloVe's default c = 10 favors semantic similarity, which is more useful for most downstream tasks.
The paper tested GloVe on multiple corpora to study the effect of data quality and quantity:
| Corpus | Tokens | Analogy Accuracy % | Notes |
|---|---|---|---|
| Wikipedia 2014 | 1.6B | 64.7 | Clean, encyclopedic text |
| Wikipedia + Gigaword 5 | 6B | 71.7 | Clean + newswire |
| Common Crawl (42B) | 42B | 75.0 | Noisy but massive |
| Common Crawl (840B) | 840B | — | Even noisier; used for released vectors |
Two observations: (1) In these runs, more data helped, even when the additional data was noisy web text. (2) Clean data appears more efficient per token: Wikipedia's 1.6B tokens already reach 64.7%, and the far larger, noisier 42B-token crawl adds only about 10 points on top of that. Quality matters, but quantity can compensate.
While GloVe produces word-level vectors, they can be composed into sentence or document representations:
```python
import numpy as np

# Simple document vector: average word vectors
def doc_vector(words, glove, dim=300):
    vecs = [glove[w] for w in words if w in glove]
    if len(vecs) == 0:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# TF-IDF weighted average (better for information retrieval)
def weighted_doc_vector(words, glove, idf_weights, dim=300):
    vecs, weights = [], []
    for w in words:
        if w in glove and w in idf_weights:
            vecs.append(glove[w])
            weights.append(idf_weights[w])
    if len(vecs) == 0:
        return np.zeros(dim)
    return np.average(vecs, axis=0, weights=weights)
```
Simple averaging works surprisingly well for short texts (tweets, sentences). For longer documents, TF-IDF weighting or SIF (Smooth Inverse Frequency) weighting by Arora et al. (2017) gives significant improvements by down-weighting common words.
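The SIF weighting mentioned above down-weights each word by a / (a + p(w)), where p(w) is the word's unigram probability. The sketch below covers only that weighted average; the full method also subtracts a common component from the sentence vectors, which is omitted here. The value a = 1e-3, the toy vectors, and the probabilities are assumptions for illustration.

```python
import numpy as np

def sif_weights(words, word_prob, a=1e-3):
    """SIF weight a / (a + p(w)) per word; a = 1e-3 is an assumed default."""
    return np.array([a / (a + word_prob[w]) for w in words])

def sif_doc_vector(words, vectors, word_prob, dim, a=1e-3):
    """Weighted average only; the full method also removes a common component."""
    known = [w for w in words if w in vectors and w in word_prob]
    if not known:
        return np.zeros(dim)
    weights = sif_weights(known, word_prob, a)
    stacked = np.stack([vectors[w] for w in known])
    return (weights[:, None] * stacked).sum(axis=0) / len(known)

# Toy 2-d vectors and unigram probabilities (made up for illustration).
vectors = {"the": np.array([1.0, 0.0]), "glacier": np.array([0.0, 1.0])}
word_prob = {"the": 0.05, "glacier": 1e-5}
doc = sif_doc_vector(["the", "glacier"], vectors, word_prob, dim=2)
print(doc)
```

In this toy case the frequent word "the" contributes almost nothing, so the document vector points almost entirely along the rare word's direction.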
Having studied both methods in detail, here is a comprehensive comparison:
| Aspect | Word2Vec (Skip-gram NEG) | GloVe |
|---|---|---|
| Objective | Binary classification (pos vs neg) | Weighted least squares on log X |
| Data format | Raw text corpus (streaming) | Co-occurrence matrix (precomputed) |
| Zero co-occurrences | Handled via negative sampling | Skipped (f(0) = 0) |
| Distance weighting | Implicit (window size sampled per center word, so nearer words are used more often) | 1/d harmonic weighting |
| Frequency handling | Subsampling + noise dist f^0.75 | Weighting function f(X) = min((X/100)^0.75, 1) |
| Optimizer | SGD with linear LR decay | AdaGrad |
| Parallelism | Hogwild (async, lock-free) | Embarrassingly parallel (independent entries) |
| Memory | O(V · d) — just embeddings | O(V · d + nnz) — embeddings + sparse matrix |
| Final vectors | W_in (input embeddings) | W + W_ctx (sum of both) |
| Best accuracy (6B) | 66% (d=1000) | 72% (d=300) |
| Best accuracy (42B) | ~70% (estimated) | 75% (d=300) |
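In practice, the difference between GloVe and Word2Vec is small on most downstream tasks. The choice often depends on practical constraints: whether the corpus has to be processed as a stream (Word2Vec) or a co-occurrence matrix can be built and stored (GloVe), and whether suitable pre-trained vectors are already available.
Both GloVe and Word2Vec share fundamental limitations that contextual embeddings (ELMo, BERT) would later address. Most importantly, each word gets a single static vector regardless of context, so the different senses of a polysemous word are merged, and words outside the training vocabulary get no vector at all.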
These limitations motivated the move to contextualized representations, where each token gets a different vector depending on the full input sequence. But GloVe's core insight — that ratios of co-occurrence probabilities encode meaning — remains foundational to understanding how all embedding methods work.
GloVe's ideas have been absorbed into modern deep learning in subtle ways:
In modern NLP, the choice is usually straightforward:
| Situation | Recommendation | Why |
|---|---|---|
| Full NLP pipeline | Use a transformer (BERT, LLaMA) | Contextual embeddings dominate on every benchmark |
| Simple baseline / prototype | Pre-trained GloVe 300d | Free, fast, no GPU needed, surprisingly competitive |
| Domain-specific (e.g., biomedical) | Train Word2Vec on domain corpus | Easy to train, captures domain terminology |
| Need OOV handling | FastText | Character n-grams handle any word |
| Multilingual | Multilingual BERT or FastText | Cross-lingual transfer |
| Edge deployment (no GPU) | Pre-trained GloVe 50d or 100d | Tiny memory footprint, CPU-friendly |
GloVe and Word2Vec remain relevant as baselines, for education, and for resource-constrained settings. Their simplicity (just a lookup table of vectors) makes them deployable anywhere, including embedded devices, browsers, and mobile apps. A 50d GloVe model with a 400K vocabulary fits in 80 MB as float32; BERT-base, with about 110M parameters, needs roughly 440 MB. For many applications, the simpler model is good enough.
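The footprint of a static embedding table is just vocabulary size times dimension times bytes per value. A back-of-the-envelope sketch, assuming the 400K vocabulary of the 6B release and float32 storage (the helper name is illustrative):

```python
def embedding_megabytes(vocab_size, dim, bytes_per_value=4):
    """Size of a dense float32 lookup table, in MB (10^6 bytes)."""
    return vocab_size * dim * bytes_per_value / 1e6

size_50d = embedding_megabytes(400_000, 50)
size_300d = embedding_megabytes(400_000, 300)
print(f"50d: {size_50d:.0f} MB, 300d: {size_300d:.0f} MB")
```

Text-format vector files are larger than this because every number is stored as a decimal string.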
The analogy task (king − man + woman = queen) became the standard evaluation, but it has limitations. The paper also evaluated on several other tasks:
Word similarity: Compute cosine similarity between word pairs and compare to human judgments using Spearman rank correlation. Datasets include WordSim-353, SimLex-999, and MEN. GloVe achieves 0.77 on WordSim-353 (human agreement is ~0.75).
Extrinsic evaluation: Use word vectors as features for a downstream task and measure task performance. The paper chose NER because it is well-understood and has standard benchmarks. The +2.8 F1 improvement from GloVe vectors (88.4 → 91.2) demonstrated that intrinsic quality (analogy accuracy) translates to extrinsic performance.
```python
# Evaluating word vectors on similarity task
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, test_pairs):
    """
    test_pairs: list of (word1, word2, human_score) tuples
    Returns: Spearman correlation with human judgments
    """
    model_sims, human_sims = [], []
    for w1, w2, human in test_pairs:
        if w1 not in vectors or w2 not in vectors:
            continue
        v1, v2 = vectors[w1], vectors[w2]
        cos_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_sims.append(cos_sim)
        human_sims.append(human)
    rho, p_value = spearmanr(model_sims, human_sims)
    return rho

# Typical results:
# GloVe 300d on WordSim-353: rho = 0.769
# Word2Vec 300d on WordSim-353: rho = 0.720
# Human agreement: rho ≈ 0.750
```
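Building and examining the co-occurrence matrix reveals several properties of natural language: co-occurrence counts follow a power-law distribution, they span a vast range of magnitudes, and word frequencies themselves vary enormously.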
These statistical properties explain why GloVe's design choices work: the weighting function handles the power-law distribution, the log transform handles the vast range of counts, and the biases handle the variation in word frequencies. Every design decision in GloVe is motivated by the empirical statistics of natural language co-occurrence.
Beyond the practical algorithm, GloVe made an important theoretical contribution to understanding word vectors. The paper argued that the right way to think about word meaning is through ratios of co-occurrence probabilities, not through raw probabilities or raw counts. This perspective has three consequences:
The paper's equation $F(w_i - w_j, \tilde{w}_k) = P_{ik}/P_{jk}$ is the core of this argument. It answers a question that Word2Vec left open: why do word vectors have linear structure? GloVe's answer: because the underlying semantic signal (co-occurrence ratios) is inherently log-linear, and the training objective preserves this structure.
GloVe became one of the most widely used word embedding methods. Here are the practical details for training and using GloVe vectors.
| Hyperparameter | Default | Effect |
|---|---|---|
| Dimension d | 300 | Higher is better up to ~300, then diminishing returns |
| Window size c | 10 | Larger for semantic tasks, smaller for syntactic |
| $x_{\max}$ | 100 | Cap for weighting function; higher = more weight to frequent pairs |
| α | 0.75 | Sublinear weighting exponent; 0.75 works well universally |
| Learning rate | 0.05 | With AdaGrad, which adapts per-parameter |
| Epochs | 50-100 | More passes help on smaller corpora; very large corpora may need fewer |
| Min count | 5 | Words appearing fewer times are discarded |
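To make the xmax and α rows concrete, the following sketch evaluates the weighting function f(x) = (x / xmax)^α, capped at 1, for a few counts. The function name `glove_weight` and the chosen counts are assumptions for illustration.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha below x_max, and 1 at or above it."""
    x = np.asarray(x, dtype=np.float64)
    return np.minimum((x / x_max) ** alpha, 1.0)

counts = np.array([1, 10, 100, 1000])
for x_max, alpha in [(100, 0.75), (1000, 0.75), (100, 1.0)]:
    weights = glove_weight(counts, x_max, alpha)
    print(f"x_max={x_max:<5} alpha={alpha:<5}", np.round(weights, 3))
```

Raising xmax lowers the weight of every pair below the cap, so relative emphasis shifts toward frequent pairs. Raising α toward 1 makes the penalty on rare pairs steeper.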
Stanford released pre-trained GloVe vectors that became the standard initialization for NLP models from 2014-2018:
| Dataset | Tokens | Vocab | Dims available |
|---|---|---|---|
| Wikipedia + Gigaword 5 | 6B | 400K | 50, 100, 200, 300 |
| Common Crawl (uncased) | 42B | 1.9M | 300 |
| Common Crawl (cased) | 840B | 2.2M | 300 |
| Twitter (2B tweets, uncased) | 27B | 1.2M | 25, 50, 100, 200 |
```python
import numpy as np

def load_glove(filepath, dim=300):
    """Load pre-trained GloVe vectors from a text file."""
    word2vec = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            # Split from the right so tokens that contain spaces
            # (present in some Common Crawl files) stay intact.
            parts = line.rstrip().rsplit(' ', dim)
            word = parts[0]
            vec = np.array([float(x) for x in parts[1:]], dtype=np.float32)
            word2vec[word] = vec
    return word2vec

# Usage
glove = load_glove('glove.6B.300d.txt')
print(glove['king'].shape)  # (300,)

# Cosine similarity
def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Exact values depend on which GloVe file is loaded
print(cosine_sim(glove['king'], glove['queen']))   # high: related words
print(cosine_sim(glove['king'], glove['banana']))  # low: unrelated words

# As PyTorch embedding initialization
import torch
import torch.nn as nn

# `vocab` maps word -> row index and comes from your own dataset;
# the toy mapping here is a stand-in.
vocab = {'<pad>': 0, 'king': 1, 'queen': 2}
V = len(vocab)
embedding = nn.Embedding(V, 300)
with torch.no_grad():
    for word, idx in vocab.items():
        if word in glove:
            embedding.weight[idx] = torch.from_numpy(glove[word])
# Freeze or fine-tune depending on task and dataset size
```
When using pre-trained GloVe vectors as initialization for a downstream task, you have two options:
| Strategy | When to use | Pros | Cons |
|---|---|---|---|
| Freeze | Small dataset (<10K examples) | Prevents overfitting; preserves pretrained quality | Cannot adapt to domain-specific usage |
| Fine-tune | Large dataset (>100K examples) | Adapts vectors to your specific task | May overfit on small data; loses general knowledge |
A common compromise: start with frozen embeddings for a few epochs (letting the rest of the model warm up), then unfreeze and fine-tune with a small learning rate.
```python
# Common pattern: freeze, then fine-tune
import torch
import torch.nn as nn

# Assumed to exist: V, vocab, glove (as loaded above), a `model` that uses
# `embedding`, and a user-defined train_epoch(model, optimizer).
embedding = nn.Embedding(V, 300)

# Load GloVe vectors
with torch.no_grad():
    for word, idx in vocab.items():
        if word in glove:
            embedding.weight[idx] = torch.as_tensor(glove[word], dtype=torch.float32)

# Phase 1: freeze embeddings, train classifier layers
embedding.weight.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
for epoch in range(5):
    train_epoch(model, optimizer)  # only classifier weights update

# Phase 2: unfreeze, fine-tune everything with a lower LR
embedding.weight.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # 10x lower LR
for epoch in range(10):
    train_epoch(model, optimizer)  # all weights update, including embeddings
```
GloVe vectors are distributed as plain text files — one word per line, space-separated values. This is simpler than Word2Vec's binary format (which uses struct packing) but larger on disk.
```text
# Format of glove.6B.300d.txt:
# Each line: word val1 val2 ... val300
the 0.418 0.249 -0.412 0.122 ... (300 floats)
, 0.013 0.189 -0.350 0.076 ...
. 0.152 0.304 -0.134 0.014 ...
of 0.370 0.210 -0.310 0.328 ...
# ... 400,000 lines total
# File size: 1.04 GB for 300d, 347 MB for 100d
```
GloVe has no mechanism for words not in the pre-trained vocabulary. Common strategies for handling OOV words:
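Two widely used fallbacks can be sketched as follows: map unknown words to the mean of all known vectors, or draw a small random vector per unseen word. The helper name `build_lookup`, the lowercasing step, and the toy vocabulary are assumptions for illustration, not part of GloVe itself.

```python
import numpy as np

def build_lookup(vectors, strategy="mean", seed=0):
    """Return a function word -> vector with a fallback for OOV words.

    vectors: dict mapping word -> 1-D numpy array (all the same length).
    strategy: "mean" uses the average of all known vectors;
              "random" draws a small Gaussian vector once per unseen word.
    """
    dim = len(next(iter(vectors.values())))
    mean_vec = np.mean(list(vectors.values()), axis=0)
    rng = np.random.default_rng(seed)
    cache = {}

    def lookup(word):
        if word in vectors:
            return vectors[word]
        lowered = word.lower()
        if lowered in vectors:           # cheap normalization before giving up
            return vectors[lowered]
        if strategy == "mean":
            return mean_vec
        if word not in cache:            # keep the random vector stable per word
            cache[word] = rng.normal(scale=0.1, size=dim)
        return cache[word]

    return lookup

toy = {"king": np.array([1.0, 0.0]), "queen": np.array([0.0, 1.0])}
lookup = build_lookup(toy)
print(lookup("King"))      # found after lowercasing
print(lookup("zyzzyva"))   # falls back to the mean vector
```

The mean vector is a neutral choice when embeddings stay frozen; random initialization makes more sense when the embeddings will be fine-tuned, since each unseen word can then learn its own vector.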
Loading GloVe vectors requires substantial memory:
For deployment in memory-constrained environments, techniques like quantization (float16 or int8) or dimensionality reduction (PCA from 300d to 100d) are common.
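The raw matrix size is simple arithmetic: vocabulary size times dimension times bytes per value, which comes to about 480 MB for 400K words at 300 dimensions in float32 (a Python dict of arrays adds overhead on top). The sketch below computes that figure and shows a float16 cast and a PCA projection; `embedding_bytes`, `pca_reduce`, and the random stand-in matrix are assumptions for illustration.

```python
import numpy as np

def embedding_bytes(vocab_size, dim, dtype=np.float32):
    """Raw size of a dense (vocab_size, dim) matrix, ignoring container overhead."""
    return vocab_size * dim * np.dtype(dtype).itemsize

print(embedding_bytes(400_000, 300) / 1e6, "MB as float32")
print(embedding_bytes(400_000, 300, np.float16) / 1e6, "MB as float16")

def pca_reduce(matrix, k):
    """Project the rows of `matrix` onto their top-k principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
demo = rng.normal(size=(1000, 300)).astype(np.float32)   # stand-in for real vectors
small = pca_reduce(demo, 100).astype(np.float16)
print(small.shape, small.dtype)
```

Both steps are lossy, so it is worth re-running a similarity or analogy evaluation on the compressed vectors before deploying them.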
Latent Semantic Analysis (Deerwester et al., 1990): SVD on a term-document matrix. GloVe improved on LSA by using a word-word matrix instead of word-document, log co-occurrence instead of raw counts, and a weighted objective instead of unweighted SVD.
Word2Vec (Mikolov et al., 2013): The direct predecessor and competitor. GloVe explicitly positions itself as a synthesis of Word2Vec's strengths (local context prediction, linear structure) and LSA's strengths (global statistics). See our Word2Vec veanor and negative sampling veanor.
HAL (Hyperspace Analogue to Language, Lund & Burgess, 1996): An early word-word co-occurrence approach. GloVe's matrix is similar but uses harmonic weighting and a principled objective.
A natural question: if GloVe trains on the co-occurrence matrix X, why not just do SVD directly on X (or log X)? The paper tested this and found several reasons GloVe wins:
```python
# Comparison: SVD on log(X) vs. GloVe
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# X: sparse (V, V) co-occurrence matrix, assumed to be built already.
# SVD approach: log(1 + X) handles zeros and keeps the matrix sparse
# (densifying a V x V matrix is infeasible for realistic vocabularies).
X_log = csr_matrix(X, dtype=np.float64).log1p()   # (V, V)
U, S, Vt = svds(X_log, k=300)                      # truncated SVD
svd_vectors = U * np.sqrt(S)                       # (V, 300)

# Approximate analogy accuracy (depends on corpus and settings):
# SVD:   ~55%
# GloVe: ~72%
# Why the gap? Weighting, biases, and skip-zero optimization
```
The embedding revolution (2014-2018): GloVe's pre-trained 300d vectors became the default initialization for nearly every NLP model: sentiment analysis, NER, relation extraction, question answering. The idea of "transfer learning from unsupervised pretraining" was demonstrated in practice by GloVe and Word2Vec before it was scaled up by ELMo and BERT.
Levy & Goldberg (2014): Showed that Word2Vec's skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, which places it in the same family as GloVe's factorization of log co-occurrence counts. This narrowed the count-versus-predict debate and led to a deeper understanding of what makes word embeddings work.
FastText (2017), ELMo (2018), BERT (2018): Each extended the word embedding paradigm in its own direction. FastText added subwords. ELMo added context-dependence. BERT added bidirectional context with transformers. All stood on the foundation that GloVe and Word2Vec established.
| Era | Method | Representation | Key property |
|---|---|---|---|
| 1990s | LSA (SVD on counts) | Static, count-based | Global statistics, poor analogies |
| 2013 | Word2Vec | Static, prediction-based | Linear analogies, local context |
| 2014 | GloVe | Static, hybrid | Global + local, best analogies |
| 2017 | FastText | Static, subword-based | OOV handling, morphology |
| 2018 | ELMo | Contextual (LSTM) | Different vectors per context |
| 2018 | BERT | Contextual (Transformer) | Bidirectional, fine-tunable |
| 2020+ | GPT-3/4, LLaMA | Contextual (large Transformer) | In-context learning, emergent abilities |
Each step built on the previous. GloVe showed that the right objective matters more than the algorithm. BERT showed that context matters more than static vectors. GPT-4 showed that scale matters more than architecture. But the foundational insight — that useful representations emerge from prediction tasks on raw text — traces directly back to Word2Vec and GloVe.
GloVe's weighted least-squares objective influenced later work in surprising ways:
```bash
# The official GloVe training pipeline (Stanford release)
# Download from https://nlp.stanford.edu/projects/glove/

# Step 1: Build vocabulary
./vocab_count -min-count 5 < corpus.txt > vocab.txt
# Output: 400,000 words above threshold

# Step 2: Build co-occurrence matrix
./cooccur -window-size 10 -vocab-file vocab.txt < corpus.txt > cooccur.bin
# Output: ~1B nonzero entries, ~10 GB binary file

# Step 3: Shuffle (for SGD convergence)
./shuffle < cooccur.bin > cooccur.shuf.bin

# Step 4: Train
./glove -vector-size 300 -threads 8 -iter 50 \
    -eta 0.05 -alpha 0.75 -x-max 100 \
    -input-file cooccur.shuf.bin -vocab-file vocab.txt \
    -save-file vectors
# Output: vectors.txt (word + 300 floats per line)
```
```python
# Using pre-trained GloVe with NumPy only (no other dependencies)
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

# Common operations
glove = load_glove('glove.6B.300d.txt')

# Cosine similarity
def sim(a, b):
    return glove[a] @ glove[b] / (np.linalg.norm(glove[a]) * np.linalg.norm(glove[b]))

# Related pairs score well above unrelated ones;
# exact values depend on which GloVe file is loaded.
print(f"king-queen: {sim('king','queen'):.3f}")
print(f"cat-dog: {sim('cat','dog'):.3f}")

# Analogy: king - man + woman = ?
target = glove['king'] - glove['man'] + glove['woman']
target /= np.linalg.norm(target)
# Brute-force search over the vocabulary, excluding the three query words
best = max(
    ((w, target @ v / np.linalg.norm(v)) for w, v in glove.items()
     if w not in {'king', 'man', 'woman'}),
    key=lambda x: x[1]
)
print(f"king - man + woman = {best[0]} ({best[1]:.3f})")
# Expected top match: queen
```