GloVe — Veanors

Chapter 1: The Co-occurrence Matrix

GloVe starts by building a word-word co-occurrence matrix X from the entire corpus. Entry X_ij counts how many times word j appears in the context of word i.

Building the matrix

Slide a symmetric window of size c around each word. For each (center, context) pair, increment X_{center, context}. But with a twist: GloVe uses harmonic weighting — context words that are farther from the center contribute less. A word at distance d contributes 1/d to the count.

X_ij = ∑_{positions where i is the center} ∑_{d=1}^{c} (1/d) · [word at distance d is j]

For the sentence "the cat sat on the mat" (positions numbered from 0) with c = 2:

"cat" at position 1, window: the(d=1, weight 1), sat(d=1, weight 1), on(d=2, weight 0.5)
"sat" at position 2, window: the(d=2, weight 0.5), cat(d=1, weight 1), on(d=1, weight 1), the(d=2, weight 0.5)

After processing the entire corpus, X is a V × V matrix. It is sparse — most word pairs never co-occur — and symmetric (X_ij = X_ji for symmetric windows).

Key derived quantities

X_i = ∑_k X_ik

The total co-occurrence count for word i (sum of row i).

P_ij = P(j | i) = X_ij / X_i

The probability that word j appears in the context of word i.
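As a rough sketch of these two definitions, assuming the matrix is held as a dict of dicts (the same shape as the `cooccur[i][j]` structure used later in this chapter). The helper names `row_totals` and `cond_prob` and the toy counts are illustrative, not taken from the GloVe code:

```python
def row_totals(cooccur):
    """X_i = sum_k X_ik for every row i of a dict-of-dicts matrix."""
    return {i: sum(row.values()) for i, row in cooccur.items()}


def cond_prob(cooccur, totals, i, j):
    """P_ij = P(j | i) = X_ij / X_i; zero if the pair never co-occurs."""
    return cooccur.get(i, {}).get(j, 0.0) / totals[i]


# Toy counts, made up for illustration: rows are center words.
X = {
    "ice": {"solid": 3.0, "water": 6.0, "gas": 1.0},
    "steam": {"solid": 1.0, "water": 5.0, "gas": 4.0},
}
totals = row_totals(X)
p_solid_given_ice = cond_prob(X, totals, "ice", "solid")
print(totals["ice"], p_solid_given_ice)
```

Each row of P sums to 1, so P_ij is a proper conditional distribution over context words for a fixed center word i.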

Sparsity and scale

For a real corpus:

Corpus	Vocab V	Matrix size V²	Nonzero entries	Density
Wikipedia 6B	400K	160 billion	~1 billion	0.6%
Common Crawl 42B	1.9M	3.6 trillion	~10 billion	~0.3%

The matrix is extremely sparse. GloVe only processes nonzero entries, which is why it scales to enormous vocabularies. The co-occurrence matrix is stored as a sparse data structure (list of (i, j, X_ij) triples), not as a dense V × V array.
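A small sketch of the triple representation, assuming a dict-of-dicts matrix as input. The helper names `to_triples` and `density` are hypothetical, and the toy matrix is made up:

```python
def to_triples(cooccur):
    """Flatten a dict-of-dicts matrix into a list of (i, j, X_ij) triples."""
    return [(i, j, x) for i, row in cooccur.items() for j, x in row.items()]


def density(num_nonzero, vocab_size):
    """Fraction of the V x V matrix that is nonzero."""
    return num_nonzero / (vocab_size ** 2)


# Toy matrix over a 4-word vocabulary (indices 0..3).
X = {0: {1: 1.0, 2: 0.5}, 1: {0: 1.0}, 2: {0: 0.5}}
triples = to_triples(X)
toy_density = density(len(triples), vocab_size=4)

# Density implied by the Wikipedia-scale figures quoted in the table above.
wiki_density = density(1e9, 400_000)
print(triples, toy_density, wiki_density)
```

Storage and training cost then scale with the number of triples rather than with V².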

Worked example

For the sentence "the cat sat on the mat" with c = 2 and harmonic weighting:

X[cat, sat] = 1/1 = 1.0 (distance 1)

X[cat, the] = 1/1 = 1.0 (distance 1, first "the")

X[cat, on] = 1/2 = 0.5 (distance 2)

After scanning the entire sentence, each cell accumulates contributions from all co-occurrence events. The harmonic weighting 1/d naturally down-weights distant context words.
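The three cells above can be reproduced with a few lines of Python. This is a minimal sketch that keys the counts by word pairs instead of vocabulary indices; the function name `harmonic_counts` is an illustrative choice:

```python
from collections import defaultdict


def harmonic_counts(tokens, window):
    """Accumulate 1/d-weighted co-occurrence counts for one token list."""
    counts = defaultdict(float)
    for pos, center in enumerate(tokens):
        for d in range(1, window + 1):
            # Look d positions to the left and d positions to the right.
            for ctx_pos in (pos - d, pos + d):
                if 0 <= ctx_pos < len(tokens):
                    counts[(center, tokens[ctx_pos])] += 1.0 / d
    return counts


tokens = "the cat sat on the mat".split()
X = harmonic_counts(tokens, window=2)
for pair in [("cat", "sat"), ("cat", "the"), ("cat", "on")]:
    print(pair, X[pair])
```

Because the window is symmetric, every contribution to X[a, b] has a matching contribution to X[b, a], which is why the resulting matrix is symmetric.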

Why harmonic weighting? A word 5 positions away from the center is weaker context than a word 1 position away, and the 1/d weighting captures this. Without it, every position inside the window would count equally and the distance information would be lost. Word2Vec gets a similar effect indirectly: it samples the effective window size at random for each center word, so nearby words are included more often than distant ones. GloVe instead encodes distance directly in the co-occurrence counts.

Co-occurrence Matrix Builder

Interactive figure: the matrix is built as a window slides across a sentence. Brighter cells mean higher co-occurrence counts; the window size c (default 2) is adjustable.

python
import numpy as np
from collections import defaultdict

def build_cooccurrence(corpus, vocab, window=10):
    """
    corpus: list of list of int (tokenized, vocab-indexed sentences)
    vocab: dict mapping word -> index
    window: context window size
    Returns: sparse co-occurrence matrix (dict of dicts)
    """
    V = len(vocab)
    cooccur = defaultdict(lambda: defaultdict(float))

    for sentence in corpus:
        for i, center in enumerate(sentence):
            for d in range(1, window + 1):
                weight = 1.0 / d  # harmonic weighting
                for offset in [-d, d]:
                    j = i + offset
                    if 0 <= j < len(sentence):
                        context = sentence[j]
                        cooccur[center][context] += weight

    return cooccur  # cooccur[i][j] = X_ij

# For a corpus of 6B tokens with V=400K words:
# - X has ~400K x 400K = 160B entries
# - But only ~1B are nonzero (0.6% density)
# - Stored as sparse matrix: ~10 GB

Comparison: GloVe vs. Word2Vec co-occurrence

Both methods use a sliding context window, but handle the counts differently:

Property	Word2Vec (Skip-gram)	GloVe
Counting	Implicit (each pair is a training example)	Explicit (build X matrix first)
Distance weighting	Implicit (window size sampled uniformly from 1 to c per center word, so nearer words are used more often)	Explicit 1/d harmonic weighting
Frequency weighting	Subsampling frequent words + noise dist f^0.75	f(X_ij) = min((X/x_max)^0.75, 1)
Memory	O(V · d) for embeddings only	O(nonzero entries) for X + O(V · d) for vectors
Passes over data	Stream corpus once (or few times)	One pass to build X, then iterate on X

Why does GloVe use 1/d harmonic weighting when building the co-occurrence matrix?

Words closer to the center provide stronger context signal, so they should contribute more to the co-occurrence count. A word 1 position away contributes 1.0 while a word 5 positions away contributes only 0.2.
It makes the matrix symmetric.
It reduces the total number of non-zero entries.

Chapter 2: The Ratio Insight

This chapter contains GloVe's most important contribution: the insight that word meaning is encoded not in raw co-occurrence probabilities, but in their ratios.

The ice/steam example

Consider two target words: "ice" and "steam." We want to learn vectors that capture their relationship. Let's look at their co-occurrence probabilities with various probe words k:

Probe word k	P(k | ice)	P(k | steam)	P(k | ice) / P(k | steam)
solid	1.9 × 10⁻⁴	2.2 × 10⁻⁵	8.9 (large — "solid" is much more ice-like)
gas	6.6 × 10⁻⁵	7.8 × 10⁻⁴	0.085 (small — "gas" is much more steam-like)
water	3.0 × 10⁻³	2.2 × 10⁻³	1.36 (near 1 — "water" is related to both)
fashion	1.7 × 10⁻⁵	1.8 × 10⁻⁵	0.96 (near 1 — "fashion" is unrelated to both)

The raw probabilities P(k | ice) and P(k | steam) are hard to interpret in isolation — they're tiny numbers that depend on the overall frequency of each word. But the ratio tells a clear story:

Ratio ≫ 1: The probe word k is much more associated with "ice" than "steam" (e.g., "solid" → 8.9)
Ratio ≪ 1: The probe word k is much more associated with "steam" than "ice" (e.g., "gas" → 0.085)
Ratio ≈ 1: The probe word k is equally associated with both (either related to both, like "water", or unrelated to both, like "fashion")

The key insight: The information that distinguishes "ice" from "steam" is encoded in the ratio of co-occurrence probabilities, not the probabilities themselves. GloVe's objective is designed to learn word vectors that capture these ratios. This is why GloVe produces vectors with linear analogy structure — the ratio is a multiplicative relationship, and in log space, it becomes additive. Additive relationships in log space = linear structure in the embedding.
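A short sketch of the ratio test, using the probabilities from the ice/steam table above. The `classify` helper and its cutoff of 3 are arbitrary illustrative choices, not part of GloVe. Ratios recomputed from the rounded table entries differ slightly from the listed ones (for example about 8.6 instead of 8.9 for "solid"), presumably because the table's probabilities are themselves rounded:

```python
# Co-occurrence probabilities copied from the ice/steam table above.
p_ice = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
p_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}


def classify(ratio, threshold=3.0):
    """Label a probe word by its ratio; `threshold` is an arbitrary cutoff."""
    if ratio > threshold:
        return "ice-like"
    if ratio < 1.0 / threshold:
        return "steam-like"
    return "non-discriminative"


ratios = {k: p_ice[k] / p_steam[k] for k in p_ice}
labels = {k: classify(r) for k, r in ratios.items()}
for k in ratios:
    print(f"{k:8s} ratio={ratios[k]:6.3f}  {labels[k]}")
```

Taking logs of the ratios gives values that are positive for ice-like probes, negative for steam-like probes, and near zero otherwise, which is the additive form the next chapter builds on.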

More ratio examples

The ratio insight generalizes beyond ice/steam. Consider "cat" vs. "dog":

Probe word k	P(k | cat)	P(k | dog)	Ratio	Interpretation
purr	5.2 × 10⁻⁵	1.3 × 10⁻⁶	40	Strongly cat-like
bark	2.1 × 10⁻⁶	8.7 × 10⁻⁵	0.024	Strongly dog-like
pet	4.1 × 10⁻⁴	3.8 × 10⁻⁴	1.08	Equally related to both
algorithm	1.0 × 10⁻⁶	0.9 × 10⁻⁶	1.11	Unrelated to both

The ratio separates the discriminative probes, cat-specific words (ratio ≫ 1) and dog-specific words (ratio ≪ 1), from the non-discriminative ones (ratio ≈ 1). The last group covers both shared pet words, whose probabilities are large for both targets, and irrelevant words, whose probabilities are tiny for both. The ratio treats these two alike because neither helps tell "cat" from "dog".

Why raw probabilities fail

If we tried to use raw P(k | ice) instead of ratios, we'd face a problem: the probabilities are dominated by the overall frequency of the probe word k. A very common word like "the" has a large P("the" | i) for almost every target word i, not because "the" is semantically related to everything, but because it appears in so many contexts. The ratio P(k | ice) / P(k | steam) cancels out this frequency effect, isolating the semantic signal.

From ratios to vectors

GloVe's design requirement: the word vectors should encode the co-occurrence probability ratios. Specifically, we want a function F such that:

F(w_i, w_j, w̃_k) = P_ik / P_jk

where w_i, w_j are target word vectors and w̃_k is a context word vector. The function F takes three vectors and should produce the probability ratio. The next chapter derives what F must be.

Co-occurrence Probability Ratios

The ratio P(k|ice)/P(k|steam) reveals which probe words discriminate between "ice" and "steam." Ratios far from 1 are discriminative; ratios near 1 are uninformative.

Why are co-occurrence probability ratios more informative than raw probabilities for distinguishing word meanings?

Raw probabilities are tiny and depend on word frequency. Ratios cancel out the frequency effect and directly encode whether a probe word is more associated with one word or the other: ratio ≫ 1 or ≪ 1 means discriminative, ratio ≈ 1 means uninformative.
Ratios are easier to compute.
Raw probabilities are always zero for rare words.

Chapter 3: Deriving the Objective

This is the mathematical heart of GloVe. We start from the ratio requirement and derive the training objective step by step.

Step 1: The ratio should depend on vector differences

We want F(w_i, w_j, w̃_k) = P_ik/P_jk. Since the ratio captures how word i differs from word j with respect to context k, it should depend on the difference w_i − w_j:

F(w_i − w_j, w̃_k) = P_ik / P_jk

Step 2: The output is a scalar

F takes two vectors (w_i − w_j and w̃_k) and produces a scalar (the ratio). The simplest way: use the dot product.

F((w_i − w_j)^T w̃_k) = P_ik / P_jk

Step 3: F must be a homomorphism

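The right side is a ratio: P_ik/P_jk. The left side has a difference: (w_i − w_j)^Tw̃_k = w_i^Tw̃_k − w_j^Tw̃_k. For the two sides to be consistent, F must convert a difference (additive) into a ratio (multiplicative): F(a − b) = F(a) / F(b). In other words, F must be a homomorphism from the reals under addition to the positive reals under multiplication, and the continuous solutions are exponentials. Taking the simplest one: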

F = exp

So:

exp(w_i^Tw̃_k − w_j^Tw̃_k) = P_ik / P_jk

This gives us:

exp(w_i^Tw̃_k) / exp(w_j^Tw̃_k) = P_ik / P_jk

Step 4: Match individual terms

For the ratio to work out term-by-term:

exp(w_i^Tw̃_k) = λ · P_ik

Taking the logarithm:

w_i^Tw̃_k = log P_ik + log λ

w_i^Tw̃_k = log X_ik − log X_i + log λ

Step 5: Absorb constants into biases

The term log X_i depends only on word i, not on context k. Absorb it (and log λ) into a bias term b_i. For symmetry, add a context bias b̃_k:

w_i^Tw̃_k + b_i + b̃_k = log X_ik

This is the GloVe model: The dot product of word vector w_i and context vector w̃_k, plus biases, should equal the log of the co-occurrence count. It's a log-bilinear regression model. The entire derivation started from "vectors should capture probability ratios" and arrived at "dot product should approximate log co-occurrence counts."
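Steps 3 and 4 can be checked numerically. The sketch below reads Step 4 backwards: it picks random vectors and an arbitrary constant λ, defines P so that exp(w^T w̃_k) = λ · P holds, and confirms that the exponential of the dot-product difference equals the ratio. The values are synthetic stand-ins, not real co-occurrence probabilities, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_i, w_j, w_ctx_k = rng.normal(size=(3, d))  # three random d-dim vectors
lam = 0.37  # arbitrary positive constant (the λ of Step 4)

# Step 4 read backwards: define P so that exp(w^T w~_k) = λ · P.
P_ik = np.exp(w_i @ w_ctx_k) / lam
P_jk = np.exp(w_j @ w_ctx_k) / lam

# Step 3: exp turns the difference of dot products into a ratio; λ cancels.
lhs = np.exp((w_i - w_j) @ w_ctx_k)
rhs = P_ik / P_jk

# Step 4 in log form: w_i^T w~_k = log P_ik + log λ.
log_form_gap = w_i @ w_ctx_k - (np.log(P_ik) + np.log(lam))
print(lhs, rhs, log_form_gap)
```

The constant λ drops out of the ratio, which is why it can later be folded into the bias term without changing what the vectors encode.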

Step 6: The least-squares objective

Train by minimizing the squared error between the model's prediction and the actual log co-occurrence:

J = ∑_{i,j=1}^{V} f(X_ij) · (w_i^Tw̃_j + b_i + b̃_j − log X_ij)²

where f(X_ij) is a weighting function (Chapter 4). The sum is over all non-zero entries of X — typically around 1 billion entries for a large corpus.

Worked example: one gradient step

Suppose word i = "ice" and word j = "cold" with X_ij = 50. Our current parameters:

w_ice = [0.3, −0.1], w̃_cold = [0.5, 0.2], b_ice = 1.0, b̃_cold = 0.5

Prediction:

w_ice^T w̃_cold + b_ice + b̃_cold = (0.15 − 0.02) + 1.0 + 0.5 = 1.63

Target:

log X_ij = log(50) = 3.912

Error:

diff = 1.63 − 3.912 = −2.282

Weight:

f(50) = (50/100)^0.75 = 0.5^0.75 = 0.595

Weighted loss:

L = 0.595 × (−2.282)² = 0.595 × 5.207 = 3.098

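Gradient for w_ice (the constant factor 2 from differentiating the square is dropped here, since it can be absorbed into the learning rate):

∂L/∂w_ice ∝ f(50) × diff × w̃_cold = 0.595 × (−2.282) × [0.5, 0.2] = [−0.679, −0.272]

The gradient is a negative multiple of w̃_cold. Gradient descent subtracts it, so the update moves w_ice in the direction of w̃_cold, increasing their dot product and bringing the prediction closer to log(50) = 3.912. This is exactly what we want: "ice" and "cold" co-occur frequently, so their dot product should be large.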
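The arithmetic of this worked example can be replayed in NumPy. This is a sketch under the same numbers as above; the learning rate `lr = 0.1` is an arbitrary illustrative value. Small differences in the last digit relative to the hand calculation come from rounding intermediate values there:

```python
import numpy as np

w_ice = np.array([0.3, -0.1])
w_ctx_cold = np.array([0.5, 0.2])
b_ice, b_ctx_cold = 1.0, 0.5
x_ij, x_max, alpha = 50.0, 100.0, 0.75

prediction = w_ice @ w_ctx_cold + b_ice + b_ctx_cold
target = np.log(x_ij)
diff = prediction - target
weight = min((x_ij / x_max) ** alpha, 1.0)
loss = weight * diff ** 2

# Gradient as written in the worked example (it omits the constant
# factor 2 that comes from differentiating the square).
grad_w_ice = weight * diff * w_ctx_cold

# One plain gradient-descent step; lr is an arbitrary illustrative value.
lr = 0.1
w_ice_new = w_ice - lr * grad_w_ice
new_prediction = w_ice_new @ w_ctx_cold + b_ice + b_ctx_cold
print(prediction, loss, grad_w_ice, new_prediction)
```

After the step, the prediction moves from 1.63 toward the target log(50), which is the behavior described above.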

The GloVe Derivation Flow

Six steps from the ratio requirement to the final objective. The key: exponential converts additive (vector difference) to multiplicative (probability ratio).

python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, cooccur, f_weights):
    """
    W:       (V, d) — word vectors
    W_ctx:   (V, d) — context vectors
    b:       (V,)   — word biases
    b_ctx:   (V,)   — context biases
    cooccur: list of (i, j, X_ij) — nonzero entries
    f_weights: list of f(X_ij) — precomputed weights
    Returns: total loss
    """
    total_loss = 0.0
    for (i, j, x_ij), f_x in zip(cooccur, f_weights):
        # Model prediction
        prediction = W[i] @ W_ctx[j] + b[i] + b_ctx[j]

        # Target
        target = np.log(x_ij)

        # Weighted squared error
        total_loss += f_x * (prediction - target) ** 2

    return total_loss

Why the derivation works

Let's pause and appreciate the elegance of this derivation. We started with a vague requirement ("vectors should capture probability ratios") and ended with a concrete equation. The key mathematical insight is the homomorphism property: we needed a function F that maps addition (in the argument) to multiplication (in the output). The only continuous functions with this property are exponentials of the form exp(c·x), and the constant c can be absorbed into the scale of the vectors. Once you insist on F = exp, the entire model follows.

This is why GloVe produces vectors with linear analogy structure. The exponential function converts additive relationships in the vector space into multiplicative relationships in probability space. Semantic relationships are multiplicative (ratios), so they must be additive in the log-space of word vectors. King − man + woman = queen works because "royalty" and "gender" are independent multiplicative factors in the co-occurrence statistics.

The symmetry argument

One subtlety in the derivation: the co-occurrence matrix X is symmetric (X_ij = X_ji for symmetric windows), but the model equation treats w_i and w̃_j differently. GloVe resolves this by noting that the model should work equally well with roles swapped. This means the word vectors and context vectors should be interchangeable — which is why summing W + W̃ as the final representation makes sense. It explicitly restores the symmetry that the parameterization breaks.

What the biases learn

The bias terms b_i and b̃_j absorb word-frequency information. After training:

b_i ≈ log(X_i) ≈ log(frequency of word i)

This separates frequency information from semantic information. The word vector w_i captures what a word means; the bias b_i captures how often it appears. Without biases, the word vectors would need to encode both — and the frequency signal would contaminate the semantic structure.

What is the GloVe model equation, and what does it say in plain English?

w_i · w̃_j + b_i + b̃_j = log(X_ij): the dot product of a word vector and a context vector, plus biases, should equal the log co-occurrence count. It's a log-bilinear model derived from the requirement that vectors capture probability ratios.
The model predicts the probability of the next word using softmax.
The model factorizes the raw co-occurrence matrix X directly.

Chapter 4: The Weighting Function

The raw objective J = ∑ (w_i^Tw̃_j + b_i + b̃_j − log X_ij)² has a problem: it treats all co-occurrence counts equally. But a pair that co-occurs 10,000 times should matter more than a pair that co-occurs twice, and a pair that co-occurs 1,000,000 times shouldn't dominate the objective.

The solution is the weighting function f(X_ij):

f(x) = (x / x_max)^α if x < x_max

f(x) = 1 if x ≥ x_max

The paper sets x_max = 100 and α = 3/4.

Why this specific function?

Three design requirements:

f(0) = 0. If two words never co-occur, the entry shouldn't contribute to the loss at all. (We skip zero entries.) This is critical because the matrix is sparse — most entries are zero.
f(x) should be non-decreasing. More frequent co-occurrences should get at least as much weight. A pair that co-occurs 100 times provides more reliable statistics than a pair that co-occurs 3 times.
f(x) should not be too large for very high x. Ultra-frequent pairs like ("the", "of") shouldn't overwhelm the objective. The cap at f(x) = 1 prevents this.

The effect of α

The exponent α controls how rapidly the weight grows with co-occurrence count:

α = 1 (linear): Weight grows proportionally to count. Frequent pairs dominate.
α = 0 (constant): All non-zero pairs get equal weight. Noise from rare pairs.
α = 3/4 (GloVe's choice): Sweet spot. Moderate co-occurrences get substantial weight; rare pairs get some weight but don't overwhelm; very frequent pairs are capped.
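To make the three regimes concrete, here is a vectorized sketch of the weighting function evaluated at a few counts for α = 0, 0.75, and 1. The name `f_weight_vec` and the chosen counts are illustrative:

```python
import numpy as np


def f_weight_vec(x, x_max=100.0, alpha=0.75):
    """Vectorized GloVe weighting: (x / x_max)^alpha, capped at 1."""
    x = np.asarray(x, dtype=float)
    return np.minimum((x / x_max) ** alpha, 1.0)


counts = np.array([1, 10, 50, 100, 10_000])
for alpha in (0.0, 0.75, 1.0):
    print(alpha, np.round(f_weight_vec(counts, alpha=alpha), 3))
```

With α = 0 every nonzero count gets weight 1, with α = 1 the weight is proportional to the count up to the cap, and α = 0.75 sits between the two.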

Why 3/4 again? Both GloVe's weighting and Word2Vec's negative sampling noise distribution use the 3/4 exponent. In both cases, the 3/4 power provides a sublinear scaling that prevents frequent items from dominating while still giving them more influence than rare items. The value 3/4 was found empirically in both papers, so the match is suggestive rather than derived.

Weighting Function f(x)

Interactive figure: the weighting function caps the influence of very frequent co-occurrences. Its shape is controlled by α (default 0.75) and x_max (default 100).

python
import numpy as np

def f_weight(x, x_max=100, alpha=0.75):
    """GloVe weighting function."""
    if x >= x_max:
        return 1.0
    return (x / x_max) ** alpha

# Examples:
# f(1)     = (1/100)^0.75  = 0.032  (rare pair: very low weight)
# f(10)    = (10/100)^0.75 = 0.178  (moderate: meaningful weight)
# f(50)    = (50/100)^0.75 = 0.595  (common: substantial weight)
# f(100)   = 1.0                    (capped: maximum weight)
# f(10000) = 1.0                    (still capped)

Worked example: f(x) for different co-occurrence levels

Consider four word pairs with different co-occurrence counts:

Word pair	X_ij	f(X_ij)	log X_ij	Contribution to loss
"quantum" + "mechanics"	3	(3/100)^0.75 = 0.072	1.099	Tiny weight: rare pair, may be noisy
"cat" + "animal"	25	(25/100)^0.75 = 0.354	3.219	Moderate weight: reliable signal
"the" + "is"	500	1.0 (capped)	6.215	Maximum weight but not overwhelming
"the" + "the"	50,000	1.0 (capped)	10.82	Same weight as "the" + "is" despite being 100x more frequent

If weight grew in proportion to the raw count, the ("the", "the") pair would outweigh ("quantum", "mechanics") by roughly 16,700:1 (50,000/3). The weighting function compresses this range to approximately 14:1 (1.0/0.072). This is still a large range, so frequent pairs do matter more, but it is manageable.
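The compression of the weight range can be computed directly from the counts in the table. A minimal sketch, reusing the `f_weight` definition from earlier in this chapter; the dictionary layout is an illustrative choice:

```python
def f_weight(x, x_max=100, alpha=0.75):
    """GloVe weighting function, as defined earlier in this chapter."""
    return 1.0 if x >= x_max else (x / x_max) ** alpha


# Co-occurrence counts from the table above.
pairs = {
    ("quantum", "mechanics"): 3,
    ("cat", "animal"): 25,
    ("the", "is"): 500,
    ("the", "the"): 50_000,
}
weights = {pair: f_weight(x) for pair, x in pairs.items()}

rarest, commonest = ("quantum", "mechanics"), ("the", "the")
count_ratio = pairs[commonest] / pairs[rarest]       # spread of raw counts
weight_ratio = weights[commonest] / weights[rarest]  # spread after f(x)
print(weights, count_ratio, weight_ratio)
```

Comparing `count_ratio` with `weight_ratio` shows how much the capped, sublinear weighting narrows the gap between the rarest and the most frequent pair.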

What problem does the weighting function f(X_ij) solve in the GloVe objective?

Without weighting, ultra-frequent pairs like ("the", "of") would dominate the loss, while rare but informative pairs would be ignored. The capped, sublinear f(x) = (x/x_max)^0.75 gives balanced influence to all co-occurrence levels.
It makes the matrix symmetric.
It converts co-occurrence counts to probabilities.

Chapter 5: Training GloVe

GloVe's training is simpler than Word2Vec's. There is no neural network, no backpropagation through hidden layers, no softmax. It is a weighted least-squares regression optimized by gradient descent.

Parameters

For each word i in the vocabulary:

Word vector w_i ∈ R^d
Context vector w̃_i ∈ R^d
Word bias b_i ∈ R
Context bias b̃_i ∈ R

Total parameters: 2V(d + 1). For V = 400,000 and d = 300: about 240 million parameters.
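The count is easy to verify. In this sketch the helper name `glove_param_count` is hypothetical, and the memory estimate assumes 8-byte floats, which is an assumption and not a statement about any particular implementation:

```python
def glove_param_count(vocab_size, dim):
    """Two vectors (word + context) and two biases per vocabulary word."""
    return 2 * vocab_size * (dim + 1)


n_params = glove_param_count(400_000, 300)

# Rough memory for the parameters alone, assuming 8-byte floats.
param_gigabytes = n_params * 8 / 1e9
print(n_params, round(param_gigabytes, 2))
```

AdaGrad keeps one accumulator per parameter, so the optimizer state roughly doubles this footprint.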

Training procedure

Step 1: Build X

Scan corpus once to build co-occurrence matrix. Store only nonzero entries (~1B for large corpora).

↓

Step 2: Initialize

Random init for all w, w̃, b, b̃. Small values, uniform or normal distribution.

↓

Step 3: Iterate

For each nonzero X_ij: compute gradient of f(X_ij)(w_i · w̃_j + b_i + b̃_j − log X_ij)². Update by AdaGrad.

↻ 50-100 epochs

Step 4: Combine

Final vectors: W + W̃. Sum of word and context vectors (exploits symmetry of X).

The gradient

For a single (i, j) entry, dropping the constant factor 2 that comes from differentiating the square (it is absorbed into the learning rate):

∂J/∂w_i = f(X_ij) · (w_i^Tw̃_j + b_i + b̃_j − log X_ij) · w̃_j

∂J/∂w̃_j = f(X_ij) · (w_i^Tw̃_j + b_i + b̃_j − log X_ij) · w_i

∂J/∂b_i = ∂J/∂b̃_j = f(X_ij) · (w_i^Tw̃_j + b_i + b̃_j − log X_ij)

These are simple: weighted error times the other vector (for word/context vectors) or weighted error times 1 (for biases). No chain rule through nonlinearities.
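A finite-difference check is a quick way to confirm the gradient formula. The sketch below differentiates the single-pair loss exactly, so it keeps the factor 2 from the square that the formulas above leave out. The function names and test values are illustrative:

```python
import numpy as np


def pair_loss(w, w_ctx, b, b_ctx, x, x_max=100.0, alpha=0.75):
    """Single-pair GloVe loss f(x) * (w.w~ + b + b~ - log x)^2."""
    fx = min((x / x_max) ** alpha, 1.0)
    return fx * (w @ w_ctx + b + b_ctx - np.log(x)) ** 2


def analytic_grad_w(w, w_ctx, b, b_ctx, x, x_max=100.0, alpha=0.75):
    """Exact gradient w.r.t. w, including the factor 2 from the square."""
    fx = min((x / x_max) ** alpha, 1.0)
    diff = w @ w_ctx + b + b_ctx - np.log(x)
    return 2.0 * fx * diff * w_ctx


def numeric_grad_w(w, w_ctx, b, b_ctx, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        grad[k] = (pair_loss(w + step, w_ctx, b, b_ctx, x)
                   - pair_loss(w - step, w_ctx, b, b_ctx, x)) / (2 * eps)
    return grad


w = np.array([0.3, -0.1])
w_ctx = np.array([0.5, 0.2])
g_exact = analytic_grad_w(w, w_ctx, 1.0, 0.5, 50.0)
g_numeric = numeric_grad_w(w, w_ctx, 1.0, 0.5, 50.0)
print(g_exact, g_numeric)
```

The two gradients agree, and both are exactly twice the factor-free expression, which confirms that dropping the 2 only rescales the learning rate.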

Why W + W̃?

GloVe uses two sets of vectors: word vectors W and context vectors W̃. When X is symmetric (as it is for symmetric windows), the objective treats them symmetrically, so the two sets differ only because of their random initializations. The paper reports that using the sum W + W̃ typically gives a small boost in performance over either set alone. Intuitively, combining two estimates of the same quantity reduces noise.

Why AdaGrad?

GloVe uses AdaGrad (Adaptive Gradient) rather than standard SGD. AdaGrad maintains a per-parameter sum of squared gradients and divides the learning rate by the square root of this sum:

θ_{t+1} = θ_t − (η / √(G_t + ε)) · g_t

where G_t = ∑_{τ=1}^{t} g_τ² is the accumulated squared gradient and g_t is the current gradient.

AdaGrad is ideal for GloVe because:

Rare words get larger updates. A word that appears infrequently has small G_t, so its effective learning rate is large. This compensates for having fewer training examples.
Frequent words get smaller updates. A word like "the" has been updated millions of times, accumulating a large G_t. Its effective learning rate is tiny, preventing oscillation.
Matches GloVe's data distribution. The co-occurrence matrix has a Zipfian distribution — a few entries are huge, most are small. AdaGrad naturally adapts to this imbalance.
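The first two points can be illustrated with a scalar toy example: two parameters receive the same gradient, but one has a long update history and the other has none. This is a sketch of the AdaGrad rule above, not the reference implementation; `adagrad_step` and the numbers are illustrative:

```python
import math


def adagrad_step(theta, grad, accum, lr=0.05, eps=1e-8):
    """One AdaGrad update for a scalar parameter.

    Returns the updated parameter and the updated accumulator G_t.
    """
    accum = accum + grad ** 2
    theta = theta - lr * grad / math.sqrt(accum + eps)
    return theta, accum


# "Frequent" parameter: has already received 1000 gradients of size 1.0.
theta_freq, acc_freq = 0.0, 0.0
for _ in range(1000):
    theta_freq, acc_freq = adagrad_step(theta_freq, 1.0, acc_freq)

# "Rare" parameter: has not been updated yet.
theta_rare, acc_rare = 0.0, 0.0

# Apply the same gradient to both and compare the step sizes.
prev_freq = theta_freq
theta_freq, acc_freq = adagrad_step(theta_freq, 1.0, acc_freq)
theta_rare, acc_rare = adagrad_step(theta_rare, 1.0, acc_rare)

step_freq = abs(theta_freq - prev_freq)  # roughly lr / sqrt(1001)
step_rare = abs(theta_rare)              # roughly lr / sqrt(1)
print(step_freq, step_rare)
```

The rarely updated parameter takes a step about 30 times larger for the same gradient, which is the per-parameter adaptation described in the list above.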

GloVe vs. Word2Vec training: Word2Vec trains online — each word updates the model as it's read from the corpus. GloVe is batch — it first counts everything, then optimizes. This means GloVe can be parallelized more easily (each nonzero entry is an independent training example) and converges more predictably. The downside: you need to store the co-occurrence matrix (tens of GB for large corpora).

Parallelism

GloVe's training is embarrassingly parallel. Each (i, j, X_ij) entry produces an independent gradient. Multiple threads can process different entries simultaneously with minimal synchronization (just atomic updates to shared vectors). The original GloVe implementation achieved near-linear speedup with up to 32 threads.

Word2Vec also parallelizes well (using Hogwild-style asynchronous SGD on the text stream), but GloVe's independence structure is cleaner. This contributed to GloVe's slightly faster wall-clock training times on identical hardware.

Training pipeline end-to-end

A complete GloVe training run on 6B tokens:

Stage	Time	Output
1. Vocabulary construction	~30 min	400K words above min-count threshold
2. Corpus scanning	~2 hours	~1B nonzero (i, j, X_ij) entries
3. Shuffle entries	~20 min	Randomized training order
4. Train 50 iterations	~3 hours (8 threads)	400K × 300 word vectors
5. Combine W + W_ctx	Seconds	Final vectors
Total	~6 hours	glove.6B.300d.txt (1.04 GB)

The co-occurrence matrix construction (stage 2) is the most memory-intensive step, requiring ~10 GB of RAM to store the sparse matrix. The training itself (stage 4) is CPU-intensive but memory-light — it only needs the vectors, biases, and AdaGrad accumulators in memory.

Convergence behavior

GloVe converges smoothly because the objective is a weighted least-squares problem — convex in each variable when others are fixed (biconvex overall). The loss curve is typically monotonically decreasing with occasional plateaus. Early iterations reduce loss rapidly; later iterations fine-tune the geometry. The paper found that 50-100 iterations suffice for most corpora, with diminishing returns beyond 100.

python
import numpy as np

# Monitoring GloVe training convergence
def evaluate_glove(vectors, analogy_test):
    """Evaluate word vectors on analogy task during training."""
    correct = 0
    total = 0
    for a, b, c, expected in analogy_test:
        if any(w not in vectors for w in [a, b, c, expected]):
            continue
        target = vectors[b] - vectors[a] + vectors[c]
        target /= np.linalg.norm(target) + 1e-10

        best_word, best_sim = None, -1
        for word, vec in vectors.items():
            if word in {a, b, c}:
                continue
            sim = np.dot(target, vec / (np.linalg.norm(vec) + 1e-10))
            if sim > best_sim:
                best_sim, best_word = sim, word

        if best_word == expected:
            correct += 1
        total += 1

    return correct / total if total > 0 else 0

# Typical convergence: accuracy vs. iteration
# Iter 1:   ~30% accuracy (random-ish vectors)
# Iter 10:  ~55% accuracy (structure emerging)
# Iter 25:  ~67% accuracy (most structure captured)
# Iter 50:  ~71% accuracy (fine-tuning)
# Iter 100: ~72% accuracy (diminishing returns)

GloVe Training: Loss Over Iterations

Interactive figure: GloVe converging as gradient descent minimizes the weighted least-squares loss on a tiny toy corpus, with the loss reported per epoch.

python
import numpy as np

def train_glove(cooccur, V, d=50, epochs=100, lr=0.05, x_max=100, alpha=0.75):
    """
    cooccur: list of (i, j, X_ij) — nonzero entries
    Returns: word vectors W + W_ctx (sum of both)
    """
    # Initialize
    W = (np.random.rand(V, d) - 0.5) / d
    W_ctx = (np.random.rand(V, d) - 0.5) / d
    b = np.zeros(V)
    b_ctx = np.zeros(V)

    # AdaGrad accumulators
    W_sum = np.ones((V, d))  # init to 1 to avoid division by zero
    W_ctx_sum = np.ones((V, d))
    b_sum = np.ones(V)
    b_ctx_sum = np.ones(V)

    for epoch in range(epochs):
        total_loss = 0.0
        np.random.shuffle(cooccur)

        for i, j, x_ij in cooccur:
            # Weighting
            fw = min((x_ij / x_max) ** alpha, 1.0)

            # Error
            diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x_ij)
            loss = fw * diff ** 2
            total_loss += loss

            # Gradients
            grad_common = fw * diff
            grad_w = grad_common * W_ctx[j]
            grad_wc = grad_common * W[i]

            # AdaGrad update
            W_sum[i] += grad_w ** 2
            W[i] -= lr * grad_w / np.sqrt(W_sum[i])

            W_ctx_sum[j] += grad_wc ** 2
            W_ctx[j] -= lr * grad_wc / np.sqrt(W_ctx_sum[j])

            b_sum[i] += grad_common ** 2
            b[i] -= lr * grad_common / np.sqrt(b_sum[i])

            b_ctx_sum[j] += grad_common ** 2
            b_ctx[j] -= lr * grad_common / np.sqrt(b_ctx_sum[j])

    return W + W_ctx  # sum of word and context vectors

Complete PyTorch implementation

python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class GloVeDataset(Dataset):
    def __init__(self, cooccur_data, x_max=100, alpha=0.75):
        """cooccur_data: list of (i, j, X_ij) tuples"""
        self.data = cooccur_data
        self.x_max = x_max
        self.alpha = alpha

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        i, j, x = self.data[idx]
        # Compute weight
        weight = (x / self.x_max) ** self.alpha if x < self.x_max else 1.0
        return (torch.tensor(i), torch.tensor(j),
                torch.tensor(x, dtype=torch.float32),
                torch.tensor(weight, dtype=torch.float32))


class GloVe(nn.Module):
    def __init__(self, V, d=300):
        super().__init__()
        self.W = nn.Embedding(V, d)       # word vectors
        self.W_ctx = nn.Embedding(V, d)   # context vectors
        self.b = nn.Embedding(V, 1)       # word biases
        self.b_ctx = nn.Embedding(V, 1)  # context biases

        # Initialize
        for param in self.parameters():
            nn.init.uniform_(param, -0.5/d, 0.5/d)

    def forward(self, i, j, x, weight):
        """
        i, j:     (batch,) — word and context indices
        x:        (batch,) — co-occurrence counts
        weight:   (batch,) — f(X_ij) weights
        """
        w_i = self.W(i)           # (batch, d)
        w_j = self.W_ctx(j)      # (batch, d)
        b_i = self.b(i).squeeze(1)      # (batch,); squeeze only dim 1 so a batch of size 1 keeps its shape
        b_j = self.b_ctx(j).squeeze(1)  # (batch,)

        # Prediction: w_i . w_j + b_i + b_j
        pred = (w_i * w_j).sum(dim=1) + b_i + b_j

        # Target: log(X_ij)
        target = torch.log(x)

        # Weighted least squares loss
        loss = weight * (pred - target) ** 2
        return loss.mean()

    def get_vectors(self):
        """Return W + W_ctx as final word vectors."""
        return (self.W.weight + self.W_ctx.weight).detach()

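# Training
# `cooccur_data` is assumed to be the list of (i, j, X_ij) tuples built earlier;
# the batch size is an illustrative choice.
dataset = GloVeDataset(cooccur_data)
dataloader = DataLoader(dataset, batch_size=4096, shuffle=True)

model = GloVe(V=400000, d=300)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

for epoch in range(100):
    for i, j, x, w in dataloader:
        loss = model(i, j, x, w)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

vectors = model.get_vectors()  # (400000, 300)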

Why does GloVe use the sum W + W̃ as the final word vectors instead of just W?

The objective treats word and context vectors symmetrically (since X is symmetric), so both W and W̃ are equally valid estimates of word meaning. Summing them reduces variance, like averaging two independent estimates of the same quantity.
W̃ contains information about context that W doesn't.
The sum doubles the embedding dimension.

Chapter 6: The Word2Vec Connection

GloVe's paper made a provocative claim: Word2Vec (Skip-gram with negative sampling) is implicitly factorizing a co-occurrence matrix. The two methods are more similar than they appear.

What Skip-gram NEG implicitly optimizes

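Levy and Goldberg (2014) showed that Skip-gram with negative sampling, at the optimum of its objective and with enough dimensions to fit it exactly, satisfies:

w_i^Tw̃_j = PMI(i, j) − log k

where k is the number of negative samples per positive pair (not the probe-word index used in earlier chapters) and PMI is the pointwise mutual information: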

PMI(i, j) = log(P(i, j) / (P(i) · P(j))) = log(X_ij · |D| / (X_i · X_j))

And GloVe's model says:

w_i^Tw̃_j + b_i + b̃_j = log X_ij

These are related! If GloVe's biases absorb the log X_i and log X_j terms:

b_i ≈ log X_i, b̃_j ≈ log X_j

Then:

w_i^Tw̃_j ≈ log X_ij − log X_i − log X_j = log(X_ij / (X_i · X_j)) = PMI(i, j) − log |D|

Both models learn dot products that approximate PMI (or log co-occurrence). The difference is in how they optimize:

Property	Word2Vec (Skip-gram NEG)	GloVe
Training data	Raw text (online, streaming)	Co-occurrence matrix (precomputed)
Objective	Binary classification (pos vs neg)	Weighted least squares on log X
Implicit target	PMI − log k	log X_ij (with biases absorbing marginals)
Zero entries	Handled via negative sampling	Skipped (f(0) = 0)
Weighting	Sampling frequency (noise dist)	Explicit f(X_ij) function
Optimizer	SGD on streaming text	AdaGrad on shuffled matrix entries

The unification: Both methods learn word vectors whose dot products approximate (shifted, weighted) log co-occurrence statistics. The "neural network" in Word2Vec is doing matrix factorization in disguise. The "global matrix" in GloVe is doing essentially the same prediction task, just from a different angle. This realization unified the field and showed that the count-vs-predict distinction was largely artificial.

Three key differences that still matter

Despite the mathematical similarity, some practical differences remain:

1. How they handle zero co-occurrences: This is perhaps the most important distinction. GloVe skips all entries where X_ij = 0 (since log(0) is undefined). Word2Vec handles zeros implicitly through negative sampling — every negative sample is drawn from the noise distribution, which naturally represents the "background" of non-co-occurring words. GloVe never explicitly learns that two words don't co-occur; Word2Vec does, through its negative samples.

2. Online vs. batch: Word2Vec processes the corpus as a stream. It can train on corpora that don't fit in memory. GloVe must first compute the full co-occurrence matrix, which requires a pass over the corpus and ~10 GB of storage for large vocabularies. For very large, streaming datasets, Word2Vec is more practical.

3. Weighting scheme: GloVe's weighting function f(X_ij) explicitly caps the influence of very frequent pairs. Word2Vec's analog is the subsampling of frequent words and the noise distribution exponent (3/4 power). Both achieve similar effects but through different mechanisms.

Worked example: the equivalence

Consider word i = "cat" and word j = "fluffy" in a corpus where:

X_{cat, fluffy} = 200, X_cat = 50,000, X_fluffy = 10,000, |D| = 10⁹

GloVe's target:

log X_ij = log(200) = 5.30

w_cat^T w̃_fluffy + b_cat + b̃_fluffy ≈ 5.30

Skip-gram NEG's implicit target (Levy & Goldberg):

PMI(cat, fluffy) = log(200 × 10⁹ / (50,000 × 10,000)) = log(400) = 5.99

w_cat^T w̃_fluffy ≈ PMI − log(k) = 5.99 − log(5) = 5.99 − 1.61 = 4.38

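The two targets land in the same ballpark (roughly 4 to 6 for this moderately co-occurring pair), but they are not directly comparable: GloVe's 5.30 is the target for the dot product plus both biases, while Skip-gram's 4.38 is the target for the dot product alone. The gap is absorbed by GloVe's biases and the constant shifts.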
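The numbers in this worked example can be reproduced with the standard library. The last line is a sketch of the idealized case in which GloVe's biases equal log X_i and log X_j exactly, which is an assumption for illustration; trained biases only approximate this:

```python
import math

x_ij, x_i, x_j, total_pairs = 200, 50_000, 10_000, 10**9
k_neg = 5  # number of negative samples, as in the example above

glove_target = math.log(x_ij)                     # dot product + biases
pmi = math.log(x_ij * total_pairs / (x_i * x_j))  # PMI(cat, fluffy)
sgns_target = pmi - math.log(k_neg)               # dot product alone

# Idealized case: biases are exactly log X_i and log X_j, so the GloVe
# dot product alone equals PMI - log|D|.
glove_dot_ideal = glove_target - math.log(x_i) - math.log(x_j)
print(glove_target, pmi, sgns_target, glove_dot_ideal)
```

In that idealized case the two dot products differ only by constants (log |D| on the GloVe side and log k on the Skip-gram side), so both are shifted versions of the same PMI value.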

Skip-gram ↔ GloVe: Same Geometry

Both methods produce vectors in the same region of vector space. The scatter plot shows GloVe vs. Skip-gram dot products for word pairs — they correlate strongly.

What does Skip-gram with negative sampling implicitly factorize?

A shifted PMI matrix: the dot product w_i · w_j converges to PMI(i,j) − log(k), where PMI is the pointwise mutual information of word co-occurrences and k is the number of negative samples
The raw co-occurrence count matrix X
A TF-IDF matrix

Chapter 7: Results

GloVe was evaluated on three tasks: word analogies, word similarity, and named entity recognition (NER).

Word analogy task

Using the standard analogy test set (semantic + syntactic, ~19K questions):

Model	Dim	Training Data	Accuracy %
SVD (on X)	300	6B	36.7
SVD (on log X)	300	6B	54.6
Word2Vec (SG)	300	6B	65.6
Word2Vec (CBOW)	300	6B	63.6
GloVe	300	6B	71.7
GloVe	300	42B	75.0

GloVe at 300 dimensions on 6B words outperforms Word2Vec on the same data by 6 points. On 42B words (Common Crawl), it reaches 75% — a large improvement from the 65.6% of Word2Vec on 6B.

Word similarity task

Spearman correlation between model's cosine similarities and human similarity judgments on standard datasets (WordSim-353, etc.). GloVe achieved 0.769 on WordSim-353, competitive with the best Word2Vec models.

Named Entity Recognition

Using word vectors as features for a CRF-based NER system on CoNLL-2003:

Features	F1 Score
Discrete features only	88.4
+ SVD vectors (d=50)	89.3
+ Word2Vec (d=50)	90.1
+ GloVe (d=50)	90.5
+ GloVe (d=300)	91.2

GloVe vectors as additional features improved NER F1 from 88.4 to 91.2 — a substantial gain from unsupervised pretraining.

Scaling behavior

The paper studied how performance scales with corpus size, vector dimension, window size, and training time:

Corpus size: Monotonically increasing returns. 6B → 42B improved analogy accuracy from 71.7% to 75.0%.
Dimension: Accuracy increases until d ≈ 300, then plateaus. Higher dimensions provide diminishing returns.
Window size: Larger windows (c = 10) outperform smaller windows (c = 5) on semantic analogies. Smaller windows are better for syntactic analogies.
Training epochs: More epochs help up to ~100 for small corpora; for large corpora, a single pass suffices.

GloVe vs. Word2Vec: Analogy Accuracy

Comparison across methods and training data sizes. GloVe consistently outperforms Word2Vec on the same data.

Training efficiency: GloVe on 6B words with d=300 converges in about 50 iterations over ~1B nonzero entries. Total training time: a few hours on 8 CPU cores. The co-occurrence matrix computation adds an upfront cost, but the optimization parallelizes well: the loss is a sum of separate terms, one per nonzero (i, j) entry, and each update touches only the two vectors and two biases for that pair, so threads working on different entries rarely collide. This makes GloVe well-suited to multi-core training.

Dimension vs. accuracy: the plateau

The paper's experiments on Wikipedia+Gigaword 6B showed a clear pattern:

Dimension d	Analogy accuracy %	Training time (relative)
50	54.3	0.2x
100	64.0	0.4x
200	69.1	0.7x
300	71.7	1.0x
400	72.3	1.3x
500	72.5	1.7x

Going from 50 to 300 dimensions adds 17 points. Going from 300 to 500 adds only 0.8 points. The "sweet spot" is d = 300 — which is why most pre-trained word vectors use this dimension.

Window size: semantic vs. syntactic

An interesting finding: window size affects what the vectors learn:

Small window (c = 2-5): Vectors capture syntactic similarity. Words with the same part of speech cluster together. "Running," "jumping," "swimming" are close.
Large window (c = 10-15): Vectors capture semantic similarity. Words with related meaning cluster together. "Dog," "cat," "pet" are close, even though they have different syntactic roles.

This makes sense: a word's immediate neighbors are syntactically constrained (adjectives before nouns, determiners before adjectives), while distant words in the same sentence are thematically related. GloVe's default c = 10 favors semantic similarity, which is more useful for most downstream tasks.

GloVe on different corpora

The paper tested GloVe on multiple corpora to study the effect of data quality and quantity:

Corpus	Tokens	Analogy Accuracy %	Notes
Wikipedia 2014	1.6B	64.7	Clean, encyclopedic text
Wikipedia + Gigaword 5	6B	71.7	Clean + newswire
Common Crawl (42B)	42B	75.0	Noisy but massive
Common Crawl (840B)	840B	—	Even noisier; used for released vectors

Two observations from this table: (1) Accuracy rises with every increase in corpus size, even when the additional data is noisy web text. (2) The gains are sublinear: going from 1.6B to 6B tokens adds 7 points (64.7% to 71.7%), while going from 6B to 42B, a 7x increase, adds only 3.3 points. Because the table changes corpus size and source together, it cannot by itself separate the effect of data quality from quantity.

Sentence-level and document-level usage

While GloVe produces word-level vectors, they can be composed into sentence or document representations:

python
import numpy as np

# Simple document vector: average word vectors
def doc_vector(words, glove, dim=300):
    vecs = [glove[w] for w in words if w in glove]
    if len(vecs) == 0:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# TF-IDF weighted average (better for information retrieval)
def weighted_doc_vector(words, glove, idf_weights, dim=300):
    vecs, weights = [], []
    for w in words:
        if w in glove and w in idf_weights:
            vecs.append(glove[w])
            weights.append(idf_weights[w])
    if len(vecs) == 0:
        return np.zeros(dim)
    return np.average(vecs, axis=0, weights=weights)

Simple averaging works surprisingly well for short texts (tweets, sentences). For longer documents, TF-IDF weighting or SIF (Smooth Inverse Frequency) weighting by Arora et al. (2017) gives significant improvements by down-weighting common words.
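
A simplified sketch of SIF-style weighting follows. It assumes the form described by Arora et al.: each word is weighted by a / (a + p(w)), and the projection onto the first singular vector of the sentence matrix is then removed. The function name, the toy vectors, and the toy word probabilities are made up for illustration; a = 1e-3 is a commonly quoted default rather than a value from the text above.

```python
import numpy as np

def sif_vectors(sentences, vectors, word_prob, a=1e-3):
    """Weight each word by a / (a + p(w)), average per sentence,
    then remove the projection onto the first singular vector."""
    dim = len(next(iter(vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for row, words in enumerate(sentences):
        known = [w for w in words if w in vectors and w in word_prob]
        if not known:
            continue  # sentences with no known words keep an all-zero row
        weights = np.array([a / (a + word_prob[w]) for w in known])
        emb[row] = weights @ np.array([vectors[w] for w in known]) / len(known)
    # First right singular vector: the dominant direction shared by all sentences
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    pc = vt[0]
    return emb - np.outer(emb @ pc, pc)

rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=4) for w in ["the", "cat", "sat", "on", "mat"]}
toy_prob = {"the": 0.05, "on": 0.01, "cat": 1e-4, "sat": 1e-4, "mat": 1e-5}
sents = [["the", "cat", "sat"], ["the", "cat", "sat", "on", "the", "mat"], ["unknown"]]

out = sif_vectors(sents, toy_vectors, toy_prob)
print(out.shape)
```

With these toy probabilities, "the" receives a weight of about 0.02 while "mat" receives about 0.99, which is the down-weighting of common words that the paragraph above describes.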

GloVe vs. Word2Vec: a detailed comparison

Having studied both methods in detail, here is a comprehensive comparison:

Aspect	Word2Vec (Skip-gram NEG)	GloVe
Objective	Binary classification (pos vs neg)	Weighted least squares on log X
Data format	Raw text corpus (streaming)	Co-occurrence matrix (precomputed)
Zero co-occurrences	Handled via negative sampling	Skipped (f(0) = 0)
Distance weighting	Implicit: the effective window size is sampled uniformly from 1..c, so nearer words are seen more often	1/d harmonic weighting
Frequency handling	Subsampling + noise dist f^0.75	Weighting function f(X) = min((X/100)^0.75, 1)
Optimizer	SGD with linear LR decay	AdaGrad
Parallelism	Hogwild (async, lock-free)	Multithreaded, lock-free updates over shuffled nonzero entries
Memory	O(V · d) — just embeddings	O(V · d + nnz) — embeddings + sparse matrix
Final vectors	W_in (input embeddings)	W + W_ctx (sum of both)
Best accuracy (6B)	66% (d=1000)	72% (d=300)
Best accuracy (42B)	~70% (estimated)	75% (d=300)

In practice, the difference between GloVe and Word2Vec is small on most downstream tasks. The choice often depends on:

Use GloVe if: You need pre-trained vectors quickly (excellent pre-trained releases available), you want consistent training behavior, or you can afford to build and store the co-occurrence matrix for your corpus.
Use Word2Vec if: You're training on custom/streaming data, you need to handle very large corpora that don't fit in memory, or you want the simpler gensim API.

What both methods get wrong

Both GloVe and Word2Vec share fundamental limitations that contextual embeddings (ELMo, BERT) would later address:

Polysemy: "Bank" gets one vector, regardless of whether it means "river bank" or "financial bank." Each word type has exactly one representation.
Compositionality: Neither method has a principled way to compose word vectors into phrase or sentence meanings. Simple averaging works for some tasks but fails for negation, conditionals, and complex syntax.
Out-of-vocabulary: Words not in the training vocabulary have no representation. Misspellings, neologisms, and rare technical terms are invisible.
Position insensitivity: Under bag-of-words composition such as averaging, "The dog bit the man" and "The man bit the dog" get identical representations (both contain the same words), even though they have opposite meanings.

These limitations motivated the move to contextualized representations, where each token gets a different vector depending on the full input sequence. But GloVe's core insight — that ratios of co-occurrence probabilities encode meaning — remains foundational to understanding how all embedding methods work.
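
The position-insensitivity limitation is easy to check. The sketch below uses random stand-in vectors (an assumption; any static embedding table behaves the same way) and shows that averaging gives both sentences the same representation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for pre-trained vectors: one fixed vector per word type
toy = {w: rng.normal(size=8) for w in ["the", "dog", "bit", "man"]}

def average_vector(sentence):
    """Bag-of-words sentence vector: mean of the word vectors, order ignored."""
    return np.mean([toy[w] for w in sentence.lower().split()], axis=0)

a = average_vector("The dog bit the man")
b = average_vector("The man bit the dog")
print(np.allclose(a, b))   # True: word order leaves the average unchanged
```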

GloVe's influence on modern AI

GloVe's ideas have been absorbed into modern deep learning in subtle ways:

Embedding tables everywhere. Every transformer (GPT, BERT, LLaMA) starts with an embedding lookup table, the same kind of matrix that GloVe and Word2Vec learn. The input embedding layer of a GPT-style model plays the same role as a Word2Vec/GloVe matrix; the difference is that it is trained end-to-end with the rest of the model.
Co-occurrence as training signal. Modern self-supervised methods (masked language modeling, next-token prediction) can be read as elaborate versions of "predict co-occurring words." BERT's [MASK] prediction resembles CBOW with a transformer in place of the averaged context. GPT's next-token prediction resembles a predictive word model extended to an arbitrary-length left context.
Log-bilinear models. GloVe's log-bilinear structure (dot product = log count) appears in matrix factorization recommender systems, topic models, and energy-based models.
The weighting function idea. GloVe's f(X_ij) is an early example of reweighting a loss so that frequent items do not dominate. Focal loss and class-balanced sampling address imbalanced training data in a similar spirit.

Practical recipe for choosing word embeddings (2024)

In modern NLP, the choice is usually straightforward:

Situation	Recommendation	Why
Full NLP pipeline	Use a transformer (BERT, LLaMA)	Contextual embeddings dominate on every benchmark
Simple baseline / prototype	Pre-trained GloVe 300d	Free, fast, no GPU needed, surprisingly competitive
Domain-specific (e.g., biomedical)	Train Word2Vec on domain corpus	Easy to train, captures domain terminology
Need OOV handling	FastText	Character n-grams handle any word
Multilingual	Multilingual BERT or FastText	Cross-lingual transfer
Edge deployment (no GPU)	Pre-trained GloVe 50d or 100d	Tiny memory footprint, CPU-friendly

GloVe and Word2Vec remain relevant as baselines, for education, and for resource-constrained settings. Their simplicity (just a lookup table of vectors) makes them deployable anywhere, including embedded devices, browsers, and mobile apps. A 50d GloVe table with a 400K vocabulary fits in 80 MB; BERT-base, with about 110M parameters, needs roughly 440 MB in float32 and far more compute per token. For many applications, the simpler model is good enough.

Evaluating word vectors: beyond analogies

The analogy task (king − man + woman = queen) became the standard evaluation, but it has limitations. The paper also evaluated on several other tasks:

Word similarity: Compute cosine similarity between word pairs and compare to human judgments using Spearman rank correlation. Datasets include WordSim-353, SimLex-999, and MEN. GloVe achieves 0.77 on WordSim-353 (human agreement is ~0.75).

Extrinsic evaluation: Use word vectors as features for a downstream task and measure task performance. The paper chose NER because it is well-understood and has standard benchmarks. The +2.8 F1 improvement from GloVe vectors (88.4 → 91.2) suggests that intrinsic quality (analogy accuracy) carries over to extrinsic performance, at least on this task.

python
# Evaluating word vectors on similarity task
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, test_pairs):
    """
    test_pairs: list of (word1, word2, human_score) tuples
    Returns: Spearman correlation with human judgments
    """
    model_sims, human_sims = [], []
    for w1, w2, human in test_pairs:
        if w1 not in vectors or w2 not in vectors:
            continue
        v1, v2 = vectors[w1], vectors[w2]
        cos_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_sims.append(cos_sim)
        human_sims.append(human)

    rho, p_value = spearmanr(model_sims, human_sims)
    return rho

# Typical results:
# GloVe 300d on WordSim-353:  rho = 0.769
# Word2Vec 300d on WordSim-353: rho = 0.720
# Human agreement: rho ≈ 0.750

The co-occurrence matrix as a window into language

Building and examining the co-occurrence matrix reveals fascinating properties of natural language:

Zipfian distribution: A tiny fraction of word pairs account for most co-occurrences. The top 0.01% of pairs contain ~50% of all counts.
Extreme sparsity: 99.4% of the V × V matrix is zero. Most word pairs never co-occur within a window of 10 words.
Power-law decay: Co-occurrence counts follow a power law: most pairs co-occur once or twice, a few co-occur millions of times.
Approximate symmetry: X_ij ≈ X_ji (exact with symmetric windows). This is the property GloVe exploits when summing W + W_ctx.

These statistical properties explain why GloVe's design choices work: the weighting function handles the power-law distribution, the log transform handles the vast range of counts, and the biases handle the variation in word frequencies. Every design decision in GloVe is motivated by the empirical statistics of natural language co-occurrence.
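
A minimal sketch of the counting procedure makes these properties easy to inspect on small inputs. It assumes the harmonic 1/d weighting and symmetric window from Chapter 1; the function name and the dictionary-of-pairs storage are illustrative choices, not the reference implementation.

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Harmonic-weighted, symmetric-window co-occurrence counts as a sparse dict."""
    counts = defaultdict(float)
    for pos, center in enumerate(tokens):
        for d in range(1, window + 1):
            for ctx_pos in (pos - d, pos + d):
                if 0 <= ctx_pos < len(tokens):
                    counts[(center, tokens[ctx_pos])] += 1.0 / d
    return counts

tokens = "the cat sat on the mat".split()
X = cooccurrence(tokens, window=2)
vocab = sorted(set(tokens))

density = len(X) / len(vocab) ** 2            # fraction of nonzero cells
symmetric = all(abs(X[pair] - X.get(pair[::-1], 0.0)) < 1e-12 for pair in list(X))

print(X[("cat", "sat")], X[("cat", "on")], X[("sat", "the")])   # 1.0 0.5 1.0
print(f"nonzero pairs: {len(X)}, density: {density:.2f}, symmetric: {symmetric}")
```

A six-word sentence is far denser than a real corpus, where the vocabulary grows much faster than the set of pairs that ever share a window. Running the same function on a larger tokenized text is a quick way to watch the density fall.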

GloVe's theoretical contribution

Beyond the practical algorithm, GloVe made an important theoretical contribution to understanding word vectors. The paper argued that the right way to think about word meaning is through ratios of co-occurrence probabilities, not through raw probabilities or raw counts. This perspective has three consequences:

Explains linear structure: Ratios are multiplicative. In log space, multiplicative relationships become additive. Additive relationships in the vector space = linear structure. This explains why king − man + woman ≈ queen: the log-ratio of co-occurrence probabilities is additive.
Unifies count and predict: Both LSA (which uses counts) and Word2Vec (which predicts) are implicitly capturing the same ratios, just from different angles. GloVe's derivation shows that the ratio is the fundamental quantity; count-based and predictive methods are two ways to estimate it.
Motivates the objective: Most previous work used ad hoc objectives (SVD on raw counts, cross-entropy on predictions). GloVe's objective is derived from the ratio requirement. This principled derivation is what makes GloVe's paper theoretically influential, even for practitioners who prefer Word2Vec in practice.

The paper's equation F(w_i − w_j, w̃_k) = P_ik/P_jk is the starting point of its derivation. It addresses a question that Word2Vec left open: why do word vectors have linear structure? GloVe's answer: because the underlying semantic signal (co-occurrence ratios) is inherently log-linear, and the training objective preserves this structure.
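
The log-linear argument can be checked numerically under an idealized assumption: if P_ik is proportional to exp(w_i · w̃_k), then the log of a probability ratio equals a dot product with the difference vector. The vectors below are random placeholders, and the sketch ignores the per-word normalizers that the bias terms absorb in the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 5))        # word vectors for two hypothetical words i and j
w_ctx = rng.normal(size=(3, 5))    # context vectors for three hypothetical probe words k

# Idealized model: P_ik is proportional to exp(w_i . w~_k)
P = np.exp(w @ w_ctx.T)            # shape (2, 3), unnormalized

log_ratio = np.log(P[0] / P[1])              # log(P_ik / P_jk) for each probe word k
from_difference = (w[0] - w[1]) @ w_ctx.T    # (w_i - w_j) . w~_k

print(np.allclose(log_ratio, from_difference))   # True
```

Ratios of probabilities become differences of vectors, which is the sense in which directions such as king − man can carry meaning.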

How does GloVe compare to Word2Vec on the same training data (6B words, 300 dimensions)?

GloVe achieves 71.7% on word analogies vs. Word2Vec's 65.6%, a 6-point improvement, while also outperforming on downstream NER (F1 90.5 vs. 90.1)
They perform identically
Word2Vec outperforms GloVe significantly

Chapter 8: Practical Details

GloVe became one of the most widely used word embedding methods. Here are the practical details for training and using GloVe vectors.

Hyperparameters

Hyperparameter	Default	Effect
Dimension d	300	Higher is better up to ~300, then diminishing returns
Window size c	10	Larger for semantic tasks, smaller for syntactic
x_max	100	Cap for weighting function; higher = more weight to frequent pairs
α	0.75	Sublinear weighting exponent; 0.75 works well universally
Learning rate	0.05	With AdaGrad, which adapts per-parameter
Epochs	50-100	More for smaller corpora; 1 pass suffices for very large data
Min count	5	Words appearing fewer times are discarded

Pre-trained vectors

Stanford released pre-trained GloVe vectors that became the standard initialization for NLP models from 2014-2018:

Dataset	Tokens	Vocab	Dims available
Wikipedia + Gigaword 5	6B	400K	50, 100, 200, 300
Common Crawl (uncased)	42B	1.9M	300
Common Crawl (cased)	840B	2.2M	300
Twitter	27B	1.2M	25, 50, 100, 200

Using GloVe vectors in practice

python
import numpy as np

def load_glove(filepath, dim=300):
    """Load pre-trained GloVe vectors from text file."""
    word2vec = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            word = parts[0]
            # float32 matches the 4-bytes-per-value estimate; the default would be float64
            vec = np.array(parts[1:], dtype=np.float32)
            word2vec[word] = vec
    return word2vec

# Usage
glove = load_glove('glove.6B.300d.txt')
print(glove['king'].shape)  # (300,)

# Cosine similarity
def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(glove['king'], glove['queen']))    # related words: relatively high
print(cosine_sim(glove['king'], glove['banana']))   # unrelated words: much lower

# As PyTorch embedding initialization
import torch
import torch.nn as nn

# Assumes `vocab` maps each word to a row index and V = len(vocab)
embedding = nn.Embedding(V, 300)
for word, idx in vocab.items():
    if word in glove:
        embedding.weight.data[idx] = torch.tensor(glove[word])
# Freeze or fine-tune depending on task and dataset size

GloVe vs. Word2Vec in practice: Despite GloVe's theoretical advantages, many practitioners found Word2Vec and GloVe performed similarly on downstream tasks when trained on comparable data. The choice often came down to convenience: GloVe's pre-trained vectors were more readily available and easier to load (simple text format). Word2Vec's gensim library was easier to train on custom data. Both were superseded by contextual embeddings (ELMo 2018, BERT 2018) for most tasks.

Fine-tuning vs. freezing

When using pre-trained GloVe vectors as initialization for a downstream task, you have two options:

Strategy	When to use	Pros	Cons
Freeze	Small dataset (<10K examples)	Prevents overfitting; preserves pretrained quality	Cannot adapt to domain-specific usage
Fine-tune	Large dataset (>100K examples)	Adapts vectors to your specific task	May overfit on small data; loses general knowledge

A common compromise: start with frozen embeddings for a few epochs (letting the rest of the model warm up), then unfreeze and fine-tune with a small learning rate.

python
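# Common pattern: freeze then fine-tune
# Assumes V, vocab, glove, model (containing `embedding`) and train_epoch are defined elsewhere
import torch
import torch.nn as nn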

embedding = nn.Embedding(V, 300)

# Load GloVe vectors
for word, idx in vocab.items():
    if word in glove:
        embedding.weight.data[idx] = torch.tensor(glove[word])

# Phase 1: Freeze embeddings, train classifier layers
embedding.weight.requires_grad = False
for epoch in range(5):
    train_epoch(model)  # only classifier weights update

# Phase 2: Unfreeze, fine-tune everything with low LR
embedding.weight.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # 10x lower LR
for epoch in range(10):
    train_epoch(model)  # all weights update, including embeddings

GloVe file format

GloVe vectors are distributed as plain text files — one word per line, space-separated values. This is simpler than Word2Vec's binary format (which uses struct packing) but larger on disk.

text
# Format of glove.6B.300d.txt:
# Each line: word val1 val2 ... val300
the 0.418 0.249 -0.412 0.122 ... (300 floats)
, 0.013 0.189 -0.350 0.076 ...
. 0.152 0.304 -0.134 0.014 ...
of 0.370 0.210 -0.310 0.328 ...
# ... 400,000 lines total
# File size: 1.04 GB for 300d, 347 MB for 100d
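
Splitting each line on whitespace works when every token is a single field. As a more defensive sketch, the parser below treats the last `dim` fields as the vector and everything before them as the token, so a token containing spaces (reported by some users of the larger Common Crawl files; treat that as an assumption, not something stated above) does not break the float conversion. The function name and the sample lines are illustrative.

```python
import io
import numpy as np

def parse_glove_line(line, dim):
    """Split one line into (token, vector), treating the last `dim` fields as the vector."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) <= dim:
        raise ValueError(f"expected a token plus {dim} values, got {len(parts)} fields")
    token = " ".join(parts[:-dim])
    vector = np.asarray(parts[-dim:], dtype=np.float32)
    return token, vector

# Two made-up 3-dimensional lines; the second token contains spaces
sample = io.StringIO("the 0.418 0.249 -0.412\n. . . 0.1 0.2 0.3\n")
parsed = dict(parse_glove_line(line, dim=3) for line in sample)
print(list(parsed))   # ['the', '. . .']
```

The cost of this approach is that `dim` must match the file; passing the wrong value raises an error or yields wrong tokens rather than loading silently.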

Out-of-vocabulary (OOV) words

GloVe has no mechanism for words not in the pre-trained vocabulary. Common strategies for handling OOV words:

Random initialization: Assign a random vector. Works if OOV words are rare.
Zero vector: Simple but loses information — the model learns to ignore unknown words.
Average of all vectors: Use the mean of all word vectors as a "generic word" representation.
Use FastText instead: FastText builds embeddings from character n-grams, so it can construct vectors for any word, even misspellings and neologisms.
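
The first three strategies can be sketched as one helper that fills an embedding matrix row by row. The function name, the `strategy` argument, and the toy vocabulary are hypothetical; scaling the random vectors to the standard deviation of the known vectors is one reasonable choice among several.

```python
import numpy as np

def build_embedding_matrix(vocab, vectors, dim, strategy="mean", seed=0):
    """Return a (len(vocab), dim) float32 matrix; rows for OOV words follow `strategy`."""
    if strategy not in {"random", "zero", "mean"}:
        raise ValueError(f"unknown strategy: {strategy}")
    rng = np.random.default_rng(seed)
    known = np.array(list(vectors.values()), dtype=np.float32)
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)   # "zero" rows stay as they are
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]
        elif strategy == "random":
            # Match the scale of the known vectors so OOV rows are not outliers
            matrix[idx] = rng.normal(0.0, known.std(), size=dim)
        elif strategy == "mean":
            matrix[idx] = known.mean(axis=0)
    return matrix

toy_vectors = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
toy_vocab = {"cat": 0, "dog": 1, "catt": 2}   # "catt" is out of vocabulary

for strategy in ("random", "zero", "mean"):
    row = build_embedding_matrix(toy_vocab, toy_vectors, dim=2, strategy=strategy)[2]
    print(strategy, row)
```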

Memory requirements

Loading GloVe vectors requires substantial memory:

Memory = V × d × 4 bytes (float32)

glove.6B.300d: 400K × 300 × 4 = 480 MB

glove.840B.300d: 2.2M × 300 × 4 = 2.64 GB

For deployment in memory-constrained environments, techniques like quantization (float16 or int8) or dimensionality reduction (PCA from 300d to 100d) are common.
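
The estimates above, and the two reduction techniques, can be sketched with NumPy alone. The random matrix is a stand-in for a real embedding table, and PCA is done here through a plain SVD of the centered matrix; both are illustrative assumptions.

```python
import numpy as np

def embedding_bytes(vocab_size, dim, bytes_per_value=4):
    """Size of a dense embedding table, ignoring the vocabulary strings themselves."""
    return vocab_size * dim * bytes_per_value

print(embedding_bytes(400_000, 300) / 1e6)      # 480.0 (MB, float32)
print(embedding_bytes(2_200_000, 300) / 1e9)    # 2.64 (GB, float32)

rng = np.random.default_rng(0)
matrix = rng.normal(size=(500, 300)).astype(np.float32)   # stand-in for real vectors

# Quantization: float16 halves the footprint at the cost of precision
half = matrix.astype(np.float16)

# Dimensionality reduction: project onto the top 100 principal directions
centered = matrix - matrix.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:100].T

print(matrix.nbytes, half.nbytes, reduced.shape)
```

Note that a Python dict of per-word arrays costs more than this estimate because of per-object overhead; packing the vectors into one matrix plus a word-to-row index stays close to V × d × 4 bytes.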

Pre-trained GloVe: Nearest Neighbors

(Interactive demo in the original page: nearest neighbors of a query word in GloVe space, simulated with common examples.)
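
Outside the interactive page, nearest neighbors can be computed directly from a loaded vector dictionary. The sketch below is a brute-force cosine search; the function name and the three toy vectors are illustrative, and it assumes no vector has zero norm.

```python
import numpy as np

def nearest_neighbors(word, vectors, top_k=5):
    """Return the top_k (word, cosine similarity) pairs closest to `word`."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words], dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)   # unit-length rows
    sims = matrix @ matrix[words.index(word)]                 # cosine to every word
    order = np.argsort(-sims)                                 # most similar first
    return [(words[i], float(sims[i])) for i in order if words[i] != word][:top_k]

toy = {
    "king": np.array([1.0, 0.0]),
    "queen": np.array([0.9, 0.1]),
    "banana": np.array([0.0, 1.0]),
}
print(nearest_neighbors("king", toy, top_k=2))
```

For a full 400K-word table it is worth normalizing the matrix once and reusing it across queries rather than rebuilding it on every call as this sketch does.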

What ultimately replaced GloVe and Word2Vec as the standard word representation in NLP?

Contextual embeddings from ELMo (2018) and BERT (2018), which produce different vectors for the same word depending on context: "bank" gets different representations in "river bank" vs. "bank account"
Larger GloVe models with more dimensions
One-hot encodings with better feature engineering

Chapter 9: Connections

What GloVe built on

Latent Semantic Analysis (Deerwester et al., 1990): SVD on a term-document matrix. GloVe improved on LSA by using a word-word matrix instead of word-document, log co-occurrence instead of raw counts, and a weighted objective instead of unweighted SVD.

Word2Vec (Mikolov et al., 2013): The direct predecessor and competitor. GloVe explicitly positions itself as a synthesis of Word2Vec's strengths (local context prediction, linear structure) and LSA's strengths (global statistics). See our Word2Vec veanor and negative sampling veanor.

HAL (Hyperspace Analogue to Language, Lund & Burgess, 1996): An early word-word co-occurrence approach. GloVe's matrix is similar but uses harmonic weighting and a principled objective.

Why GloVe beats SVD on X

A natural question: if GloVe trains on the co-occurrence matrix X, why not just do SVD directly on X (or log X)? The paper tested this and found several reasons GloVe wins:

Weighting. SVD treats all entries equally. GloVe's f(X_ij) gives appropriate weight to each entry. Very frequent pairs don't dominate; rare pairs don't add noise.
Log transform. SVD on raw X performs poorly (36.7% accuracy). SVD on log X is much better (54.6%). GloVe naturally operates in log space (the target is log X_ij).
Zero entries. SVD treats every zero in the V² matrix as a target to reconstruct. GloVe skips zeros (f(0) = 0). This is computationally cheaper, and it avoids fitting the overwhelming majority of entries that record only that a pair was never observed, which for rare words is weak evidence.
Biases. GloVe's bias terms b_i, b̃_j absorb word frequency, separating it from semantic content. SVD has no analog — frequency contaminates the singular vectors.

python
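# Comparison: SVD on log(X) vs. GloVe
# Assumes X is a scipy.sparse matrix of co-occurrence counts, shape (V, V)
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# SVD approach: log(1 + x) maps zeros to zero, so the matrix stays sparse.
# (Calling X.toarray() would need V^2 dense entries, which is infeasible for V = 400K.)
X_log = csr_matrix(X, dtype=np.float64, copy=True)
X_log.data = np.log1p(X_log.data)            # transform the nonzero entries only
U, S, Vt = svds(X_log, k=300)                # truncated SVD
svd_vectors = U * np.sqrt(S)                 # (V, 300)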

# SVD accuracy: ~55% on analogies
# GloVe accuracy: ~72% on analogies
# Why the gap? Weighting, biases, and skip-zero optimization

What GloVe enabled

The embedding revolution (2014-2018): GloVe's pre-trained 300d vectors became the default initialization for nearly every NLP model: sentiment analysis, NER, relation extraction, question answering. The idea of "transfer learning from unsupervised pretraining" was proven by GloVe and Word2Vec before it was scaled up by ELMo and BERT.

Levy & Goldberg (2014): Showed that Skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, which ties it to GloVe's factorization of log co-occurrence counts. This narrowed the count-vs-predict debate and led to a deeper understanding of what makes word embeddings work.

FastText (2017), ELMo (2018), BERT (2018): Each extended the word embedding paradigm in its own direction. FastText added subwords. ELMo added context-dependence. BERT added bidirectional context with transformers. All stood on the foundation that GloVe and Word2Vec established.

The evolution of word representations

Era	Method	Representation	Key property
1990s	LSA (SVD on counts)	Static, count-based	Global statistics, poor analogies
2013	Word2Vec	Static, prediction-based	Linear analogies, local context
2014	GloVe	Static, hybrid	Global + local, best analogies
2017	FastText	Static, subword-based	OOV handling, morphology
2018	ELMo	Contextual (LSTM)	Different vectors per context
2018	BERT	Contextual (Transformer)	Bidirectional, fine-tunable
2020+	GPT-3/4, LLaMA	Contextual (large Transformer)	In-context learning, emergent abilities

Each step built on the previous. GloVe showed that the choice of objective matters at least as much as the algorithm. BERT showed that context matters more than static vectors. Large GPT-style models suggested that scale can matter more than architectural detail. But the foundational insight, that useful representations emerge from prediction tasks on raw text, traces directly back to Word2Vec and GloVe.

GloVe's influence on loss design

GloVe's weighted least-squares objective influenced later work in surprising ways:

The weighting function idea (using a sublinear function to prevent frequent items from dominating) has close relatives in focal loss (Lin et al., 2017) and class-balanced losses, which also reweight training terms to counter imbalance.
Log-bilinear models — where the dot product of two representations equals the log of a count — appear in recommendation systems (matrix factorization for ratings) and topic models.
Using biases to absorb marginals — letting bias terms capture per-item popularity rather than forcing the main parameters to learn it — is standard practice in collaborative filtering.
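
The three ingredients in this list (weighting, log-bilinear fit, biases) come together in the objective itself. The sketch below is a minimal NumPy version of the weighted least-squares loss with per-entry AdaGrad updates, run on a made-up six-entry co-occurrence list. It is not the reference implementation: the function names, the toy data, and starting the squared-gradient accumulators at 1 are all assumptions chosen to keep the example short and stable.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x) = min((x / x_max)^alpha, 1)."""
    return min((x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, triples):
    """Sum over nonzero entries of f(x) * (w_i . w~_j + b_i + b~_j - log x)^2."""
    return sum(
        weight(x) * (W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x)) ** 2
        for i, j, x in triples
    )

def adagrad_epoch(W, W_ctx, b, b_ctx, acc, triples, lr=0.05):
    """One pass over the nonzero entries, updating parameters in place."""
    for i, j, x in triples:
        err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x)
        scale = 2.0 * weight(x) * err                     # shared gradient factor
        grad_w, grad_c = scale * W_ctx[j], scale * W[i]   # computed before any update
        acc["W"][i] += grad_w ** 2
        acc["C"][j] += grad_c ** 2
        acc["b"][i] += scale ** 2
        acc["c"][j] += scale ** 2
        W[i] -= lr * grad_w / np.sqrt(acc["W"][i])
        W_ctx[j] -= lr * grad_c / np.sqrt(acc["C"][j])
        b[i] -= lr * scale / np.sqrt(acc["b"][i])
        b_ctx[j] -= lr * scale / np.sqrt(acc["c"][j])

rng = np.random.default_rng(0)
V, d = 4, 3
W, W_ctx = rng.uniform(-0.5, 0.5, (V, d)), rng.uniform(-0.5, 0.5, (V, d))
b, b_ctx = np.zeros(V), np.zeros(V)
acc = {"W": np.ones((V, d)), "C": np.ones((V, d)), "b": np.ones(V), "c": np.ones(V)}
triples = [(0, 1, 10.0), (1, 0, 10.0), (0, 2, 3.0), (2, 0, 3.0), (1, 3, 50.0), (3, 1, 50.0)]

before = glove_loss(W, W_ctx, b, b_ctx, triples)
for _ in range(200):
    adagrad_epoch(W, W_ctx, b, b_ctx, acc, triples)
after = glove_loss(W, W_ctx, b, b_ctx, triples)
print(f"loss before: {before:.3f}, after: {after:.3f}")

final_vectors = W + W_ctx   # the sum of word and context vectors
```

Only the listed triples are ever visited, which is the "skip zeros" property in code form: absent pairs contribute neither loss nor gradient.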

The full GloVe pipeline in code

bash
# The official GloVe training pipeline (Stanford release)
# Download from https://nlp.stanford.edu/projects/glove/

# Step 1: Build vocabulary
./vocab_count -min-count 5 < corpus.txt > vocab.txt
# Output: 400,000 words above threshold

# Step 2: Build co-occurrence matrix
./cooccur -window-size 10 -vocab-file vocab.txt < corpus.txt > cooccur.bin
# Output: ~1B nonzero entries, ~10 GB binary file

# Step 3: Shuffle (for SGD convergence)
./shuffle < cooccur.bin > cooccur.shuf.bin

# Step 4: Train
./glove -vector-size 300 -threads 8 -iter 50 \
  -eta 0.05 -alpha 0.75 -x-max 100 \
  -input-file cooccur.shuf.bin -vocab-file vocab.txt \
  -save-file vectors
# Output: vectors.txt (word + 300 floats per line)

python
# Using pre-trained GloVe with numpy (no dependencies)
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

# Common operations
glove = load_glove('glove.6B.300d.txt')

# Cosine similarity
def sim(a, b):
    return glove[a] @ glove[b] / (np.linalg.norm(glove[a]) * np.linalg.norm(glove[b]))

print(f"king-queen: {sim('king','queen'):.3f}")  # related pair: relatively high
print(f"cat-dog: {sim('cat','dog'):.3f}")      # related pair: relatively high

# Analogy: king - man + woman = ?
target = glove['king'] - glove['man'] + glove['woman']
target /= np.linalg.norm(target)
best = max(
    ((w, target @ v / np.linalg.norm(v)) for w, v in glove.items()
     if w not in {'king', 'man', 'woman'}),
    key=lambda x: x[1]
)
print(f"king - man + woman = {best[0]} ({best[1]:.3f})")
# Expected top answer: queen (the exact score depends on the vector file)

Known limitations of GloVe

No OOV handling: GloVe cannot produce vectors for words not in the training vocabulary. FastText solved this by using character n-grams.
Static vectors: Like Word2Vec, GloVe produces one vector per word regardless of context. "Bank" in "river bank" and "bank account" get identical representations.
Memory for co-occurrence matrix: For V = 2M words, the sparse matrix requires ~10-50 GB of storage. This limits accessibility compared to Word2Vec, which can stream the corpus.
Window-based context only: GloVe's co-occurrence matrix is built from local windows, not document-level co-occurrence. This means it can miss long-range topical relationships.
Bias amplification: Like all word embedding methods, GloVe encodes and potentially amplifies societal biases present in the training corpus. Bolukbasi et al. (2016) showed that widely used pre-trained word embeddings exhibit gender bias (for example, "programmer" sits closer to "man" than to "woman").

Cheat sheet

Core idea

Train on the co-occurrence matrix X so that w_i · w̃_j + b_i + b̃_j ≈ log X_ij

Key insight

Meaning is in the ratio P(k|ice)/P(k|steam), not the raw probabilities. Ratios → exponentials → log-bilinear model

Weighting

f(x) = min((x/x_max)^0.75, 1). Caps frequent pairs, ignores zero entries

Final vectors

W + W_ctx (sum of word and context vectors). 300d, trained by AdaGrad

Impact

75% analogy accuracy. Standard NLP initialization 2014-2018. Unified count-vs-predict debate

What is GloVe's core mathematical insight about how word meaning is encoded in co-occurrence statistics?

Meaning is encoded in the ratios of co-occurrence probabilities (P(k|ice)/P(k|steam)), not raw probabilities. Ratios far from 1 are discriminative. This leads to a log-bilinear model where dot products approximate log co-occurrence counts
Meaning is encoded in the raw co-occurrence counts directly
Meaning is encoded in the eigenvectors of the co-occurrence matrix

GloVe: Global Vectors

Chapter 0: Two Paradigms

Paradigm 1: Count-based methods

Paradigm 2: Predictive methods