Pennington, Socher, Manning (Stanford) — EMNLP 2014

GloVe: Global Vectors

The best of both worlds — combine the global statistical information of count-based methods like LSA with the local context-window power of Word2Vec, by training on the co-occurrence matrix with a clever weighted least-squares objective.

Prerequisites: Word2Vec + Matrix factorization + Least squares

Chapter 0: Two Paradigms

By 2014, there were two fundamentally different approaches to learning word vectors, and they seemed unrelated.

Paradigm 1: Count-based methods

Latent Semantic Analysis (LSA) and related methods build a large word-context co-occurrence matrix, then reduce its dimensionality using SVD (Singular Value Decomposition). They capture global statistics — how often every word appears with every other word across the entire corpus.

Paradigm 2: Predictive methods

Word2Vec trains a neural network to predict context words from a center word (or vice versa) using a sliding window. It captures local patterns — which words appear near each other in a window of 5-10 words.

The Stanford NLP group asked: why do these approaches give different results? Can we get the best of both? The answer was GloVe (Global Vectors) — a model that trains on the global co-occurrence matrix but uses a log-bilinear objective that produces the same kind of linear structure as Word2Vec.

Two Paradigms Compared

Count-based methods look at the whole matrix at once; predictive methods slide a window. GloVe combines both.

GloVe's thesis: The analogy structure (king − man + woman ≈ queen) is not unique to neural network prediction. It emerges from any method that captures the right statistics. GloVe shows that training directly on ratios of co-occurrence probabilities is the key ingredient — and you can do this with a simple, non-neural weighted least-squares objective that trains faster than Word2Vec.
What are the respective weaknesses of count-based (LSA) and predictive (Word2Vec) methods?

Chapter 1: The Co-occurrence Matrix

GloVe starts by building a word-word co-occurrence matrix X from the entire corpus. Entry $X_{ij}$ counts how many times word j appears in the context of word i.

Building the matrix

Slide a symmetric window of size c around each word. For each (center, context) pair, increment $X_{\text{center},\,\text{context}}$. But with a twist: GloVe uses harmonic weighting, so context words that are farther from the center contribute less. A word at distance d contributes 1/d to the count.

$X_{ij} = \sum_{\text{positions where } i \text{ is center}} \; \sum_{d=1}^{c} \frac{1}{d} \cdot [\text{word at distance } d \text{ is } j]$

Here [·] equals 1 when the condition holds and 0 otherwise, and the words at distance d on both sides of the center are counted.

After processing the entire corpus, X is a V × V matrix. It is sparse — most word pairs never co-occur — and symmetric (Xij = Xji for symmetric windows).

Key derived quantities

$X_i = \sum_k X_{ik}$

The total co-occurrence count for word i (sum of row i).

$P_{ij} = P(j \mid i) = X_{ij} / X_i$

The probability that word j appears in the context of word i.

Sparsity and scale

For a real corpus:

| Corpus | Vocab V | Matrix size V² | Nonzero entries | Density |
|---|---|---|---|---|
| Wikipedia 6B | 400K | 160 billion | ~1 billion | ~0.6% |
| Common Crawl 42B | 1.9M | 3.6 trillion | ~10 billion | ~0.3% |

The matrix is extremely sparse. GloVe only processes nonzero entries, which is why it scales to enormous vocabularies. The co-occurrence matrix is stored as a sparse data structure (list of (i, j, X_ij) triples), not as a dense V × V array.

Worked example

For the sentence "the cat sat on the mat" with c = 2 and harmonic weighting:

X[cat, sat] = 1/1 = 1.0   (distance 1)
X[cat, the] = 1/1 = 1.0   (distance 1, first "the")
X[cat, on] = 1/2 = 0.5   (distance 2)

After scanning the entire sentence, each cell accumulates contributions from all co-occurrence events. The harmonic weighting 1/d naturally down-weights distant context words.
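
The counts in this worked example can be reproduced with a few lines of Python. This is a minimal sketch rather than the reference GloVe implementation: the helper name `cooccurrence_for_sentence` and the dict-of-dicts layout are illustrative choices. It also computes the derived quantities $X_i$ and $P(j \mid i)$ defined earlier in this chapter.

```python
from collections import defaultdict

def cooccurrence_for_sentence(tokens, window=2):
    """Harmonically weighted co-occurrence counts for one tokenized sentence."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, center in enumerate(tokens):
        for d in range(1, window + 1):
            # Look d steps to the left and d steps to the right of the center
            for ctx_pos in (pos - d, pos + d):
                if 0 <= ctx_pos < len(tokens):
                    X[center][tokens[ctx_pos]] += 1.0 / d
    return X

tokens = "the cat sat on the mat".split()
X = cooccurrence_for_sentence(tokens, window=2)

# Row sum X_i and conditional probabilities P(j | i) = X_ij / X_i for i = "cat"
X_cat = sum(X["cat"].values())
P_given_cat = {j: x / X_cat for j, x in X["cat"].items()}

print(dict(X["cat"]))   # counts for the row "cat"
print(X_cat)            # row sum
print(P_given_cat)      # normalized row
```

The row for "cat" contains exactly the three entries listed above, and dividing by the row sum turns them into probabilities that add up to 1.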

Why harmonic weighting? A word 5 positions away from the center is weaker context than a word 1 position away. The 1/d weighting captures this. Without it, all positions within the window would be treated equally, losing positional information. Word2Vec gets a similar effect implicitly: its reference implementation samples the effective window size at random for each center word, so nearby words serve as context more often. GloVe instead encodes distance information in the co-occurrence counts themselves.

[Interactive figure: Co-occurrence Matrix Builder. The matrix is built as a window slides across a sentence; brighter cells indicate higher co-occurrence counts, and the window size c is adjustable.]

python
import numpy as np
from collections import defaultdict

def build_cooccurrence(corpus, vocab, window=10):
    """
    corpus: list of list of int (tokenized, vocab-indexed sentences)
    vocab: dict mapping word -> index
    window: context window size
    Returns: sparse co-occurrence matrix (dict of dicts)
    """
    V = len(vocab)
    cooccur = defaultdict(lambda: defaultdict(float))

    for sentence in corpus:
        for i, center in enumerate(sentence):
            for d in range(1, window + 1):
                weight = 1.0 / d  # harmonic weighting
                for offset in [-d, d]:
                    j = i + offset
                    if 0 <= j < len(sentence):
                        context = sentence[j]
                        cooccur[center][context] += weight

    return cooccur  # cooccur[i][j] = X_ij

# For a corpus of 6B tokens with V=400K words:
# - X has ~400K x 400K = 160B entries
# - But only ~1B are nonzero (0.6% density)
# - Stored as sparse matrix: ~10 GB

Comparison: GloVe vs. Word2Vec co-occurrence

Both methods use a sliding context window, but handle the counts differently:

| Property | Word2Vec (Skip-gram) | GloVe |
|---|---|---|
| Counting | Implicit (each pair is a training example) | Explicit (build X matrix first) |
| Distance weighting | Implicit (the effective window size is sampled per center word, so closer words are seen more often) | Explicit 1/d harmonic weighting |
| Frequency weighting | Subsampling of frequent words + noise distribution ∝ freq^0.75 | f(X_ij) = min((X_ij/x_max)^0.75, 1) |
| Memory | O(V · d) for embeddings only | O(nonzero entries) for X + O(V · d) for vectors |
| Passes over data | Streams the corpus once (or a few times) | One pass to build X, then iterates on X |

Why does GloVe use 1/d harmonic weighting when building the co-occurrence matrix?

Chapter 2: The Ratio Insight

This chapter contains GloVe's most important contribution: the insight that word meaning is encoded not in raw co-occurrence probabilities, but in their ratios.

The ice/steam example

Consider two target words: "ice" and "steam." We want to learn vectors that capture their relationship. Let's look at their co-occurrence probabilities with various probe words k:

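| Probe word k | $P(k \mid \text{ice})$ | $P(k \mid \text{steam})$ | $P(k \mid \text{ice}) / P(k \mid \text{steam})$ |
|---|---|---|---|
| solid | $1.9 \times 10^{-4}$ | $2.2 \times 10^{-5}$ | 8.9 (large: "solid" is much more ice-like) |
| gas | $6.6 \times 10^{-5}$ | $7.8 \times 10^{-4}$ | 0.085 (small: "gas" is much more steam-like) |
| water | $3.0 \times 10^{-3}$ | $2.2 \times 10^{-3}$ | 1.36 (near 1: "water" is related to both) |
| fashion | $1.7 \times 10^{-5}$ | $1.8 \times 10^{-5}$ | 0.96 (near 1: "fashion" is unrelated to both) |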

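The raw probabilities P(k | ice) and P(k | steam) are hard to interpret in isolation: they're tiny numbers that depend on the overall frequency of each word. But the ratio tells a clear story. It is much greater than 1 for a probe word related to ice but not steam (solid), much less than 1 for a probe word related to steam but not ice (gas), and close to 1 for a probe word related to both (water) or to neither (fashion).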

The key insight: The information that distinguishes "ice" from "steam" is encoded in the ratio of co-occurrence probabilities, not the probabilities themselves. GloVe's objective is designed to learn word vectors that capture these ratios. This is why GloVe produces vectors with linear analogy structure — the ratio is a multiplicative relationship, and in log space, it becomes additive. Additive relationships in log space = linear structure in the embedding.
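
To make the "multiplicative becomes additive in log space" point concrete, here is a small sketch that recomputes the ratios and their logarithms from the ice/steam table. The dictionary name `probs` is an illustrative choice. Because the table's probabilities are rounded, the recomputed ratios differ slightly from the ratio column (for "solid", about 8.6 rather than 8.9).

```python
import math

# Probabilities copied from the ice/steam table: (P(k | ice), P(k | steam))
probs = {
    "solid":   (1.9e-4, 2.2e-5),
    "gas":     (6.6e-5, 7.8e-4),
    "water":   (3.0e-3, 2.2e-3),
    "fashion": (1.7e-5, 1.8e-5),
}

log_ratios = {}
for k, (p_ice, p_steam) in probs.items():
    ratio = p_ice / p_steam
    # The log of a ratio is a difference of logs: this is the additive form
    log_ratios[k] = math.log(p_ice) - math.log(p_steam)
    print(f"{k:8s} ratio={ratio:7.3f}  log-ratio={log_ratios[k]:+.2f}")
```

Discriminative probe words end up with log-ratios far from zero (positive for ice-like, negative for steam-like), while "water" and "fashion" sit near zero.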

More ratio examples

The ratio insight generalizes beyond ice/steam. Consider "cat" vs. "dog":

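| Probe word k | $P(k \mid \text{cat})$ | $P(k \mid \text{dog})$ | Ratio | Interpretation |
|---|---|---|---|---|
| purr | $5.2 \times 10^{-5}$ | $1.3 \times 10^{-6}$ | 40 | Strongly cat-like |
| bark | $2.1 \times 10^{-6}$ | $8.7 \times 10^{-5}$ | 0.024 | Strongly dog-like |
| pet | $4.1 \times 10^{-4}$ | $3.8 \times 10^{-4}$ | 1.08 | Equally related to both |
| algorithm | $1.0 \times 10^{-6}$ | $0.9 \times 10^{-6}$ | 1.11 | Unrelated to both |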

The ratio cleanly separates cat-specific words (ratio ≫ 1) from dog-specific words (ratio ≪ 1), and both from words with ratio ≈ 1. Within that last group, the size of the probabilities distinguishes shared pet words (large probabilities) from irrelevant words (tiny probabilities).

Why raw probabilities fail

If we tried to use raw P(k | ice) instead of ratios, we'd face a problem: the probabilities depend on the overall frequency of the probe word k. A very common probe word like "the" has a large P("the" | i) for almost every target word i, not because "the" is semantically related to everything, but because it appears in so many contexts. The ratio P(k | ice) / P(k | steam) cancels out this frequency effect, isolating the semantic signal.

From ratios to vectors

GloVe's design requirement: the word vectors should encode the co-occurrence probability ratios. Specifically, we want a function F such that:

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$

where $w_i$, $w_j$ are target word vectors and $\tilde{w}_k$ is a context word vector. The function F takes three vectors and should produce the probability ratio. The next chapter derives what F must be.

Co-occurrence Probability Ratios

The ratio P(k|ice)/P(k|steam) reveals which probe words discriminate between "ice" and "steam." Ratios far from 1 are discriminative; ratios near 1 are uninformative.

Why are co-occurrence probability ratios more informative than raw probabilities for distinguishing word meanings?

Chapter 3: Deriving the Objective

This is the mathematical heart of GloVe. We start from the ratio requirement and derive the training objective step by step.

Step 1: The ratio should depend on vector differences

We want $F(w_i, w_j, \tilde{w}_k) = P_{ik}/P_{jk}$. Since the ratio captures how word i differs from word j with respect to context k, it should depend on the difference $w_i - w_j$:

$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$

Step 2: The output is a scalar

F takes two vectors ($w_i - w_j$ and $\tilde{w}_k$) and produces a scalar (the ratio). The simplest way: use the dot product.

$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$

Step 3: F must be a homomorphism

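The right side is a ratio: $P_{ik}/P_{jk}$. The left side has a difference: $(w_i - w_j)^\top \tilde{w}_k = w_i^\top \tilde{w}_k - w_j^\top \tilde{w}_k$. For F to convert a difference (additive) into a ratio (multiplicative), F must be an exponential:

$F = \exp$

So:

$\exp(w_i^\top \tilde{w}_k - w_j^\top \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$

This gives us:

$\frac{\exp(w_i^\top \tilde{w}_k)}{\exp(w_j^\top \tilde{w}_k)} = \frac{P_{ik}}{P_{jk}}$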

Step 4: Match individual terms

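For the ratio to work out term-by-term, numerator and denominator are matched separately, up to a common constant λ:

$\exp(w_i^\top \tilde{w}_k) = \lambda \cdot P_{ik}$

Taking the logarithm:

$w_i^\top \tilde{w}_k = \log P_{ik} + \log \lambda$
$w_i^\top \tilde{w}_k = \log X_{ik} - \log X_i + \log \lambda$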

Step 5: Absorb constants into biases

The term $\log X_i$ depends only on word i, not on context k. Absorb it (and $\log \lambda$) into a bias term $b_i$. For symmetry, add a context bias $\tilde{b}_k$:

$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$

This is the GloVe model: The dot product of word vector $w_i$ and context vector $\tilde{w}_k$, plus biases, should equal the log of the co-occurrence count. It's a log-bilinear regression model. The entire derivation started from "vectors should capture probability ratios" and arrived at "dot product should approximate log co-occurrence counts."
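
A quick numerical sanity check of this result, as a sketch under stated assumptions: the vectors and biases below are random toy values (not trained embeddings), and the counts are constructed so that they satisfy the model equation exactly. Under that assumption, count ratios are the exponential of a vector difference dotted with the context vector (plus a bias offset), and probability ratios differ from count ratios only by a factor that does not depend on the context word k.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                       # toy vocabulary size and dimension
W = rng.normal(size=(V, d))       # word vectors w_i
W_ctx = rng.normal(size=(V, d))   # context vectors w~_k
b = rng.normal(size=V)            # word biases b_i
b_ctx = rng.normal(size=V)        # context biases b~_k

# Counts that satisfy the model exactly: log X_ik = w_i . w~_k + b_i + b~_k
log_X = W @ W_ctx.T + b[:, None] + b_ctx[None, :]
X = np.exp(log_X)

i, j = 0, 1
count_ratio = X[i] / X[j]   # X_ik / X_jk for every context k
# The context bias cancels in the ratio, leaving the vector difference
from_vectors = np.exp((W[i] - W[j]) @ W_ctx.T + b[i] - b[j])
# P_ik / P_jk differs from X_ik / X_jk only by the k-independent factor X_j / X_i
prob_ratio = (X[i] / X[i].sum()) / (X[j] / X[j].sum())

print(np.allclose(count_ratio, from_vectors))
print(np.allclose(prob_ratio / count_ratio, X[j].sum() / X[i].sum()))
```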

Step 6: The least-squares objective

Train by minimizing the squared error between the model's prediction and the actual log co-occurrence:

$J = \sum_{i,j=1}^{V} f(X_{ij}) \, \big(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2$

where $f(X_{ij})$ is a weighting function (Chapter 4). The sum is over all non-zero entries of X, typically around 1 billion entries for a large corpus.

Worked example: one gradient step

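Suppose word i = "ice" and word j = "cold" with $X_{ij} = 50$. Our current parameters:

$w_{\text{ice}} = [0.3, -0.1]$,   $\tilde{w}_{\text{cold}} = [0.5, 0.2]$,   $b_{\text{ice}} = 1.0$,   $\tilde{b}_{\text{cold}} = 0.5$

Prediction:

$w_{\text{ice}}^\top \tilde{w}_{\text{cold}} + b_{\text{ice}} + \tilde{b}_{\text{cold}} = (0.15 - 0.02) + 1.0 + 0.5 = 1.63$

Target:

$\log X_{ij} = \log(50) = 3.912$

Error:

$\text{diff} = 1.63 - 3.912 = -2.282$

Weight:

$f(50) = (50/100)^{0.75} = 0.5^{0.75} = 0.595$

Weighted loss:

$L = 0.595 \times (-2.282)^2 = 0.595 \times 5.207 = 3.098$

Gradient for $w_{\text{ice}}$ (dropping the constant factor 2 that comes from differentiating the square, since it is absorbed into the learning rate):

$f(50) \times \text{diff} \times \tilde{w}_{\text{cold}} = 0.595 \times (-2.282) \times [0.5, 0.2] = [-0.679, -0.272]$

The gradient is negative in both components, so a gradient-descent step moves $w_{\text{ice}}$ in the direction of $\tilde{w}_{\text{cold}}$, increasing their dot product and bringing the prediction closer to log(50) = 3.912. This is exactly what we want: "ice" and "cold" co-occur frequently, so their dot product should be large.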
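
The arithmetic of this worked example can be checked with NumPy. This is a sketch: the learning rate `lr` is an assumed value chosen only to illustrate one update, and the results differ from the hand calculation in the third decimal place because the text rounds f(50) to 0.595 before multiplying.

```python
import numpy as np

w_ice = np.array([0.3, -0.1])
w_cold = np.array([0.5, 0.2])   # context vector for "cold"
b_ice, b_cold = 1.0, 0.5
x_ij, x_max, alpha = 50.0, 100.0, 0.75

prediction = w_ice @ w_cold + b_ice + b_cold
diff = prediction - np.log(x_ij)
weight = min((x_ij / x_max) ** alpha, 1.0)
loss = weight * diff ** 2
grad_w_ice = weight * diff * w_cold   # constant factor 2 omitted, as in the text

lr = 0.05   # assumed step size, for illustration only
w_ice_new = w_ice - lr * grad_w_ice
new_prediction = w_ice_new @ w_cold + b_ice + b_cold

print(prediction, diff, weight, loss)
print(grad_w_ice)
print(new_prediction)   # moves from 1.63 toward log(50)
```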

The GloVe Derivation Flow

Six steps from the ratio requirement to the final objective. The key: exponential converts additive (vector difference) to multiplicative (probability ratio).

python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, cooccur, f_weights):
    """
    W:       (V, d) — word vectors
    W_ctx:   (V, d) — context vectors
    b:       (V,)   — word biases
    b_ctx:   (V,)   — context biases
    cooccur: list of (i, j, X_ij) — nonzero entries
    f_weights: list of f(X_ij) — precomputed weights
    Returns: total loss
    """
    total_loss = 0.0
    for (i, j, x_ij), f_x in zip(cooccur, f_weights):
        # Model prediction
        prediction = W[i] @ W_ctx[j] + b[i] + b_ctx[j]

        # Target
        target = np.log(x_ij)

        # Weighted squared error
        total_loss += f_x * (prediction - target) ** 2

    return total_loss

Why the derivation works

Let's pause and appreciate the elegance of this derivation. We started with a vague requirement, "vectors should capture probability ratios", and ended with a concrete equation. The key mathematical insight is the homomorphism property: we needed a function F that maps addition (in the argument) to multiplication (in the output). The only nontrivial continuous functions with this property are exponentials, $F(x) = e^{cx}$, and the constant c can be absorbed into the scale of the vectors. Once you insist on F = exp, the entire model follows.

This is why GloVe produces vectors with linear analogy structure. The exponential function converts additive relationships in the vector space into multiplicative relationships in probability space. Semantic relationships are multiplicative (ratios), so they must be additive in the log-space of word vectors. King − man + woman ≈ queen works because "royalty" and "gender" act as roughly independent multiplicative factors in the co-occurrence statistics.

The symmetry argument

One subtlety in the derivation: the co-occurrence matrix X is symmetric (Xij = Xji for symmetric windows), but the model equation treats wi and w̃j differently. GloVe resolves this by noting that the model should work equally well with roles swapped. This means the word vectors and context vectors should be interchangeable — which is why summing W + W̃ as the final representation makes sense. It explicitly restores the symmetry that the parameterization breaks.

What the biases learn

The bias terms $b_i$ and $\tilde{b}_j$ absorb word-frequency information. After training, roughly and up to an additive constant:

$b_i \approx \log X_i \approx \log(\text{frequency of word } i)$

This separates frequency information from semantic information. The word vector $w_i$ captures what a word means; the bias $b_i$ captures how often it appears. Without biases, the word vectors would need to encode both, and the frequency signal would contaminate the semantic structure.

What is the GloVe model equation, and what does it say in plain English?

Chapter 4: The Weighting Function

The raw objective $J = \sum (w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$ has a problem: it treats all co-occurrence counts equally. But a pair that co-occurs 10,000 times should matter more than a pair that co-occurs twice, and a pair that co-occurs 1,000,000 times shouldn't dominate the objective.

The solution is the weighting function $f(X_{ij})$:

$f(x) = (x / x_{\max})^{\alpha}$     if $x < x_{\max}$
$f(x) = 1$                   if $x \ge x_{\max}$

The paper sets $x_{\max} = 100$ and $\alpha = 3/4$.

Why this specific function?

Three design requirements:

  1. f(0) = 0. If two words never co-occur, the entry shouldn't contribute to the loss at all. (We skip zero entries.) This is critical because the matrix is sparse — most entries are zero.
  2. f(x) should be non-decreasing. More frequent co-occurrences should get at least as much weight. A pair that co-occurs 100 times provides more reliable statistics than a pair that co-occurs 3 times.
  3. f(x) should not be too large for very high x. Ultra-frequent pairs like ("the", "of") shouldn't overwhelm the objective. The cap at f(x) = 1 prevents this.

The effect of α

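The exponent α controls how rapidly the weight grows with co-occurrence count. With α = 1 the weight grows linearly up to x_max. With α < 1 it grows sublinearly, so low-count pairs receive relatively more weight than they would under linear scaling. The paper reports that α = 3/4 gives a modest improvement over the linear version.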

Why 3/4 again? Both GloVe's weighting and Word2Vec's negative sampling noise distribution use the 3/4 exponent. This is not a coincidence. In both cases, the 3/4 power provides a sublinear scaling that prevents frequent items from dominating while still giving them more influence than rare items. The specific value 3/4 was found empirically in both papers.

[Interactive figure: Weighting Function f(x). The weighting function caps the influence of very frequent co-occurrences; the curve is shown for adjustable α (default 0.75) and x_max (default 100).]

python
import numpy as np

def f_weight(x, x_max=100, alpha=0.75):
    """GloVe weighting function."""
    if x >= x_max:
        return 1.0
    return (x / x_max) ** alpha

# Examples:
# f(1)   = (1/100)^0.75   = 0.032  — rare pair: very low weight
# f(10)  = (10/100)^0.75  = 0.178  — moderate: meaningful weight
# f(50)  = (50/100)^0.75  = 0.595  — common: substantial weight
# f(100) = 1.0            = 1.000  — capped: maximum weight
# f(10000) = 1.0          = 1.000  — still capped

Worked example: f(x) for different co-occurrence levels

Consider four word pairs with different co-occurrence counts:

| Word pair | $X_{ij}$ | $f(X_{ij})$ | $\log X_{ij}$ | Contribution to loss |
|---|---|---|---|---|
| "quantum" + "mechanics" | 3 | $(3/100)^{0.75} = 0.072$ | 1.099 | Tiny weight: rare pair, may be noisy |
| "cat" + "animal" | 25 | $(25/100)^{0.75} = 0.354$ | 3.219 | Moderate weight: reliable signal |
| "the" + "is" | 500 | 1.0 (capped) | 6.215 | Maximum weight but not overwhelming |
| "the" + "the" | 50,000 | 1.0 (capped) | 10.82 | Same weight as "the" + "is" despite being 100x more frequent |

If pairs were weighted in proportion to their raw counts, the ("the", "the") pair would outweigh ("quantum", "mechanics") by a factor of about 17,000 (50,000/3). The weighting function compresses this range to approximately 14:1 (1.0/0.072). This is still a large range, so frequent pairs matter more, but it is manageable.
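
The table values can be recomputed with a vectorized version of the weighting function. This is a sketch: the name `glove_weight` is an illustrative choice, and the extra α values (0.5 and 1.0) are included only to show how the exponent changes the weights for the same four counts.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Vectorized GloVe weighting: (x / x_max)^alpha below x_max, 1 at or above it."""
    x = np.asarray(x, dtype=float)
    return np.minimum((x / x_max) ** alpha, 1.0)

counts = np.array([3, 25, 500, 50_000])   # the four pairs from the table
weights = glove_weight(counts)

# Compare exponents: smaller alpha lifts the weight of rare pairs
for alpha in (0.5, 0.75, 1.0):
    print(alpha, np.round(glove_weight(counts, alpha=alpha), 3))

print(counts.max() / counts.min())     # spread of the raw counts
print(weights.max() / weights.min())   # spread after weighting
```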

What problem does the weighting function f(X_ij) solve in the GloVe objective?

Chapter 5: Training GloVe

GloVe's training is simpler than Word2Vec's. There is no neural network, no backpropagation through hidden layers, no softmax. It is a weighted least-squares regression optimized by gradient descent.

Parameters

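For each word i in the vocabulary there is a word vector $w_i$ and a context vector $\tilde{w}_i$ (each of dimension d), plus two scalar biases $b_i$ and $\tilde{b}_i$.

Total parameters: 2V(d + 1). For V = 400,000 and d = 300: about 240 million parameters.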

Training procedure

Step 1: Build X. Scan the corpus once to build the co-occurrence matrix. Store only nonzero entries (~1B for large corpora).
Step 2: Initialize. Random init for all w, w̃, b, b̃. Small values, uniform or normal distribution.
Step 3: Iterate. For each nonzero X_ij: compute the gradient of f(X_ij)(w_i · w̃_j + b_i + b̃_j − log X_ij)² and update with AdaGrad. Repeat for 50-100 epochs.
Step 4: Combine. Final vectors: W + W̃, the sum of word and context vectors (exploits the symmetry of X).

The gradient

For a single (i, j) entry:

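$\partial J / \partial w_i = 2 f(X_{ij}) \, \big(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big) \, \tilde{w}_j$
$\partial J / \partial \tilde{w}_j = 2 f(X_{ij}) \, \big(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big) \, w_i$
$\partial J / \partial b_i = \partial J / \partial \tilde{b}_j = 2 f(X_{ij}) \, \big(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)$

These are simple: weighted error times the other vector (for word/context vectors) or weighted error times 1 (for biases). No chain rule through nonlinearities. The constant factor 2 is usually folded into the learning rate, which is what the training code in this chapter does.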
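
One way to gain confidence in a hand-derived gradient is a finite-difference check. The sketch below is illustrative only: the helper `pair_loss`, the random toy vectors, and the count 30 are assumptions, not values from the paper. It compares the exact gradient of the single-pair loss with respect to the word vector (including the factor 2 from the square) against central differences.

```python
import numpy as np

def pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error for one (i, j) entry; also returns weight and error."""
    weight = min((x_ij / x_max) ** alpha, 1.0)
    diff = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return weight * diff ** 2, weight, diff

rng = np.random.default_rng(1)
w_i, w_j = rng.normal(size=4), rng.normal(size=4)
b_i, b_j, x_ij = 0.2, -0.1, 30.0

_, weight, diff = pair_loss(w_i, w_j, b_i, b_j, x_ij)
analytic = 2.0 * weight * diff * w_j   # exact gradient with respect to w_i

# Central differences, one coordinate of w_i at a time
eps = 1e-6
numeric = np.zeros_like(w_i)
for k in range(len(w_i)):
    step = np.zeros_like(w_i)
    step[k] = eps
    plus, _, _ = pair_loss(w_i + step, w_j, b_i, b_j, x_ij)
    minus, _, _ = pair_loss(w_i - step, w_j, b_i, b_j, x_ij)
    numeric[k] = (plus - minus) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```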

Why W + W̃?

GloVe uses two sets of vectors: word vectors W and context vectors W̃. The objective treats them symmetrically (since X is symmetric for symmetric windows), so the two sets differ mainly because of their random initializations. The paper reports that using the sum W + W̃ typically gives a small boost in performance over using either alone. Intuitively, combining two estimates of the same quantity reduces noise, much like a small ensemble.

Why AdaGrad?

GloVe uses AdaGrad (Adaptive Gradient) rather than standard SGD. AdaGrad maintains a per-parameter sum of squared gradients and divides the learning rate by the square root of this sum:

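$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t$

where $G_t = \sum_{\tau=1}^{t} g_\tau^2$ is the accumulated squared gradient and $g_t$ is the current gradient.

AdaGrad is well suited to GloVe because update frequency varies enormously across parameters. Vectors for frequent words receive many gradient updates, so their effective learning rate shrinks quickly, while vectors for rare words receive few updates and keep a larger step size.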
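
The per-parameter behavior of AdaGrad can be seen in a toy sketch. The function name `adagrad_step`, the learning rate, and the update counts (100 versus 5) are illustrative assumptions. Two scalar parameters receive the same gradient value, but one is updated far more often, as would happen for a frequent word compared with a rare one.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.05, eps=1e-8):
    """One AdaGrad update; returns the new parameter and the new accumulator."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / np.sqrt(accum + eps)
    return theta, accum

grad = np.array([1.0])
freq, freq_acc = np.array([0.0]), np.array([0.0])   # "frequent word" parameter
rare, rare_acc = np.array([0.0]), np.array([0.0])   # "rare word" parameter

for _ in range(100):
    freq, freq_acc = adagrad_step(freq, grad, freq_acc)
for _ in range(5):
    rare, rare_acc = adagrad_step(rare, grad, rare_acc)

# Effective step size each parameter would use on its next update
freq_step = 0.05 / np.sqrt(freq_acc + 1e-8)
rare_step = 0.05 / np.sqrt(rare_acc + 1e-8)
print(freq_step, rare_step)
```

The rarely updated parameter keeps a step size several times larger than the frequently updated one, without any manual per-word tuning.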

GloVe vs. Word2Vec training: Word2Vec trains online — each word updates the model as it's read from the corpus. GloVe is batch — it first counts everything, then optimizes. This means GloVe can be parallelized more easily (each nonzero entry is an independent training example) and converges more predictably. The downside: you need to store the co-occurrence matrix (tens of GB for large corpora).

Parallelism

GloVe's training is embarrassingly parallel. Each (i, j, Xij) entry produces an independent gradient. Multiple threads can process different entries simultaneously with minimal synchronization (just atomic updates to shared vectors). The original GloVe implementation achieved near-linear speedup with up to 32 threads.

Word2Vec also parallelizes well (using Hogwild-style asynchronous SGD on the text stream), but GloVe's independence structure is cleaner. This contributed to GloVe's slightly faster wall-clock training times on identical hardware.

Training pipeline end-to-end

A complete GloVe training run on 6B tokens:

| Stage | Time | Output |
|---|---|---|
| 1. Vocabulary construction | ~30 min | 400K words above min-count threshold |
| 2. Corpus scanning | ~2 hours | ~1B nonzero (i, j, X_ij) entries |
| 3. Shuffle entries | ~20 min | Randomized training order |
| 4. Train 50 iterations | ~3 hours (8 threads) | 400K × 300 word vectors |
| 5. Combine W + W_ctx | Seconds | Final vectors |
| Total | ~6 hours | glove.6B.300d.txt (1.04 GB) |

The co-occurrence matrix construction (stage 2) is the most memory-intensive step, requiring ~10 GB of RAM to store the sparse matrix. The training itself (stage 4) is CPU-intensive but memory-light — it only needs the vectors, biases, and AdaGrad accumulators in memory.

Convergence behavior

GloVe converges smoothly because the objective is a weighted least-squares problem — convex in each variable when others are fixed (biconvex overall). The loss curve is typically monotonically decreasing with occasional plateaus. Early iterations reduce loss rapidly; later iterations fine-tune the geometry. The paper found that 50-100 iterations suffice for most corpora, with diminishing returns beyond 100.

python
# Monitoring GloVe training convergence
import numpy as np

def evaluate_glove(vectors, analogy_test):
    """Evaluate word vectors on analogy task during training."""
    correct = 0
    total = 0
    for a, b, c, expected in analogy_test:
        if any(w not in vectors for w in [a, b, c, expected]):
            continue
        target = vectors[b] - vectors[a] + vectors[c]
        target /= np.linalg.norm(target) + 1e-10

        best_word, best_sim = None, -1
        for word, vec in vectors.items():
            if word in {a, b, c}:
                continue
            sim = np.dot(target, vec / (np.linalg.norm(vec) + 1e-10))
            if sim > best_sim:
                best_sim, best_word = sim, word

        if best_word == expected:
            correct += 1
        total += 1

    return correct / total if total > 0 else 0

# Typical convergence: accuracy vs. iteration
# Iter 1:   ~30% accuracy (random-ish vectors)
# Iter 10:  ~55% accuracy (structure emerging)
# Iter 25:  ~67% accuracy (most structure captured)
# Iter 50:  ~71% accuracy (fine-tuning)
# Iter 100: ~72% accuracy (diminishing returns)

[Interactive figure: GloVe Training: Loss Over Iterations. Gradient descent minimizes the weighted least-squares loss on a tiny toy corpus, and the loss is shown per epoch.]

python
import numpy as np

def train_glove(cooccur, V, d=50, epochs=100, lr=0.05, x_max=100, alpha=0.75):
    """
    cooccur: list of (i, j, X_ij) — nonzero entries
    Returns: word vectors W + W_ctx (sum of both)
    """
    # Initialize
    W = (np.random.rand(V, d) - 0.5) / d
    W_ctx = (np.random.rand(V, d) - 0.5) / d
    b = np.zeros(V)
    b_ctx = np.zeros(V)

    # AdaGrad accumulators
    W_sum = np.ones((V, d))  # init to 1 to avoid division by zero
    W_ctx_sum = np.ones((V, d))
    b_sum = np.ones(V)
    b_ctx_sum = np.ones(V)

    for epoch in range(epochs):
        total_loss = 0.0
        np.random.shuffle(cooccur)

        for i, j, x_ij in cooccur:
            # Weighting
            fw = min((x_ij / x_max) ** alpha, 1.0)

            # Error
            diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x_ij)
            loss = fw * diff ** 2
            total_loss += loss

            # Gradients
            grad_common = fw * diff
            grad_w = grad_common * W_ctx[j]
            grad_wc = grad_common * W[i]

            # AdaGrad update
            W_sum[i] += grad_w ** 2
            W[i] -= lr * grad_w / np.sqrt(W_sum[i])

            W_ctx_sum[j] += grad_wc ** 2
            W_ctx[j] -= lr * grad_wc / np.sqrt(W_ctx_sum[j])

            b_sum[i] += grad_common ** 2
            b[i] -= lr * grad_common / np.sqrt(b_sum[i])

            b_ctx_sum[j] += grad_common ** 2
            b_ctx[j] -= lr * grad_common / np.sqrt(b_ctx_sum[j])

    return W + W_ctx  # sum of word and context vectors

Complete PyTorch implementation

python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class GloVeDataset(Dataset):
    def __init__(self, cooccur_data, x_max=100, alpha=0.75):
        """cooccur_data: list of (i, j, X_ij) tuples"""
        self.data = cooccur_data
        self.x_max = x_max
        self.alpha = alpha

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        i, j, x = self.data[idx]
        # Compute weight
        weight = (x / self.x_max) ** self.alpha if x < self.x_max else 1.0
        return (torch.tensor(i), torch.tensor(j),
                torch.tensor(x, dtype=torch.float32),
                torch.tensor(weight, dtype=torch.float32))


class GloVe(nn.Module):
    def __init__(self, V, d=300):
        super().__init__()
        self.W = nn.Embedding(V, d)       # word vectors
        self.W_ctx = nn.Embedding(V, d)   # context vectors
        self.b = nn.Embedding(V, 1)       # word biases
        self.b_ctx = nn.Embedding(V, 1)  # context biases

        # Initialize
        for param in self.parameters():
            nn.init.uniform_(param, -0.5/d, 0.5/d)

    def forward(self, i, j, x, weight):
        """
        i, j:     (batch,) — word and context indices
        x:        (batch,) — co-occurrence counts
        weight:   (batch,) — f(X_ij) weights
        """
        w_i = self.W(i)           # (batch, d)
        w_j = self.W_ctx(j)      # (batch, d)
        b_i = self.b(i).squeeze(-1)      # (batch,); squeeze(-1) keeps the batch dim even when batch size is 1
        b_j = self.b_ctx(j).squeeze(-1)  # (batch,)

        # Prediction: w_i . w_j + b_i + b_j
        pred = (w_i * w_j).sum(dim=1) + b_i + b_j

        # Target: log(X_ij)
        target = torch.log(x)

        # Weighted least squares loss
        loss = weight * (pred - target) ** 2
        return loss.mean()

    def get_vectors(self):
        """Return W + W_ctx as final word vectors."""
        return (self.W.weight + self.W_ctx.weight).detach()

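# Training
# cooccur_data is assumed to be a precomputed list of (i, j, X_ij) tuples;
# the batch size is an illustrative choice, not a value from the paper.
dataset = GloVeDataset(cooccur_data, x_max=100, alpha=0.75)
dataloader = DataLoader(dataset, batch_size=4096, shuffle=True)

model = GloVe(V=400000, d=300)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

for epoch in range(100):
    for i, j, x, w in dataloader:
        loss = model(i, j, x, w)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

vectors = model.get_vectors()  # (400000, 300)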
Why does GloVe use the sum W + W̃ as the final word vectors instead of just W?

Chapter 6: The Word2Vec Connection

GloVe's paper made a provocative claim: Word2Vec (Skip-gram with negative sampling) is implicitly factorizing a co-occurrence matrix. The two methods are more similar than they appear.

What Skip-gram NEG implicitly optimizes

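Levy and Goldberg (2014) showed, in closely related work, that Skip-gram with negative sampling, when trained to convergence with sufficiently high dimensionality, satisfies:

$w_i^\top \tilde{w}_j = \text{PMI}(i, j) - \log k$

where k is the number of negative samples per positive pair and PMI is the pointwise mutual information:

$\text{PMI}(i, j) = \log \frac{P(i, j)}{P(i) \cdot P(j)} = \log \frac{X_{ij} \cdot |D|}{X_i \cdot X_j}$

And GloVe's model says:

$w_i^\top \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}$

These are related! If GloVe's biases absorb the $\log X_i$ and $\log X_j$ terms:

$b_i \approx \log X_i$,   $\tilde{b}_j \approx \log X_j$

Then:

$w_i^\top \tilde{w}_j = \log X_{ij} - \log X_i - \log X_j = \log \frac{X_{ij}}{X_i \cdot X_j} = \text{PMI}(i, j) - \log |D|$

which is PMI plus a constant.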

Both models learn dot products that approximate PMI (or log co-occurrence). The difference is in how they optimize:

| Property | Word2Vec (Skip-gram NEG) | GloVe |
|---|---|---|
| Training data | Raw text (online, streaming) | Co-occurrence matrix (precomputed) |
| Objective | Binary classification (pos vs neg) | Weighted least squares on log X |
| Implicit target | PMI − log k | log X_ij (with biases absorbing marginals) |
| Zero entries | Handled via negative sampling | Skipped (f(0) = 0) |
| Weighting | Sampling frequency (noise dist) | Explicit f(X_ij) function |
| Optimizer | SGD on streaming text | AdaGrad on shuffled matrix entries |

The unification: Both methods learn word vectors whose dot products approximate (shifted, weighted) log co-occurrence statistics. The "neural network" in Word2Vec is doing matrix factorization in disguise. The "global matrix" in GloVe is doing essentially the same prediction task, just from a different angle. This realization unified the field and showed that the count-vs-predict distinction was largely artificial.

Three key differences that still matter

Despite the mathematical similarity, some practical differences remain:

1. How they handle zero co-occurrences: This is perhaps the most important distinction. GloVe skips all entries where Xij = 0 (since log(0) is undefined). Word2Vec handles zeros implicitly through negative sampling — every negative sample is drawn from the noise distribution, which naturally represents the "background" of non-co-occurring words. GloVe never explicitly learns that two words don't co-occur; Word2Vec does, through its negative samples.

2. Online vs. batch: Word2Vec processes the corpus as a stream. It can train on corpora that don't fit in memory. GloVe must first compute the full co-occurrence matrix, which requires a pass over the corpus and ~10 GB of storage for large vocabularies. For very large, streaming datasets, Word2Vec is more practical.

3. Weighting scheme: GloVe's weighting function f(Xij) explicitly caps the influence of very frequent pairs. Word2Vec's analog is the subsampling of frequent words and the noise distribution exponent (3/4 power). Both achieve similar effects but through different mechanisms.

Worked example: the equivalence

Consider word i = "cat" and word j = "fluffy" in a corpus where:

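$X_{\text{cat},\text{fluffy}} = 200$,   $X_{\text{cat}} = 50{,}000$,   $X_{\text{fluffy}} = 10{,}000$,   $|D| = 10^9$

GloVe's target:

$\log X_{ij} = \log(200) = 5.30$
$w_{\text{cat}}^\top \tilde{w}_{\text{fluffy}} + b_{\text{cat}} + \tilde{b}_{\text{fluffy}} \approx 5.30$

Skip-gram NEG's implicit target (Levy & Goldberg), with k = 5 negative samples:

$\text{PMI}(\text{cat}, \text{fluffy}) = \log \frac{200 \times 10^9}{50{,}000 \times 10{,}000} = \log(400) = 5.99$
$w_{\text{cat}}^\top \tilde{w}_{\text{fluffy}} \approx \text{PMI} - \log k = 5.99 - \log(5) = 5.99 - 1.61 = 4.38$

The two targets are not directly comparable as dot products: GloVe's 5.30 is the target for the dot product plus both biases, and how it is split between them is learned. What matters is that the targets differ only by terms that depend on i alone, on j alone, or on neither (the marginal counts, log |D|, and log k), which is exactly what biases and constant shifts can absorb.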
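
The numbers in this worked example can be reproduced directly. This is a sketch: the variable names are illustrative, k = 5 negative samples matches the log(5) used above, and the last quantity assumes the idealized case where GloVe's biases equal the log marginal counts exactly.

```python
import math

x_ij, x_i, x_j, total = 200, 50_000, 10_000, 10**9
k = 5   # number of negative samples, matching log(5) in the example

glove_target = math.log(x_ij)                    # target for dot product + biases
pmi = math.log(x_ij * total / (x_i * x_j))       # pointwise mutual information
sgns_target = pmi - math.log(k)                  # shifted PMI

# Idealized case: if b_i = log X_i and b_j = log X_j exactly, GloVe's dot
# product alone equals PMI - log|D|, i.e. PMI shifted by a constant.
glove_dot_ideal = glove_target - math.log(x_i) - math.log(x_j)

print(round(glove_target, 2), round(pmi, 2), round(sgns_target, 2))
print(glove_dot_ideal, pmi - math.log(total))
```

The two values on the last line coincide, which is the sense in which both models target the same statistic up to constant shifts.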

Skip-gram ↔ GloVe: Same Geometry

Both methods produce vectors in the same region of vector space. The scatter plot shows GloVe vs. Skip-gram dot products for word pairs — they correlate strongly.

What does Skip-gram with negative sampling implicitly factorize?

Chapter 7: Results

GloVe was evaluated on three tasks: word analogies, word similarity, and named entity recognition (NER).

Word analogy task

Using the standard analogy test set (semantic + syntactic, ~19K questions):

| Model | Dim | Training data | Accuracy % |
|---|---|---|---|
| SVD (on X) | 300 | 6B | 7.3 |
| SVD (on log(1 + X)) | 300 | 6B | 60.1 |
| Word2Vec (CBOW) | 300 | 6B | 65.7 |
| Word2Vec (SG) | 300 | 6B | 69.1 |
| GloVe | 300 | 6B | 71.7 |
| GloVe | 300 | 42B | 75.0 |

These are the total (semantic + syntactic) accuracies reported in the paper. GloVe at 300 dimensions on 6B tokens outperforms Word2Vec Skip-gram trained on the same data by 2.6 points. On 42B tokens (Common Crawl), it reaches 75.0%.

Word similarity task

Spearman correlation between model's cosine similarities and human similarity judgments on standard datasets (WordSim-353, etc.). GloVe achieved 0.769 on WordSim-353, competitive with the best Word2Vec models.

Named Entity Recognition

Using word vectors as features for a CRF-based NER system on CoNLL-2003:

| Features | F1 Score |
|---|---|
| Discrete features only | 88.4 |
| + SVD vectors (d=50) | 89.3 |
| + Word2Vec (d=50) | 90.1 |
| + GloVe (d=50) | 90.5 |
| + GloVe (d=300) | 91.2 |

GloVe vectors as additional features improved NER F1 from 88.4 to 91.2 — a substantial gain from unsupervised pretraining.

Scaling behavior

The paper studied how performance scales with corpus size, vector dimension, window size, and training time:

GloVe vs. Word2Vec: Analogy Accuracy

Comparison across methods and training data sizes. GloVe consistently outperforms Word2Vec on the same data.

Training efficiency: GloVe on 6B words with d=300 converges in about 50 iterations over ~1B nonzero entries. Total training time: a few hours on 8 CPU cores. The co-occurrence matrix computation adds an upfront cost, but the loss is a sum of separate terms, one per nonzero (i, j) entry, so the optimization parallelizes well across threads. Entries that share a word still update the same parameters, so the updates are not strictly independent.

Dimension vs. accuracy: the plateau

The paper's experiments on Wikipedia+Gigaword 6B showed a clear pattern:

Dimension d | Analogy accuracy % | Training time (relative)
50 | 54.3 | 0.2x
100 | 64.0 | 0.4x
200 | 69.1 | 0.7x
300 | 71.7 | 1.0x
400 | 72.3 | 1.3x
500 | 72.5 | 1.7x

Going from 50 to 300 dimensions adds about 17 points (54.3% to 71.7%). Going from 300 to 500 adds only 0.8 points while training takes 1.7 times as long. The "sweet spot" is d = 300, which is one reason so many pre-trained word vectors use this dimension.
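The plateau is easier to see as a gain per 100 added dimensions. The snippet below is a small sketch that only re-uses the accuracy figures from the table above; the variable names are illustrative.

```python
# Analogy accuracy by vector dimension, copied from the table above
accuracy = {50: 54.3, 100: 64.0, 200: 69.1, 300: 71.7, 400: 72.3, 500: 72.5}

dims = sorted(accuracy)
gain_per_100 = {}
for lo, hi in zip(dims, dims[1:]):
    # Accuracy points gained per 100 extra dimensions on this segment
    gain_per_100[(lo, hi)] = (accuracy[hi] - accuracy[lo]) / (hi - lo) * 100
    print(f"{lo:>3} -> {hi:>3}: {gain_per_100[(lo, hi)]:5.1f} points per 100 dims")

total_50_300 = accuracy[300] - accuracy[50]
total_300_500 = accuracy[500] - accuracy[300]
print(f"50 -> 300 total: {total_50_300:.1f}, 300 -> 500 total: {total_300_500:.1f}")
```

The per-segment gain shrinks at every step, which is the diminishing-returns pattern described above.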

Window size: semantic vs. syntactic

An interesting finding: window size affects what the vectors learn. Smaller windows favor syntactic information, while larger windows favor semantic information.

This makes sense: a word's immediate neighbors are syntactically constrained (adjectives before nouns, determiners before adjectives), while distant words in the same sentence are thematically related. GloVe's default c = 10 favors semantic similarity, which is more useful for most downstream tasks.
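To make the role of the window concrete, here is a minimal sketch of co-occurrence counting with a configurable window and the 1/d distance weighting that GloVe uses. The function name, the symmetric treatment of left and right context, and the toy sentence are assumptions of this sketch, not the reference implementation.

```python
from collections import defaultdict

def cooccurrence(tokens, window=10, harmonic=True):
    """Symmetric word-word co-occurrence counts over a list of tokens.
    A pair at distance d contributes 1/d if harmonic is True, else 1."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            inc = 1.0 / d if harmonic else 1.0
            counts[(word, tokens[j])] += inc   # word sees right-hand neighbor
            counts[(tokens[j], word)] += inc   # and the neighbor sees the word
    return counts

tokens = "the fluffy cat sat on the warm mat".split()
narrow = cooccurrence(tokens, window=1)    # only adjacent words
wide = cooccurrence(tokens, window=10)     # the whole sentence

print(narrow[("fluffy", "cat")])           # adjacent pair: counted in both settings
print(narrow.get(("cat", "mat"), 0.0))     # 5 tokens apart: invisible to window=1
print(wide[("cat", "mat")])                # visible to window=10, down-weighted by 1/5
```

With the narrow window, only the adjective-noun style neighbors are counted; the wide window also links "cat" and "mat", which share a topic but not a syntactic relation.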

GloVe on different corpora

The paper tested GloVe on multiple corpora to study the effect of data quality and quantity:

Corpus | Tokens | Analogy Accuracy % | Notes
Wikipedia 2014 | 1.6B | 64.7 | Clean, encyclopedic text
Wikipedia + Gigaword 5 | 6B | 71.7 | Clean + newswire
Common Crawl (42B) | 42B | 75.0 | Noisy but massive
Common Crawl (840B) | 840B | (not listed) | Even noisier; used for released vectors

Two observations: (1) In these results, more data helps, even when the additional data is noisy web text. (2) Clean data is more efficient per token: Wikipedia's 1.6B tokens already give 64.7%, while going from 6B to 42B tokens by adding noisier web text raises accuracy by only 3.3 points. Quality matters, but quantity can compensate.

Sentence-level and document-level usage

While GloVe produces word-level vectors, they can be composed into sentence or document representations:

python
import numpy as np

# Simple document vector: average the vectors of all in-vocabulary words
def doc_vector(words, glove, dim=300):
    vecs = [glove[w] for w in words if w in glove]
    if len(vecs) == 0:
        return np.zeros(dim)  # no known words: fall back to the zero vector
    return np.mean(vecs, axis=0)

# TF-IDF weighted average (better for information retrieval)
def weighted_doc_vector(words, glove, idf_weights, dim=300):
    vecs, weights = [], []
    for w in words:
        if w in glove and w in idf_weights:
            vecs.append(glove[w])
            weights.append(idf_weights[w])
    if len(vecs) == 0 or sum(weights) == 0:
        return np.zeros(dim)  # np.average raises if the weights sum to zero
    return np.average(vecs, axis=0, weights=weights)

Simple averaging works surprisingly well for short texts (tweets, sentences). For longer documents, TF-IDF weighting or SIF (Smooth Inverse Frequency) weighting by Arora et al. (2017) gives significant improvements by down-weighting common words.
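For reference, SIF weighting can be sketched as follows: each word is weighted by a / (a + p(w)), where p(w) is its unigram probability, and the projection onto the first singular direction of the resulting document matrix is then removed. The function name, the value a = 1e-3, and the toy vectors and probabilities are assumptions of this sketch.

```python
import numpy as np

def sif_embeddings(docs, vectors, word_prob, a=1e-3):
    """docs: list of token lists; vectors: word -> np.ndarray;
    word_prob: word -> unigram probability p(w)."""
    dim = len(next(iter(vectors.values())))
    rows = []
    for words in docs:
        known = [w for w in words if w in vectors and w in word_prob]
        if not known:
            rows.append(np.zeros(dim))
            continue
        # Frequent words get small weights, rare words get weights close to 1
        weights = np.array([a / (a + word_prob[w]) for w in known])
        vecs = np.stack([vectors[w] for w in known])
        rows.append(weights @ vecs / len(known))
    emb = np.vstack(rows)
    # Remove the component shared by all documents (first right singular vector)
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    common = vt[0]
    return emb - np.outer(emb @ common, common)

# Toy data: random vectors stand in for pre-trained GloVe vectors
rng = np.random.default_rng(0)
vocab = ["the", "cat", "dog", "sat", "ran", "mat"]
vectors = {w: rng.normal(size=4) for w in vocab}
word_prob = {"the": 0.05, "cat": 1e-4, "dog": 1e-4, "sat": 2e-4, "ran": 2e-4, "mat": 5e-5}
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "mat"]]

sif = sif_embeddings(docs, vectors, word_prob)
print(sif.shape)
print(1e-3 / (1e-3 + word_prob["the"]), 1e-3 / (1e-3 + word_prob["cat"]))
```

The two printed weights show the down-weighting: "the" contributes about 2% of what a rare word like "cat" contributes.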

GloVe vs. Word2Vec: a detailed comparison

Having studied both methods in detail, here is a comprehensive comparison:

Aspect | Word2Vec (Skip-gram NEG) | GloVe
Objective | Binary classification (pos vs neg) | Weighted least squares on log X
Data format | Raw text corpus (streaming) | Co-occurrence matrix (precomputed)
Zero co-occurrences | Handled via negative sampling | Skipped (f(0) = 0)
Distance weighting | Implicit: the window size is sampled at each position, so closer words are used more often | 1/d harmonic weighting
Frequency handling | Subsampling + noise distribution ∝ f^0.75 | Weighting function f(X) = min((X/100)^0.75, 1)
Optimizer | SGD with linear LR decay | AdaGrad
Parallelism | Hogwild (async, lock-free) | Multi-threaded over matrix entries (one loss term per entry)
Memory | O(V · d): just embeddings | O(V · d + nnz): embeddings + sparse matrix
Final vectors | W_in (input embeddings) | W + W_ctx (sum of both)
Best accuracy (6B) | 66% (d=1000) | 72% (d=300)
Best accuracy (42B) | ~70% (estimated) | 75% (d=300)

In practice, the difference between GloVe and Word2Vec is small on most downstream tasks, and the choice often comes down to convenience.

What both methods get wrong

Both GloVe and Word2Vec share fundamental limitations that contextual embeddings (ELMo, BERT) would later address:

  1. Polysemy: "Bank" gets one vector, regardless of whether it means "river bank" or "financial bank." Each word type has exactly one representation.
  2. Compositionality: Neither method has a principled way to compose word vectors into phrase or sentence meanings. Simple averaging works for some tasks but fails for negation, conditionals, and complex syntax.
  3. Out-of-vocabulary: Words not in the training vocabulary have no representation. Misspellings, neologisms, and rare technical terms are invisible.
  4. Position insensitivity: When word vectors are averaged, "The dog bit the man" and "The man bit the dog" get identical representations (both contain the same words), even though they have opposite meanings.
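Two of these limitations, polysemy and position insensitivity, can be demonstrated directly with a lookup table. The sketch below uses a hypothetical toy table of random vectors in place of real pre-trained vectors; the effect does not depend on the values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy lookup table standing in for pre-trained static vectors
table = {w: rng.normal(size=8) for w in ["the", "dog", "bit", "man", "river", "bank", "loan"]}

def average(sentence):
    """Bag-of-words sentence vector: mean of the word vectors."""
    return np.mean([table[w] for w in sentence.lower().split()], axis=0)

# Position insensitivity: word order is lost in the average
a = average("The dog bit the man")
b = average("The man bit the dog")
print(np.allclose(a, b))          # True

# Polysemy: a static table returns the same vector in every context
bank_in_river_context = table["bank"]   # as in "river bank"
bank_in_loan_context = table["bank"]    # as in "bank loan"
print(np.array_equal(bank_in_river_context, bank_in_loan_context))  # True
```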

These limitations motivated the move to contextualized representations, where each token gets a different vector depending on the full input sequence. But GloVe's core insight — that ratios of co-occurrence probabilities encode meaning — remains foundational to understanding how all embedding methods work.

GloVe's influence on modern AI

GloVe's ideas have been absorbed into modern deep learning in subtle ways:

Practical recipe for choosing word embeddings (2024)

In modern NLP, the choice is usually straightforward:

Situation | Recommendation | Why
Full NLP pipeline | Use a transformer (BERT, LLaMA) | Contextual embeddings lead on most benchmarks
Simple baseline / prototype | Pre-trained GloVe 300d | Free, fast, no GPU needed, surprisingly competitive
Domain-specific (e.g., biomedical) | Train Word2Vec on domain corpus | Easy to train, captures domain terminology
Need OOV handling | FastText | Character n-grams handle any word
Multilingual | Multilingual BERT or FastText | Cross-lingual transfer
Edge deployment (no GPU) | Pre-trained GloVe 50d or 100d | Tiny memory footprint, CPU-friendly

GloVe and Word2Vec remain relevant as baselines, for education, and for resource-constrained settings. Their simplicity (just a lookup table of vectors) makes them deployable anywhere, including embedded devices, browsers, and mobile apps. A 50d GloVe table with a 400K-word vocabulary fits in 80 MB of float32 values; BERT-base, with about 110M parameters, needs roughly 440 MB. For many applications, the simpler model is good enough.

Evaluating word vectors: beyond analogies

The analogy task (king − man + woman = queen) became the standard evaluation, but it has limitations. The paper also evaluated on several other tasks:

Word similarity: Compute cosine similarity between word pairs and compare to human judgments using Spearman rank correlation. Datasets include WordSim-353, SimLex-999, and MEN. GloVe achieves 0.77 on WordSim-353 (human agreement is ~0.75).

Extrinsic evaluation: Use word vectors as features for a downstream task and measure task performance. The paper chose NER because it is well-understood and has standard benchmarks. The +2.8 F1 improvement from GloVe vectors (88.4 → 91.2) demonstrated that intrinsic quality (analogy accuracy) translates to extrinsic performance.

python
# Evaluating word vectors on similarity task
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, test_pairs):
    """
    test_pairs: list of (word1, word2, human_score) tuples
    Returns: Spearman correlation with human judgments
    """
    model_sims, human_sims = [], []
    for w1, w2, human in test_pairs:
        if w1 not in vectors or w2 not in vectors:
            continue
        v1, v2 = vectors[w1], vectors[w2]
        cos_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_sims.append(cos_sim)
        human_sims.append(human)

    rho, p_value = spearmanr(model_sims, human_sims)
    return rho

# Typical results:
# GloVe 300d on WordSim-353:  rho = 0.769
# Word2Vec 300d on WordSim-353: rho = 0.720
# Human agreement: rho ≈ 0.750

The co-occurrence matrix as a window into language

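Building and examining the co-occurrence matrix reveals several statistical properties of natural language: counts follow a heavy-tailed, power-law-like distribution, they span many orders of magnitude, and they vary strongly with word frequency.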

These statistical properties explain why GloVe's design choices work: the weighting function handles the power-law distribution, the log transform handles the vast range of counts, and the biases handle the variation in word frequencies. Every design decision in GloVe is motivated by the empirical statistics of natural language co-occurrence.
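The three design choices named here (weighting function, log transform, biases) all appear in the loss for a single matrix entry. The following is a minimal sketch with x_max = 100 and α = 0.75 as in the text; the function names and the sample counts are illustrative.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = min((x / x_max)^alpha, 1): zero for x = 0, capped at 1."""
    x = np.asarray(x, dtype=float)
    return np.minimum((x / x_max) ** alpha, 1.0)

def entry_loss(w_i, w_ctx_j, b_i, b_ctx_j, x_ij):
    """Weighted squared error between the model score and log X_ij."""
    err = w_i @ w_ctx_j + b_i + b_ctx_j - np.log(x_ij)
    return glove_weight(x_ij) * err ** 2

# Counts spanning six orders of magnitude
counts = np.array([0, 1, 10, 100, 10_000, 1_000_000])
print(glove_weight(counts))      # rises sublinearly, then stays at 1
print(np.log(counts[1:]))        # the log compresses the range of the targets

# A pair whose score already equals log X_ij contributes no loss
w_i, w_ctx_j = np.array([1.0, 0.0]), np.array([np.log(50.0), 0.0])
print(entry_loss(w_i, w_ctx_j, 0.0, 0.0, 50.0))
```

A count of one million gets the same weight as a count of one hundred, so very frequent pairs cannot dominate the objective, and a count of zero gets weight zero and can be skipped entirely.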

GloVe's theoretical contribution

Beyond the practical algorithm, GloVe made an important theoretical contribution to understanding word vectors. The paper argued that the right way to think about word meaning is through ratios of co-occurrence probabilities, not through raw probabilities or raw counts. This perspective has three consequences:

  1. Explains linear structure: Ratios are multiplicative. In log space, multiplicative relationships become additive. Additive relationships in the vector space = linear structure. This explains why king − man + woman ≈ queen: the log-ratio of co-occurrence probabilities is additive.
  2. Unifies count and predict: Both LSA (which uses counts) and Word2Vec (which predicts) are implicitly capturing the same ratios, just from different angles. GloVe's derivation shows that the ratio is the fundamental quantity; count-based and predictive methods are two ways to estimate it.
  3. Motivates the objective: Most previous work used ad hoc objectives (SVD on raw counts, cross-entropy on predictions). GloVe's objective is derived from the ratio requirement. This principled derivation is what makes GloVe's paper theoretically influential, even for practitioners who prefer Word2Vec in practice.

The paper's starting equation, F(w_i − w_j, w̃_k) = P_ik / P_jk, is among the best-known equations in the word-embedding literature. It answers a question that Word2Vec left open: why do word vectors have linear structure? GloVe's answer: because the underlying semantic signal (co-occurrence ratios) is inherently log-linear, and the training objective preserves this structure.
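The log-linear claim can be checked numerically. The sketch below builds counts that a GloVe-style model fits exactly, then verifies that the log of the probability ratio P_ik / P_jk equals the dot product (w_i − w_j) · w̃_k up to an offset that does not depend on the probe word k. All parameters are random toy values, and the variable names are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4
W = rng.normal(size=(V, d))        # word vectors w_i
W_ctx = rng.normal(size=(V, d))    # context vectors w~_k
b = rng.normal(size=V)             # word biases b_i
b_ctx = rng.normal(size=V)         # context biases b~_k

# Counts that satisfy the model exactly: log X_ik = w_i . w~_k + b_i + b~_k
log_X = W @ W_ctx.T + b[:, None] + b_ctx[None, :]
X = np.exp(log_X)
P = X / X.sum(axis=1, keepdims=True)     # P_ik = X_ik / X_i

i, j = 0, 1
log_ratio = np.log(P[i] / P[j])          # log(P_ik / P_jk) for every probe word k
linear = (W[i] - W[j]) @ W_ctx.T         # (w_i - w_j) . w~_k for every k
offset = log_ratio - linear

# The offset is the same for all k: it only involves b_i, b_j, X_i, and X_j
print(np.allclose(offset, offset[0]))
expected = (b[i] - b[j]) - (np.log(X[i].sum()) - np.log(X[j].sum()))
print(np.isclose(offset[0], expected))
```

Because the offset is constant in k, differences between log-ratios for two probe words are given by vector differences alone, which is the additive structure behind analogies.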

How does GloVe compare to Word2Vec on the same training data (6B words, 300 dimensions)?

Chapter 8: Practical Details

GloVe became one of the most widely used word embedding methods. Here are the practical details for training and using GloVe vectors.

Hyperparameters

Hyperparameter | Default | Effect
Dimension d | 300 | Higher is better up to ~300, then diminishing returns
Window size c | 10 | Larger for semantic tasks, smaller for syntactic
x_max | 100 | Cap for weighting function; higher = more weight to frequent pairs
α | 0.75 | Sublinear weighting exponent; 0.75 works well universally
Learning rate | 0.05 | With AdaGrad, which adapts per-parameter
Epochs | 50-100 | More for smaller corpora; 1 pass suffices for very large data
Min count | 5 | Words appearing fewer times are discarded
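To see where these defaults enter the computation, here is a toy training loop in NumPy. It is a sketch under stated assumptions, not the reference implementation: it uses a small dense count matrix, folds the constant factor 2 of the gradient into the learning rate, and starts the AdaGrad accumulators at 1 to avoid division by zero. The function name and the random toy matrix are hypothetical.

```python
import numpy as np

def train_glove(X, dim=8, x_max=100.0, alpha=0.75, lr=0.05, epochs=50, seed=0):
    """Fit GloVe-style vectors to a small dense count matrix X of shape (V, V)."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    # Small random initial vectors, zero biases
    W = (rng.random((V, dim)) - 0.5) / dim
    W_ctx = (rng.random((V, dim)) - 0.5) / dim
    b, b_ctx = np.zeros(V), np.zeros(V)
    # AdaGrad accumulators of squared gradients, one per parameter
    G_W, G_ctx = np.ones((V, dim)), np.ones((V, dim))
    G_b, G_bctx = np.ones(V), np.ones(V)

    rows, cols = np.nonzero(X)            # zero entries are skipped: f(0) = 0
    history = []
    for _ in range(epochs):
        total = 0.0
        for n in rng.permutation(len(rows)):   # shuffled pass over nonzero entries
            i, j = rows[n], cols[n]
            weight = min((X[i, j] / x_max) ** alpha, 1.0)
            err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
            total += weight * err ** 2
            g = weight * err                   # gradient factor shared by all four updates
            grad_w, grad_c = g * W_ctx[j], g * W[i]
            # AdaGrad step: lr scaled by 1 / sqrt(accumulated squared gradient)
            W[i] -= lr * grad_w / np.sqrt(G_W[i])
            W_ctx[j] -= lr * grad_c / np.sqrt(G_ctx[j])
            b[i] -= lr * g / np.sqrt(G_b[i])
            b_ctx[j] -= lr * g / np.sqrt(G_bctx[j])
            G_W[i] += grad_w ** 2
            G_ctx[j] += grad_c ** 2
            G_b[i] += g ** 2
            G_bctx[j] += g ** 2
        history.append(total / len(rows))
    return W + W_ctx, history              # final vectors: sum of word and context vectors

# Toy count matrix with values between 0 and 299
X = np.random.default_rng(1).integers(0, 300, size=(12, 12)).astype(float)
vectors, history = train_glove(X)
print(vectors.shape)
print(f"mean weighted squared error: {history[0]:.3f} -> {history[-1]:.3f}")
```

On real data the matrix is sparse and far too large for a dense array; the loop structure, however, stays the same: one update per nonzero entry, visited in shuffled order.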

Pre-trained vectors

Stanford released pre-trained GloVe vectors that became the standard initialization for NLP models from 2014-2018:

Dataset | Tokens | Vocab | Dims available
Wikipedia + Gigaword 5 | 6B | 400K | 50, 100, 200, 300
Common Crawl (uncased) | 42B | 1.9M | 300
Common Crawl (cased) | 840B | 2.2M | 300
Twitter | 27B | 1.2M | 25, 50, 100, 200

Using GloVe vectors in practice

python
import numpy as np

def load_glove(filepath, dim=300):
    """Load pre-trained GloVe vectors from a text file into a dict."""
    word2vec = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            # Split from the right so that a token which itself contains
            # a space still parses into (word, dim floats)
            parts = line.rstrip().rsplit(' ', dim)
            word = parts[0]
            vec = np.asarray(parts[1:], dtype=np.float32)
            word2vec[word] = vec
    return word2vec

# Usage
glove = load_glove('glove.6B.300d.txt')
print(glove['king'].shape)  # (300,)

# Cosine similarity
def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(glove['king'], glove['queen']))    # high: closely related words
print(cosine_sim(glove['king'], glove['banana']))   # low: unrelated words

# As PyTorch embedding initialization
import torch
import torch.nn as nn

# `vocab` (word -> row index) and V = len(vocab) come from your own dataset
embedding = nn.Embedding(V, 300)
with torch.no_grad():
    for word, idx in vocab.items():
        if word in glove:
            embedding.weight[idx] = torch.from_numpy(glove[word])
# Freeze or fine-tune depending on task and dataset size

GloVe vs. Word2Vec in practice: Despite GloVe's theoretical advantages, many practitioners found Word2Vec and GloVe performed similarly on downstream tasks when trained on comparable data. The choice often came down to convenience: GloVe's pre-trained vectors were more readily available and easier to load (simple text format), while Word2Vec, through the gensim library, was easier to train on custom data. Both were superseded by contextual embeddings (ELMo 2018, BERT 2018) for most tasks.

Fine-tuning vs. freezing

When using pre-trained GloVe vectors as initialization for a downstream task, you have two options:

Strategy | When to use | Pros | Cons
Freeze | Small dataset (<10K examples) | Prevents overfitting; preserves pretrained quality | Cannot adapt to domain-specific usage
Fine-tune | Large dataset (>100K examples) | Adapts vectors to your specific task | May overfit on small data; loses general knowledge

A common compromise: start with frozen embeddings for a few epochs (letting the rest of the model warm up), then unfreeze and fine-tune with a small learning rate.

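python
# Common pattern: freeze then fine-tune
import torch
import torch.nn as nn

# Placeholders from your own code: V (vocab size), vocab (word -> index),
# glove (word -> vector), model (which contains `embedding`),
# and train_epoch(model, optimizer)
embedding = nn.Embedding(V, 300)

# Load GloVe vectors
with torch.no_grad():
    for word, idx in vocab.items():
        if word in glove:
            embedding.weight[idx] = torch.as_tensor(glove[word], dtype=torch.float32)

# Phase 1: Freeze embeddings, train classifier layers
embedding.weight.requires_grad = False
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)  # assumed base LR
for epoch in range(5):
    train_epoch(model, optimizer)  # only classifier weights update

# Phase 2: Unfreeze, fine-tune everything with low LR
embedding.weight.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # 10x lower LR
for epoch in range(10):
    train_epoch(model, optimizer)  # all weights update, including embeddings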

GloVe file format

GloVe vectors are distributed as plain text files — one word per line, space-separated values. This is simpler than Word2Vec's binary format (which uses struct packing) but larger on disk.

text
# Format of glove.6B.300d.txt:
# Each line: word val1 val2 ... val300
the 0.418 0.249 -0.412 0.122 ... (300 floats)
, 0.013 0.189 -0.350 0.076 ...
. 0.152 0.304 -0.134 0.014 ...
of 0.370 0.210 -0.310 0.328 ...
# ... 400,000 lines total
# File size: 1.04 GB for 300d, 347 MB for 100d

Out-of-vocabulary (OOV) words

GloVe has no mechanism for words not in the pre-trained vocabulary. Common strategies for handling OOV words:
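A few fallbacks that are commonly used in practice can be sketched as follows: try a lowercased match first, and otherwise return a shared "unknown" vector (here the mean of all vectors) with a small amount of per-word noise. These heuristics and the function name are assumptions of this sketch rather than part of GloVe itself; subword models such as FastText address the problem more directly.

```python
import numpy as np

def make_lookup(glove, seed=0, noise=0.01):
    """Wrap a word -> vector dict with simple out-of-vocabulary fallbacks."""
    rng = np.random.default_rng(seed)
    dim = len(next(iter(glove.values())))
    mean_vec = np.mean(list(glove.values()), axis=0)   # shared "unknown" vector
    oov_cache = {}

    def lookup(word):
        # 1. Exact match, then a lowercased match
        for candidate in (word, word.lower()):
            if candidate in glove:
                return glove[candidate]
        # 2. Otherwise: mean vector plus small noise, cached so that
        #    the same unknown word always maps to the same vector
        if word not in oov_cache:
            oov_cache[word] = mean_vec + rng.normal(scale=noise, size=dim)
        return oov_cache[word]

    return lookup

# Toy table standing in for the pre-trained vectors
toy = {"king": np.array([1.0, 0.0]), "queen": np.array([0.0, 1.0])}
lookup = make_lookup(toy)
print(lookup("King"))     # found via lowercasing
print(lookup("blorp"))    # unknown word: close to the mean vector
```

Whether the noise helps depends on the downstream model; if the embeddings are fine-tuned, distinct starting points let unknown words drift apart during training.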

Memory requirements

Loading GloVe vectors requires substantial memory:

Memory = V × d × 4 bytes (float32)
glove.6B.300d: 400K × 300 × 4 = 480 MB
glove.840B.300d: 2.2M × 300 × 4 = 2.64 GB

For deployment in memory-constrained environments, techniques like quantization (float16 or int8) or dimensionality reduction (PCA from 300d to 100d) are common.
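The footprint formula and one form of quantization can be sketched in a few lines. The int8 scheme below (one scale per row, symmetric around zero) is one common choice and an assumption of this sketch; the function names are illustrative, and a random matrix stands in for real vectors.

```python
import numpy as np

def footprint_mb(vocab_size, dim, bytes_per_value=4):
    """Memory for a dense embedding table, in MB (10^6 bytes)."""
    return vocab_size * dim * bytes_per_value / 1e6

print(footprint_mb(400_000, 300))        # 480.0  (float32, glove.6B.300d)
print(footprint_mb(2_200_000, 300))      # 2640.0 (float32, glove.840B.300d)
print(footprint_mb(400_000, 300, 2))     # 240.0  (float16)
print(footprint_mb(400_000, 300, 1))     # 120.0  (int8, plus one scale per row)

def quantize_int8(matrix):
    """Per-row symmetric quantization: matrix ≈ codes * scales."""
    scales = np.abs(matrix).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                        # guard against all-zero rows
    codes = np.round(matrix / scales).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize(codes, scales):
    return codes.astype(np.float32) * scales

# Random stand-in for an embedding table
M = np.random.default_rng(0).normal(size=(1000, 300)).astype(np.float32)
codes, scales = quantize_int8(M)
restored = dequantize(codes, scales)
print(codes.dtype, np.abs(restored - M).max())
```

The reconstruction error per value is bounded by half a quantization step, which is usually small relative to the differences that matter for cosine similarity, though this should be checked on the task at hand.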

Pre-trained GloVe: Nearest Neighbors

Type a word to see its nearest neighbors in GloVe space (simulated with common examples). Click preset words to explore.
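Outside the interactive demo, a nearest-neighbor lookup is a matrix product over length-normalized vectors. The sketch below uses a tiny hand-made table so that it runs without downloading anything; the function names and the toy values are hypothetical, and with real data the dict returned by a GloVe loader would take the place of `toy`.

```python
import numpy as np

def build_index(vectors):
    """Stack the vectors into a row-normalized matrix for fast cosine search."""
    words = list(vectors)
    mat = np.stack([vectors[w] for w in words]).astype(np.float32)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)   # assumes no all-zero vectors
    return words, mat

def nearest(word, words, mat, k=5):
    """Return the k words with the highest cosine similarity to `word`."""
    query = mat[words.index(word)]
    sims = mat @ query                   # cosine similarity, since rows have unit length
    order = np.argsort(-sims)            # indices from most to least similar
    return [(words[i], float(sims[i])) for i in order if words[i] != word][:k]

# Hand-made toy vectors: two "royalty" words and two "fruit" words
toy = {
    "king":   np.array([1.0, 0.9, 0.0]),
    "queen":  np.array([0.9, 1.0, 0.0]),
    "banana": np.array([0.0, 0.1, 1.0]),
    "apple":  np.array([0.1, 0.0, 1.0]),
}
words, mat = build_index(toy)
for neighbor, score in nearest("king", words, mat, k=3):
    print(f"{neighbor:>7}  {score:.3f}")
```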

What ultimately replaced GloVe and Word2Vec as the standard word representation in NLP?

Chapter 9: Connections

What GloVe built on

Latent Semantic Analysis (Deerwester et al., 1990): SVD on a term-document matrix. GloVe improved on LSA by using a word-word matrix instead of word-document, log co-occurrence instead of raw counts, and a weighted objective instead of unweighted SVD.

Word2Vec (Mikolov et al., 2013): The direct predecessor and competitor. GloVe explicitly positions itself as a synthesis of Word2Vec's strengths (local context prediction, linear structure) and LSA's strengths (global statistics). See our Word2Vec veanor and negative sampling veanor.

HAL (Hyperspace Analogue to Language, Lund & Burgess, 1996): An early word-word co-occurrence approach. GloVe's matrix is similar but uses harmonic weighting and a principled objective.

Why GloVe beats SVD on X

A natural question: if GloVe trains on the co-occurrence matrix X, why not just do SVD directly on X (or log X)? The paper tested this and found several reasons GloVe wins:

  1. Weighting. SVD treats all entries equally. GloVe's f(X_ij) gives appropriate weight to each entry. Very frequent pairs don't dominate; rare pairs don't add noise.
  2. Log transform. SVD on raw X performs poorly (36.7% accuracy). SVD on log X is much better (54.6%). GloVe naturally operates in log space (the target is log X_ij).
  3. Zero entries. SVD fits the full V × V matrix, zeros included. GloVe skips zeros (f(0) = 0). This is computationally cheaper, and it avoids fitting log 0, which is undefined and can only be used as a target after an ad hoc shift that introduces bias.
  4. Biases. GloVe's bias terms b_i, b̃_j absorb word frequency, separating it from semantic content. SVD has no analog, so frequency contaminates the singular vectors.

python
# Comparison: SVD on log(X) vs. GloVe
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# X: sparse (V, V) co-occurrence matrix, assumed to be built beforehand
# SVD approach: log(1 + x) maps zeros to zero, so the matrix stays sparse
X_log = csr_matrix(X, dtype=np.float64, copy=True)
X_log.data = np.log1p(X_log.data)
U, S, Vt = svds(X_log, k=300)                 # truncated SVD
svd_vectors = U * np.sqrt(S)                  # (V, 300)

# SVD accuracy: ~55% on analogies
# GloVe accuracy: ~72% on analogies
# Why the gap? Weighting, biases, and skip-zero optimization

What GloVe enabled

The embedding revolution (2014-2018): GloVe's pre-trained 300d vectors became the default initialization for nearly every NLP model: sentiment analysis, NER, relation extraction, question answering. The idea of "transfer learning from unsupervised pretraining" was proven by GloVe and Word2Vec before it was scaled up by ELMo and BERT.

Levy & Goldberg (2014): Showed that Skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, which places Word2Vec in the same matrix-factorization family as GloVe. This narrowed the count-versus-predict debate and led to a deeper understanding of what makes word embeddings work.

FastText (2017), ELMo (2018), BERT (2018): Each extended the word embedding paradigm in its own direction. FastText added subwords. ELMo added context-dependence. BERT added bidirectional context with transformers. All stood on the foundation that GloVe and Word2Vec established.

The evolution of word representations

Era | Method | Representation | Key property
1990s | LSA (SVD on counts) | Static, count-based | Global statistics, poor analogies
2013 | Word2Vec | Static, prediction-based | Linear analogies, local context
2014 | GloVe | Static, hybrid | Global + local, best analogies
2017 | FastText | Static, subword-based | OOV handling, morphology
2018 | ELMo | Contextual (LSTM) | Different vectors per context
2018 | BERT | Contextual (Transformer) | Bidirectional, fine-tunable
2020+ | GPT-3/4, LLaMA | Contextual (large Transformer) | In-context learning, emergent abilities

Each step built on the previous. GloVe showed that the right objective matters more than the algorithm. BERT showed that context matters more than static vectors. GPT-4 showed that scale matters more than architecture. But the foundational insight — that useful representations emerge from prediction tasks on raw text — traces directly back to Word2Vec and GloVe.

GloVe's influence on loss design

GloVe's weighted least-squares objective influenced later work in surprising ways:

The full GloVe pipeline in code

bash
# The official GloVe training pipeline (Stanford release)
# Download from https://nlp.stanford.edu/projects/glove/

# Step 1: Build vocabulary
./vocab_count -min-count 5 < corpus.txt > vocab.txt
# Output: 400,000 words above threshold

# Step 2: Build co-occurrence matrix
./cooccur -window-size 10 -vocab-file vocab.txt < corpus.txt > cooccur.bin
# Output: ~1B nonzero entries, ~10 GB binary file

# Step 3: Shuffle (for SGD convergence)
./shuffle < cooccur.bin > cooccur.shuf.bin

# Step 4: Train
./glove -vector-size 300 -threads 8 -iter 50 \
  -eta 0.05 -alpha 0.75 -x-max 100 \
  -input-file cooccur.shuf.bin -vocab-file vocab.txt \
  -save-file vectors
# Output: vectors.txt (word + 300 floats per line)

python
# Using pre-trained GloVe with numpy (no other dependencies)
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

# Common operations
glove = load_glove('glove.6B.300d.txt')

# Cosine similarity
def sim(a, b):
    return glove[a] @ glove[b] / (np.linalg.norm(glove[a]) * np.linalg.norm(glove[b]))

# Exact values depend on which vector file is loaded
print(f"king-queen: {sim('king','queen'):.3f}")  # high: related words
print(f"cat-dog: {sim('cat','dog'):.3f}")        # high: related words

# Analogy: king - man + woman = ?
target = glove['king'] - glove['man'] + glove['woman']
target /= np.linalg.norm(target)
# Search all words except the three query words for the highest cosine similarity
best = max(
    ((w, target @ v / np.linalg.norm(v)) for w, v in glove.items()
     if w not in {'king', 'man', 'woman'}),
    key=lambda x: x[1]
)
print(f"king - man + woman = {best[0]} ({best[1]:.3f})")
# Expected nearest word: queen

Known limitations of GloVe

Cheat sheet

Core idea
Train on the co-occurrence matrix X so that w_i · w̃_j + b_i + b̃_j ≈ log X_ij
Key insight
Meaning is in the ratio P(k|ice)/P(k|steam), not the raw probabilities. Ratios → exponentials → log-bilinear model
Weighting
f(x) = min((x/x_max)^0.75, 1). Caps frequent pairs, ignores zero entries
Final vectors
W + W_ctx (sum of word and context vectors). 300d, trained by AdaGrad
Impact
75% analogy accuracy (300d, 42B tokens). Standard NLP initialization 2014-2018. Unified count-vs-predict debate
What is GloVe's core mathematical insight about how word meaning is encoded in co-occurrence statistics?