A Contextual Introduction — tracing the paradigm shift from static word vectors (Word2Vec, GloVe) to context-dependent representations (ELMo, GPT, BERT). Why one vector per word isn't enough.
Consider the word "bank." In English, it has at least these meanings:
| Sentence | Meaning of "bank" |
|---|---|
| "I deposited money at the bank." | Financial institution |
| "She sat on the river bank." | Edge of a river |
| "The pilot had to bank the plane." | Tilt during a turn |
| "Don't bank on it." | Rely on / count on |
Same spelling, same pronunciation, completely different meanings. This phenomenon is called polysemy: a single word form carrying multiple meanings. (Strictly speaking, linguists reserve "polysemy" for related senses and call unrelated ones homonymy, but for representation learning both pose the same problem.) Polysemy is not rare; it's the norm. The average English word has 2-3 dictionary senses. Common words have far more: "run" has over 600 senses in the OED.
Humans handle polysemy effortlessly. When you read "I deposited money at the bank," you instantly know "bank" means a financial institution — you don't even consider the river meaning. How? Because the surrounding words ("deposited," "money") activate the financial sense and suppress the others. Your brain constructs the meaning of "bank" in context, not in isolation.
But computers don't have this ability by default. Traditional NLP systems represent each word as a fixed vector or a dictionary entry. They can look up "bank" and find a list of possible meanings, but they can't automatically choose the right one for a given sentence without additional disambiguation logic. The contextual representation revolution — the subject of Smith's paper — gives computers the same ability humans have: the meaning of a word is computed dynamically from its context.
Now here's the problem for NLP: if you represent each word as a single fixed vector (as Word2Vec and GloVe do), where does "bank" go in vector space? Near "money" and "finance"? Near "river" and "shore"? Near "tilt" and "angle"? It can't be near all of them simultaneously — those regions of vector space are far apart. The static embedding must compromise, placing "bank" somewhere in between, equally bad at representing all its senses.
This isn't just a theoretical concern. Consider building a question-answering system. A user asks: "What did the pilot do at the bank?" With static embeddings, "bank" has one representation that blends its financial, river, and aviation senses. The system might retrieve answers about banking transactions when the user meant a river bank, because the vector for "bank" is the same in both cases. The word alone isn't enough; you need the sentence.
The scale of the problem is staggering. Zipf's meaning-frequency law observes that more frequent words tend to have more senses. The top 100 most common English words average 10+ senses each. Words like "run," "set," "get," "take," and "make" each have dozens of distinct meanings. These are the words NLP systems encounter most often, and they're exactly the words that static embeddings handle worst.
[Interactive figure: in a static embedding space, "bank" gets one fixed position (orange dot) that compromises between its multiple meanings; a contextual model gives "bank" a different position for each example sentence.]
Smith's paper surveys the field's progression from static to contextual representations — a shift that he argues is "the most significant empirical advance in NLP in the past decade." This isn't just an incremental improvement. It's a fundamental change in how we think about meaning: meaning is not a property of words, but of words in context.
The paper is particularly valuable because Smith writes as a linguist-turned-ML-researcher. He doesn't just describe the models — he explains why the shift was necessary, grounding it in linguistic theory. The distributional hypothesis ("you shall know a word by the company it keeps" — Firth, 1957) underlies both static and contextual embeddings. But static embeddings implement it incompletely: they capture what company a word typically keeps, while contextual embeddings capture what company it keeps right now.
The progression from Word2Vec (2013) through ELMo (2018) to BERT (2018) happened in just five years — a remarkably fast paradigm shift for a field. Each step was motivated by a clear limitation of the previous approach, and each step produced immediate, measurable improvements on downstream tasks.
Let's trace this progression, starting from the static world.
Before we can appreciate contextual representations, we need to understand what they replaced. The static embedding era (2013-2018) was built on a deceptively simple idea: words that appear in similar contexts should have similar meanings. This is the distributional hypothesis (Harris 1954, Firth 1957).
Word2Vec learns a d-dimensional vector for each word by training a shallow neural network to predict context words from target words (Skip-gram) or target words from context (CBOW). The key insight: the hidden weights of this network, after training, are the word embeddings.
After training on billions of words, the resulting vectors exhibit remarkable algebraic properties:
$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$
This means the vector space encodes semantic relationships as directions: the "gender" direction (man→woman) can be applied to "king" to get "queen." Similar arithmetic works for country-capital, tense, plural, and many other relationships.
Let's be precise about how this works. The "gender" direction in the vector space is approximately $v_{\text{woman}} - v_{\text{man}}$. When you add this direction to $v_{\text{king}}$, you move from the "male ruler" region to the "female ruler" region, arriving near $v_{\text{queen}}$. The fact that such regular semantic structures emerge from a simple prediction task, without any explicit semantic supervision, was the breakthrough that launched the embedding era.
```python
# Word analogies: king - man + woman ≈ queen
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors.bin', binary=True)

# Analogy: king : man :: ? : woman
result = wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
# [('queen', 0.71), ('monarch', 0.62), ('princess', 0.59)]

# This works because relationships are encoded as DIRECTIONS:
#   v_king - v_man     ≈ v_queen - v_woman   (gender direction)
#   v_paris - v_france ≈ v_tokyo - v_japan   (capital direction)
#   v_bigger - v_big   ≈ v_faster - v_fast   (comparative direction)
```
GloVe approaches the same goal from a different angle: instead of prediction, it directly factorizes the global word co-occurrence matrix. If words i and j co-occur frequently, their dot product should be high:
$$w_i^\top \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}$$
Where $X_{ij}$ counts how often word i appears near word j in the training corpus, $w_i$ and $\tilde{w}_j$ are the word and context vectors, and $b_i$, $\tilde{b}_j$ are bias terms. GloVe fits this relation with a weighted least-squares loss over the nonzero entries of X. The objective is not convex in the vectors (it is bilinear in $w_i$ and $\tilde{w}_j$), but in practice it produces embeddings of similar quality to Word2Vec.
The co-occurrence matrix X is typically built from a large corpus (Wikipedia + Gigaword, ~6 billion tokens). Each entry $X_{ij}$ counts how many times word i appears within a context window (usually 10 words) of word j. Words that frequently co-occur (like "ice" and "cream") get high values. The GloVe training procedure then finds vectors whose dot products approximate the log of these counts.
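To make the construction of X concrete, here is a minimal sketch of window-based co-occurrence counting on a toy corpus. The function name `cooccurrence_counts` and the two example sentences are illustrative assumptions, and the counts are unweighted, whereas GloVe itself down-weights more distant pairs.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered pair (word, neighbor) occurs
    within `window` positions of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical two-sentence corpus, just to exercise the function
toy_corpus = [
    "i like ice cream".split(),
    "ice cream is cold".split(),
]
X = cooccurrence_counts(toy_corpus, window=1)
print(X[("ice", "cream")])  # adjacent in both sentences
print(X[("ice", "cold")])   # never within one position of each other
```

A sparse mapping keyed by word pairs is used here because most pairs never co-occur; a dense vocabulary-by-vocabulary matrix would be almost entirely zeros.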
A concrete example helps build intuition. The word "France" co-occurs frequently with "Paris," "wine," "Eiffel," and "language." The word "Japan" co-occurs frequently with "Tokyo," "sushi," "emperor," and "language." Because "France" and "Japan" share many co-occurrence patterns (both co-occur with "language," "culture," "country," etc.), their vectors end up close together in the embedding space — both in the "country" region. But because they also have distinct co-occurrence patterns (France with "wine," Japan with "sushi"), they're close but not identical.
There were attempts to address polysemy within the static embedding framework. Multi-sense embeddings (Reisinger and Mooney, 2010; Neelakantan et al., 2014) learned multiple vectors per word — one for each sense. But this required pre-specifying the number of senses (how many senses does "run" have? 3? 10? 600?) and didn't capture the smooth continuum of meaning that words exhibit in practice. A word's meaning doesn't switch discretely between a fixed set of senses — it varies continuously with context.
Another limitation of static embeddings that Smith highlights is the out-of-vocabulary (OOV) problem. Static embeddings have a fixed vocabulary learned during training. Any word not in the vocabulary (misspellings, rare technical terms, neologisms) gets no representation at all. Contextual models handle this through subword tokenization — breaking unknown words into known pieces — and through the contextual nature of the representations themselves.
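The idea of breaking an unknown word into known pieces can be sketched with a greedy longest-match splitter. This follows the WordPiece convention of marking non-initial pieces with `##`; the function name and the tiny vocabulary are hypothetical, and real tokenizers (WordPiece, BPE) learn their vocabularies from data rather than using a hand-written set.

```python
def greedy_subword_split(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, left to right.
    Non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary for illustration only
toy_vocab = {"un", "##believ", "##able", "bank", "##s"}
print(greedy_subword_split("unbelievable", toy_vocab))
print(greedy_subword_split("banks", toy_vocab))
```

Because every piece is in the vocabulary, the model always has an embedding to start from, even for a word it never saw whole during training.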
Smith identifies three fundamental limitations of static embeddings that motivated the move to contextual representations. Each limitation corresponds to an aspect of meaning that static embeddings fundamentally cannot capture:
| Limitation | Example | Consequence |
|---|---|---|
| No polysemy handling | "bank" gets one vector regardless of context | Ambiguous words are poorly represented |
| No compositionality | "hot dog" ≠ $v_{\text{hot}} + v_{\text{dog}}$ | Multi-word expressions lose meaning |
| No syntactic sensitivity | "dog bites man" vs "man bites dog" have same word vectors | Word order information is lost |
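The word-order row of the table can be checked directly. The sketch below averages static vectors for two sentences that contain the same words in a different order; the 2-dimensional `toy_embeddings` are made-up values, and averaging is only one common way of pooling static vectors into a sentence representation.

```python
def average_vector(tokens, embeddings):
    """Mean of the static vectors of `tokens` (a bag-of-vectors sentence representation)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        for k, value in enumerate(embeddings[tok]):
            total[k] += value
    return [x / len(tokens) for x in total]

# Hypothetical 2-d static embeddings
toy_embeddings = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

a = average_vector("dog bites man".split(), toy_embeddings)
b = average_vector("man bites dog".split(), toy_embeddings)
print(a == b)  # the two sentences are indistinguishable after pooling
```

Any order-insensitive pooling (sum, mean, max) has this property, which is why static embeddings need a separate sequence model on top to recover word order.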
[Interactive figure: a simplified 2D embedding space in which each word is a fixed point. "bank" sits uncomfortably between the financial and nature clusters; a slider shows its distance to each cluster.]
```python
# Static embedding: "bank" gets ONE vector
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-300")

# Same vector regardless of context
bank_vec = model["bank"]  # shape: (300,), always the same

# Check similarity to different meanings
print(model.similarity("bank", "money"))  # 0.43: some financial
print(model.similarity("bank", "river"))  # 0.36: some nature
print(model.similarity("bank", "tilt"))   # 0.12: almost nothing

# The static embedding compromises: moderate similarity to
# financial terms, moderate to nature, low to aviation
```
Despite these limitations, static embeddings were transformative. Before Word2Vec, many NLP systems represented words as one-hot encodings: 50,000-dimensional sparse vectors with no notion of similarity. Static embeddings compressed this to 300 dense dimensions where similarity was meaningful. They just couldn't handle the fact that meaning depends on context.
To appreciate the magnitude of this shift: before Word2Vec, the standard approach to representing "cat" and "dog" gave them zero similarity (different one-hot indices). After Word2Vec, "cat" and "dog" had cosine similarity of ~0.76 — correctly reflecting their semantic relatedness. This was revolutionary for every NLP task: sentiment analysis, machine translation, question answering, and more.
The specific training procedure for Word2Vec Skip-gram works as follows: for each word in the training corpus, take a window of surrounding words (typically 5 on each side). The model learns to predict these context words from the target word. After training on billions of word pairs, the hidden weights of the prediction network become the word vectors.
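```python
# Word2Vec: training mechanics
# Input: "The cat sat on the mat"
# For target "cat" with window=2:
#   Training pairs: (cat, The), (cat, sat), (cat, on)
# With negative sampling, the model scores a pair as:
#   P(pair is real | context, target) = sigmoid(v_context · v_target)
import gensim

# `sentences` is any iterable of token lists; a one-sentence toy corpus is used here
sentences = [["the", "cat", "sat", "on", "the", "mat"]]
# min_count=1 keeps every word of the toy corpus (the default of 5 would drop them all)
model = gensim.models.Word2Vec(sentences, vector_size=300, window=5, min_count=1)

# Trained on a large corpus, the resulting vectors capture semantic relationships:
# model.wv.most_similar("king", topn=5)
# → [("queen", 0.71), ("prince", 0.68), ("monarch", 0.66), ...]

# But ALL occurrences of "bank" contribute to ONE vector:
# "money bank" + "river bank" + "bank shot" → compromised average
```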
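The pair-generation step described above can be written out in a few lines. This is a simplified sketch with a fixed window; `skipgram_pairs` is a hypothetical helper, not part of gensim, and real Word2Vec implementations also subsample frequent words and vary the effective window size.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and each neighbor
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "The cat sat on the mat".split()
cat_pairs = [p for p in skipgram_pairs(tokens, window=2) if p[0] == "cat"]
print(cat_pairs)  # [('cat', 'The'), ('cat', 'sat'), ('cat', 'on')]
```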
In February 2018, Peters et al. introduced ELMo (Embeddings from Language Models), the model that brought contextual word representations into the mainstream. The idea was elegant: instead of learning one vector per word type, learn a function that produces a vector for each word token, a different vector depending on the surrounding sentence.
The core insight was hiding in plain sight: language models already produce contextual representations. An LSTM language model, as it processes a sentence word by word, builds up a hidden state that depends on all previous words. This hidden state is already a context-dependent representation of the current position. Why not use it as a word embedding?
The answer to "why didn't anyone think of this sooner?" is partly computational. In 2013-2017, language models were small and trained on limited data. Their hidden states weren't rich enough to be useful as general-purpose embeddings. By 2018, language models had grown large enough (about 94M parameters, trained on 1 billion words) that their internal representations had become genuinely useful, encoding syntax, semantics, and world knowledge.
The name ELMo — Embeddings from Language Models — captures the key idea: use a pre-trained language model as an embedding function. The language model is trained to predict the next word in a sequence (the same objective used in GPT, but with an LSTM instead of a Transformer). After training, the model's internal hidden states become the contextual embeddings.
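ELMo uses a two-layer bidirectional LSTM trained as a language model. The forward language model predicts each token from the tokens before it, and the backward language model predicts each token from the tokens after it.
The training objective is the joint log-likelihood of both directions:
$$\sum_{t=1}^{N} \Big( \log p(x_t \mid x_1, \ldots, x_{t-1}; \overrightarrow{\Theta}) + \log p(x_t \mid x_{t+1}, \ldots, x_N; \overleftarrow{\Theta}) \Big)$$
Here $\overrightarrow{\Theta}$ and $\overleftarrow{\Theta}$ denote the parameters of the forward and backward models. Each direction is a standard language model. The forward LSTM produces hidden states $\overrightarrow{h}_t$ capturing left context; the backward LSTM produces $\overleftarrow{h}_t$ capturing right context. The concatenation $[\overrightarrow{h}_t; \overleftarrow{h}_t]$ captures both sides.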
Let's walk through the dimensions concretely. ELMo uses a 2-layer LSTM with 4096 hidden units per direction, projected down to 512 dimensions. At each token position t, the forward LSTM produces a 512-dimensional vector $\overrightarrow{h}_t$, and the backward LSTM produces a 512-dimensional vector $\overleftarrow{h}_t$. Concatenation gives a 1024-dimensional vector at each of the two LSTM layers. The context-independent token layer (character CNN) produces a 512-dimensional vector, which is duplicated to 1024 dimensions so that all 3 levels have the same size. The final ELMo representation is a learned weighted sum across all 3 levels.
```python
# ELMo internals: dimensions at each level

# Level 0 (character CNN): projects characters → 512-dim
#   Input:  character ids for each word
#   Output: x ∈ R^512 per token, duplicated so that h_0 = [x; x] ∈ R^1024

# Level 1 (LSTM layer 1):
#   Forward:  h_1_fwd ∈ R^512 per token (left context)
#   Backward: h_1_bwd ∈ R^512 per token (right context)
#   Concat:   h_1 = [h_1_fwd; h_1_bwd] ∈ R^1024

# Level 2 (LSTM layer 2):
#   Forward:  h_2_fwd ∈ R^512 per token
#   Backward: h_2_bwd ∈ R^512 per token
#   Concat:   h_2 = [h_2_fwd; h_2_bwd] ∈ R^1024

# Final ELMo: γ * (s_0*h_0 + s_1*h_1 + s_2*h_2)
#   where s_j are softmax-normalized per-task weights
#   and γ is a per-task scalar
```
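The final ELMo vector for token t is:
$$\text{ELMo}_t^{\text{task}} = \gamma^{\text{task}} \sum_{j=0}^{2} s_j^{\text{task}} \, h_{t,j}$$
Where $s_j$ are softmax-normalized layer weights (learned together with the downstream task model, not during pre-training), $\gamma$ is a task-specific scalar, and $h_{t,j}$ is the concatenated hidden state at layer j. This gives each task a different view of the representations.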
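The weighted sum over layers is small enough to write out by hand. Below is a minimal sketch for a single token, using 4-dimensional toy vectors in place of ELMo's 1024 dimensions; the names `softmax` and `scalar_mix` and all numeric values are assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layers, raw_weights, gamma=1.0):
    """Combine per-layer vectors for one token: gamma * sum_j s_j * h_j,
    where s = softmax(raw_weights)."""
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(s[j] * layers[j][k] for j in range(len(layers)))
            for k in range(dim)]

# Three toy layer outputs for a single token (token layer + 2 LSTM layers)
layers = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Equal raw weights give s_j = 1/3 for every layer
mixed = scalar_mix(layers, raw_weights=[0.0, 0.0, 0.0], gamma=2.0)
print(mixed)
```

In a downstream model, `raw_weights` and `gamma` would be trainable parameters, so a syntactic task can learn to lean on the lower layers while a semantic task leans on the upper one.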
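```python
# ELMo: different vectors for "bank" in different contexts
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: point these at a pre-trained ELMo options JSON and weights file
options_file = "path/to/elmo_options.json"
weight_file = "path/to/elmo_weights.hdf5"
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sent1 = ["I", "went", "to", "the", "bank", "to", "deposit", "money"]
sent2 = ["I", "sat", "by", "the", "river", "bank"]

character_ids = batch_to_ids([sent1, sent2])
vectors = elmo(character_ids)["elmo_representations"][0]  # (2, max_len, 1024)
bank_financial = vectors[0, sent1.index("bank")]  # shape (1024,), financial context
bank_river = vectors[1, sent2.index("bank")]      # shape (1024,), river context

# These are DIFFERENT vectors (a cosine similarity of ≈ 0.6 is an illustrative figure),
# whereas static embeddings give identical vectors (cosine similarity 1.0)
```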
ELMo's forward and backward LSTMs are trained jointly (they share the token embedding and softmax layers) but run independently: the forward LSTM never sees the backward LSTM's hidden states, and vice versa. Their outputs are simply concatenated after the fact. This means that inside the LSTMs, the representation of a word is computed from either left context or right context, never both simultaneously. Information from both directions only meets at the concatenation point.
This is fundamentally different from BERT, where self-attention lets each token attend to all other tokens at every layer. BERT's bidirectionality is deep — left and right context interact at every layer. ELMo's is shallow — they interact only at the final concatenation.
To see why this matters, consider the sentence: "She put the book on the bank of the river." The word "bank" needs both "book" (left context) and "river" (right context) to be properly understood. In ELMo, the forward LSTM sees "She put the book on the bank" and doesn't yet know about "river." The backward LSTM, processing right to left, has seen "river the of bank" and doesn't know about "book." Each direction has partial information. Only at the concatenation do both signals combine, but they never interact during processing. In BERT, when computing the representation of "bank," the self-attention mechanism simultaneously considers both "book" and "river" (and every other word), allowing these context signals to interact and refine each other through multiple layers.
[Interactive figure: ELMo processing a sentence with a forward (teal) and a backward (orange) LSTM. At each position, the two directions are concatenated to form the contextual representation.]
ELMo improved the state of the art on six NLP benchmarks (question answering, textual entailment, semantic role labeling, coreference resolution, named entity recognition, and sentiment analysis), with relative error reductions of roughly 6-25%. The largest relative gains were on SQuAD question answering and named entity recognition. This confirmed that contextual representations capture information that static embeddings cannot.
Word sense disambiguation shows why. A static embedding gives "bank" one vector, so the vector itself carries no information about which sense is meant; any disambiguation has to come from other features. ELMo gives "bank" different vectors in different contexts, so much of the disambiguation is built into the representation itself. In the ELMo paper's intrinsic evaluation, a simple nearest-neighbor classifier over the top-layer biLM vectors was competitive with purpose-built word sense disambiguation systems.
```python
# ELMo vs static embeddings: practical difference
# Task: classify "bank" as financial or river
# (the accuracies below are illustrative, not reported results)

# With static embeddings (GloVe):
#   Input: v("bank"), the same vector regardless of context
#   → Model must rely entirely on OTHER features (nearby words)
#   → Accuracy: ~68%

# With ELMo:
#   Input: ELMo("bank" in "deposited money at the bank")
#      vs  ELMo("bank" in "sat by the river bank")
#   → These are DIFFERENT vectors: context is baked in
#   → A simple linear classifier can distinguish them
#   → Accuracy: ~88%
```
ELMo also revealed something profound about layer specialization in deep networks. The first LSTM layer primarily captures syntax — its representations are best for POS tagging and parsing. The second layer primarily captures semantics — its representations are best for word sense disambiguation and sentiment analysis. The weighted sum across layers lets each task pick its own emphasis, which is why ELMo improves diverse tasks simultaneously.
Just a few months after ELMo, Radford et al. (2018) at OpenAI took a different approach. Instead of LSTMs, they used the Transformer decoder — and instead of feature extraction, they proposed fine-tuning as the primary transfer mechanism. This was GPT (Generative Pre-trained Transformer).
GPT's two key innovations relative to ELMo were architectural (Transformers instead of LSTMs) and methodological (fine-tuning instead of feature extraction). The Transformer architecture, introduced by Vaswani et al. in 2017, replaced the sequential processing of LSTMs with parallel self-attention. This was a game-changer for two reasons: (1) self-attention captures long-range dependencies more effectively than LSTM memory, and (2) the entire sequence can be processed in parallel during training, making it dramatically faster to train on large datasets.
GPT uses 12 layers of Transformer decoder blocks. Each block contains masked self-attention (where each token can only attend to positions to its left) and a feed-forward network. The masking is crucial: it allows GPT to be trained as a language model (predict the next token) while using the Transformer's parallel training efficiency.
Each Transformer block processes the entire sequence in parallel. At each position, the self-attention mechanism computes a weighted average of all previous positions (masked to prevent looking ahead), and the feed-forward network transforms the result. The key advantage over LSTMs: position 100 can directly attend to position 1 in a single step, whereas an LSTM must propagate information through 99 sequential hidden states — suffering from the vanishing gradient problem along the way.
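The input to the first block is the sum of a token embedding and a position embedding:
$$h_0 = x W_e + W_p$$
Where x is the one-hot token vector (stacked over positions for a whole sequence), $W_e$ is the token embedding matrix (shared with the output), and $W_p$ is a learned position embedding.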
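Multiplying a one-hot vector by the embedding matrix is just a row lookup, so the input layer can be sketched without any matrix library. The function `embed` and the toy matrices below are hypothetical; GPT itself uses 768-dimensional embeddings and 512 positions.

```python
def embed(token_ids, W_e, W_p):
    """h_0 for each position: the token's row of W_e plus the position's row of W_p."""
    return [
        [e + p for e, p in zip(W_e[tok], W_p[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Toy setup: vocabulary of 3 tokens, maximum length 3, embedding size 2
W_e = [[0.25, 0.5], [0.75, 1.0], [1.5, 2.0]]
W_p = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

h0 = embed([2, 2, 0], W_e, W_p)
print(h0[0], h0[1])  # same token id, different positions, different inputs
```

The position term is what lets the otherwise order-blind attention layers tell the first occurrence of a token from the second.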
The self-attention mechanism in each layer computes:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
Where Q (queries), K (keys), and V (values) are linear projections of the input and $d_k$ is the key dimension. In GPT, the attention scores are masked so that position t can only attend to positions 1 through t. The mask is lower-triangular: scores above the diagonal are set to $-\infty$ before the softmax, ensuring that future tokens have zero attention weight.
```python
# GPT's causal (unidirectional) attention
import torch

def causal_attention(Q, K, V, d_k):
    """Masked self-attention: each position sees only itself and past positions."""
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
    # Causal mask: True above the diagonal marks future positions
    # (ones_like keeps the mask on the same device as the scores)
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float('-inf'))
    weights = torch.nn.functional.softmax(scores, dim=-1)
    return weights @ V

# Position 5 can attend to positions 1-5
# Position 5 CANNOT attend to positions 6-T
```
GPT was pre-trained on the BooksCorpus dataset (~7,000 unpublished books, roughly 800M-1B words). This is about the same amount of data as ELMo's training corpus, but the 1B Word Benchmark is shuffled at the sentence level, while the books provide long stretches of contiguous text that teach the model about narrative structure, logical reasoning, and extended discourse.
The fine-tuning procedure is conceptually simple: take the pre-trained GPT, add a single linear classification layer on top, and train the entire stack (pre-trained weights + new classification head) end-to-end with a small learning rate (6.25e-5 in the original paper). The small learning rate is crucial: it ensures the pre-trained weights change only slightly, preserving the general language understanding while adapting to the specific task.
```python
# GPT fine-tuning for sentiment classification
# 1. Pre-trained GPT: 12 Transformer layers, 117M params
# 2. Add linear head: W × h_last → num_classes
# 3. Fine-tune ALL weights with lr=6.25e-5 for 3 epochs

# Forward pass during fine-tuning (pseudocode):
# tokens = ["The", "movie", "was", "great", "<cls>"]
# h = GPT(tokens)                      # [5, 768] contextual representations
# logits = W @ h[-1]                   # use last token's representation
# loss = cross_entropy(logits, label)  # label: "positive"
# loss.backward()                      # gradients flow through ALL 117M params
```
Smith emphasizes a subtle point: the fine-tuning paradigm is a form of transfer learning that was already well-established in computer vision (ImageNet pre-training → task-specific fine-tuning). GPT brought this paradigm to NLP, showing that language models pre-trained on raw text learn representations that transfer to diverse downstream tasks — even tasks very different from language modeling.
GPT's representations are contextual (they depend on surrounding words), but only on the left context. When processing position t, GPT can only attend to positions 1 through t. It cannot see the future — the tokens to the right are masked out.
This is a design choice driven by the training objective: standard left-to-right language modeling requires this constraint. If the model could see future tokens, predicting the next token would be trivial (just look at it). The mask prevents "cheating" during training.
But this unidirectionality is a limitation for understanding tasks. When classifying the sentiment of "The movie was not very good, but the acting was incredible," a left-to-right model processing "not" doesn't yet know that "good" is coming — let alone "but the acting was incredible." A bidirectional model sees the whole sentence at once.
Smith provides a compelling thought experiment: consider the sentence "The old man the ships." When you read left-to-right, "old" seems like an adjective modifying "man." But "man" is actually a verb (meaning "to operate"), and "old" is a noun (the elderly). You need the end of the sentence ("the ships") to realize this — but a left-to-right model has already committed to its representation of "old" and "man" before seeing "ships." A bidirectional model can use the full sentence to correctly parse these garden-path sentences.
| Aspect | ELMo | GPT |
|---|---|---|
| Architecture | 2-layer biLSTM | 12-layer Transformer decoder |
| Direction | Shallow bidirectional | Unidirectional (left-to-right) |
| Parameters | 94M | 117M |
| Transfer method | Feature extraction (frozen) | Fine-tuning (all params) |
| Training data | 1B words (1B Word Benchmark) | ~0.8-1B words (BooksCorpus) |
| Context window | Entire sentence (LSTM memory) | 512 tokens (fixed window) |
[Interactive figure: GPT's causal attention mask. Each token attends only to itself and earlier tokens, never to later ones; a bidirectional view is shown for comparison.]
In October 2018, Devlin et al. combined the best of both worlds: the Transformer architecture from GPT with the bidirectionality from ELMo, but made the bidirectionality deep instead of shallow. The result was BERT, which set new state-of-the-art results on eleven NLP tasks.
Smith frames BERT as the logical culmination of two independent threads: ELMo's bidirectionality and GPT's Transformer architecture + fine-tuning. Each had a strength the other lacked. ELMo saw both directions but used an LSTM (sequential, limited memory) and froze weights during transfer. GPT used the powerful Transformer architecture and fine-tuned all weights but could only see left context. BERT combined both: Transformer encoder (parallel, deep attention) + bidirectional context + fine-tuning.
But combining them required solving a fundamental problem: how do you train a bidirectional Transformer? You can't use standard language modeling (predict the next word) because the model would simply look ahead through the unrestricted attention and read the answer. This is the "information leak" problem.
BERT's key innovation is the Masked Language Model (MLM) objective. Instead of predicting the next word (which requires left-to-right masking), BERT randomly masks 15% of tokens and predicts them from the full bidirectional context. This allows the Transformer encoder to use unrestricted self-attention — every token can attend to every other token at every layer.
The masking procedure is more nuanced than simply replacing words with [MASK]. Of the 15% selected tokens: 80% are replaced with [MASK], 10% are replaced with a random word, and 10% are left unchanged. The reason is that [MASK] never appears at fine-tuning time, so always using it would create a mismatch between pre-training and fine-tuning. The random and unchanged cases reduce that mismatch: the model cannot tell which input tokens have been replaced or which it will be asked to predict, so it has to maintain a good contextual representation of every token.
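The 80/10/10 rule is easy to get subtly wrong, so here is a minimal sketch of it. It assumes an independent 15% draw per token, which is a simplification of how a real pre-training pipeline picks positions; `mask_for_mlm` and the toy vocabulary are hypothetical names, and a seeded `random.Random` is passed in to keep the example reproducible.

```python
import random

def mask_for_mlm(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels). labels[i] holds the original token
    at selected positions and None everywhere else."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

rng = random.Random(0)
tokens = "the cat sat on the mat".split() * 50
corrupted, labels = mask_for_mlm(tokens, vocab=sorted(set(tokens)), rng=rng)

selected = [i for i, lab in enumerate(labels) if lab is not None]
print(len(selected) / len(tokens))  # fraction of positions chosen for prediction
```

The loss is computed only at the positions where `labels` is not `None`, including the ones whose input token was left unchanged.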
The model sees: "The [MASK] sat on the mat" and must predict "cat." It can use both left context ("The") and right context ("sat on the mat") simultaneously, at every layer. This is deep bidirectionality — fundamentally more powerful than ELMo's concatenation of independent directions.
Let's be precise about the "information leak" problem. Consider a bidirectional model trying to predict word t. If every token can attend to every other token (unrestricted attention), then the model at position t can simply look at position t and read the answer: the word itself. The training signal would be trivial (just copy the input), and the model would learn nothing useful. ELMo avoids this by keeping the two directions separate: the forward LSTM predicts position t from the positions before it only, and the backward LSTM predicts it from the positions after it only, so neither ever sees the word it is predicting. BERT avoids this by replacing the target word with [MASK]: even though attention is unrestricted, the answer has been removed from the input.
BERT also introduced a second pre-training objective: Next Sentence Prediction (NSP). Given two sentences, predict whether the second actually follows the first in the original text. This was intended to teach the model about inter-sentence relationships. Interestingly, later work (RoBERTa, 2019) showed that NSP doesn't help — and may actually hurt performance. The MLM objective alone is sufficient for learning powerful contextual representations. Smith mentions this as an area of active investigation.
| Aspect | ELMo | GPT | BERT |
|---|---|---|---|
| Architecture | biLSTM | Transformer decoder | Transformer encoder |
| Direction | Shallow bidir | Unidirectional | Deep bidir |
| Objective | Bidir LM (concat) | Left-to-right LM | MLM + NSP |
| Transfer | Feature extraction | Fine-tuning | Fine-tuning |
| Parameters | 94M | 117M | 110M / 340M |
| Context at each layer | Left OR right | Left only | Left AND right |
[Interactive figure: how the representation of "bank" differs across the three paradigms. Static: one fixed vector. ELMo: different per context, but left and right context don't interact during encoding. BERT: fully context-dependent at every layer.]
BERT comes in two sizes, allowing researchers to trade off between performance and compute:
| Config | Layers | Hidden | Attention Heads | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
BERT-Base is deliberately designed to match GPT's architecture (12 layers, 768 hidden, 110M parameters) — making their comparison a clean test of training objective (MLM vs left-to-right LM) rather than model size.
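The parameter counts in the table can be reproduced from the configuration alone. The sketch below makes several assumptions that are not stated in the text: a WordPiece vocabulary of 30,522 entries (the size used by `bert-base-uncased`), 512 positions, 2 segment types, a feed-forward width of 4 times the hidden size, and a pooling layer on top of the encoder. The function name is hypothetical.

```python
def bert_encoder_params(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (weights and biases)."""
    ffn = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden   # scale and shift
    # Token, position, and segment embeddings, followed by a LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden → ffn → hidden), followed by a LayerNorm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense pooling layer applied to the [CLS] position
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

print(bert_encoder_params(layers=12, hidden=768))
print(bert_encoder_params(layers=24, hidden=1024))
```

Under these assumptions the Base configuration comes out at about 109.5M parameters and the Large configuration at about 335M, slightly below the rounded 340M usually quoted. The number of attention heads does not appear because splitting the projections into heads leaves the parameter count unchanged.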
Pre-training data is also larger than GPT's: BERT uses BooksCorpus (800M words) + English Wikipedia (2,500M words) = ~3.3 billion words total. This is roughly four times GPT's training data (BooksCorpus alone), giving BERT a data advantage on top of its architectural advantage.
BERT's improvement over GPT and ELMo was not marginal — it was decisive:
| Benchmark | ELMo | GPT | BERT-Large |
|---|---|---|---|
| GLUE score | 68.6 | 72.8 | 80.5 |
| SQuAD 1.1 (F1) | 85.8 | not reported | 93.2 (ensemble) |
| MNLI (acc) | 76.4 | 82.1 | 86.7 |
The gap between BERT and GPT (same architecture family, different training objective) is larger than the gap between GPT and ELMo (different architecture, different training objective). This suggests that the bidirectional training objective matters more than the specific architecture choice, although the comparison is not perfectly controlled: BERT-Large also has roughly three times GPT's parameters and was trained on more data.
Smith makes a nuanced observation here: BERT's success doesn't prove that bidirectionality is always better. For understanding tasks (classification, QA, NLI), bidirectionality is clearly superior — the model needs to see the whole input. But for generation tasks (text completion, translation, summarization), left-to-right models have an inherent advantage because they can generate text autoregressively. This distinction between "understanding models" and "generation models" would persist for years, until GPT-3 showed that sufficiently large unidirectional models could do both.
```python
# BERT fine-tuning: adding a task-specific head
from transformers import BertForSequenceClassification

# Sentiment classification example
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,  # positive / negative
)

# Architecture:
#   Input → BERT encoder (12 layers) → [CLS] token → Linear(768, 2)
# Fine-tune ALL weights for 3 epochs with lr=2e-5

# The [CLS] token's representation captures the entire sentence
# because self-attention lets it attend to ALL other tokens
# bidirectionally; this is why BERT excels at classification
```
Smith's paper carefully distinguishes two ways to use pre-trained contextual representations. This distinction, seemingly minor, has major practical implications.
Feature extraction: run the pre-trained model on your input and extract the hidden states. Use these as fixed features, feeding them into your own downstream model without modifying the pre-trained weights. This is how ELMo is typically used.
Fine-tuning: initialize with pre-trained weights, add a thin task-specific layer, and train the entire model end-to-end with a small learning rate. This is the approach GPT and BERT take.
Let's make this concrete with numbers. For a sentiment classification task with 10K training examples:
```python
# Feature extraction approach
# 1. Run BERT once on all 10K examples (fixed cost)
# 2. Cache the [CLS] vectors: 10K × 768 floats = 30 MB
# 3. Train a linear classifier on cached features
# Total GPU time: ~5 min (BERT inference) + ~1 min (classifier)
# Accuracy: ~91%

# Fine-tuning approach
# 1. Initialize BERT + linear head
# 2. Train end-to-end for 3 epochs
# Total GPU time: ~30 min (full backprop through BERT)
# Accuracy: ~93%

# The 2% gap seems small, but on leaderboards it's the
# difference between state-of-the-art and "also participated"
```
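The 30 MB cache size quoted above can be checked with a few lines of arithmetic. This sketch assumes 32-bit floats (4 bytes each), which the text does not state explicitly; the function name is illustrative.

```python
def cached_feature_megabytes(num_examples, hidden_size=768, bytes_per_float=4):
    """Size of a cache holding one hidden_size-dimensional vector per example."""
    return num_examples * hidden_size * bytes_per_float / 1e6

size_mb = cached_feature_megabytes(10_000)
print(f"{size_mb:.2f} MB")  # 30.72 MB
```

At half precision (2 bytes per float) the same cache would take half the space, so the figure depends on the storage format you choose.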
Smith also discusses an intermediate approach: gradual unfreezing. Instead of fine-tuning all layers at once, you start by training only the task head, then progressively unfreeze the pre-trained layers, starting with the one closest to the output and working down toward the input. This reduces the risk of catastrophic forgetting while still allowing the representations to adapt. Howard and Ruder (2018) showed this approach works well in practice, though it adds hyperparameter complexity.
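A minimal sketch of such a schedule is shown below. It assumes a 12-layer encoder and unfreezes one layer per epoch, starting from the layer closest to the output (the order Howard and Ruder describe); the one-layer-per-epoch pace and the function name `trainable_layers` are assumptions made for illustration.

```python
def trainable_layers(epoch, num_layers=12):
    """Return the encoder layer indices that are trainable at a given epoch.

    Index 0 is the layer closest to the input. At epoch 0 only the task
    head trains, so no encoder layer is unfrozen. Each later epoch
    unfreezes one more layer, working from the output side downward.
    """
    num_unfrozen = min(max(epoch, 0), num_layers)
    return list(range(num_layers - num_unfrozen, num_layers))

for epoch in range(4):
    print(epoch, trainable_layers(epoch))
# 0 []
# 1 [11]
# 2 [10, 11]
# 3 [9, 10, 11]
```

In a PyTorch training loop, the returned indices would typically be applied by setting `requires_grad` on the parameters of those layers before each epoch.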
A later approach that became widely used is adapter layers (Houlsby et al., 2019). Instead of fine-tuning all parameters, you insert small trainable "adapter" modules into the frozen pre-trained layers. Only the adapters (typically 1-5% of total parameters) are trained. This gives near fine-tuning-level performance with feature-extraction-level storage efficiency (one frozen model + many small adapters, instead of many full model copies).
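To make the adapter idea concrete, here is a minimal NumPy sketch of a bottleneck adapter: project down, apply a nonlinearity, project back up, and add the input back. The bottleneck width of 64, the ReLU nonlinearity, the exact zero initialization of the up-projection, and the layout of two adapters in each of 12 layers are assumptions chosen for illustration, not the exact configuration from Houlsby et al.

```python
import numpy as np

class BottleneckAdapter:
    """Down-project, apply a nonlinearity, up-project, then add the input back."""

    def __init__(self, hidden_size=768, bottleneck=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, size=(hidden_size, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Zero up-projection (a simplification): the adapter starts as an identity map.
        self.w_up = np.zeros((bottleneck, hidden_size))
        self.b_up = np.zeros(hidden_size)

    def __call__(self, h):
        z = np.maximum(h @ self.w_down + self.b_down, 0.0)  # ReLU in the bottleneck
        return h + z @ self.w_up + self.b_up                 # residual connection

    def num_params(self):
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))

adapter = BottleneckAdapter()
h = np.random.default_rng(1).normal(size=(5, 768))  # 5 stand-in token vectors
out = adapter(h)

# Assumed layout: 2 adapters in each of 12 layers, against a ~110M-parameter model.
total_adapter_params = 2 * 12 * adapter.num_params()
fraction = total_adapter_params / 110_000_000
print(out.shape, total_adapter_params, f"{fraction:.1%}")
```

Under these assumptions the adapters add roughly 2.4M trainable parameters, about 2% of the frozen model, which falls inside the 1-5% range mentioned above.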
```python
# Three transfer learning approaches compared

# 1. Feature extraction (ELMo-style)
# Trainable params: ~100K (task head only)
# Storage: 1 shared model + 1 head per task
# Accuracy: 91%

# 2. Fine-tuning (BERT-style)
# Trainable params: 110M (all BERT + head)
# Storage: 1 full model copy per task
# Accuracy: 93%

# 3. Adapters (post-Smith, but natural evolution)
# Trainable params: ~2M (adapter layers only)
# Storage: 1 shared model + adapters per task
# Accuracy: 92.5%

# The field converged on fine-tuning for maximum performance,
# but adapters/LoRA became standard when storage is a concern
```
The distinction between feature extraction and fine-tuning also connects to a deeper question Smith raises: what do pre-trained representations actually encode? If fine-tuning works better, it suggests the pre-trained representations are good but not perfect for any specific task — they need to be "adjusted" to the target domain. If feature extraction works well, it suggests the representations are already task-ready. In practice, the answer is somewhere in between: pre-trained representations encode general linguistic knowledge that is useful for many tasks, but fine-tuning can specialize them for maximum performance on any specific task.
Smith provides a helpful analogy: feature extraction is like asking an expert for advice but not letting them see the specific problem. Fine-tuning is like hiring the expert to work on your specific problem full-time. The expert's general knowledge is the same, but the latter produces better results because the expert can adapt their approach to the specific situation.
| Factor | Feature Extraction | Fine-Tuning |
|---|---|---|
| Best when | Many tasks share one model, compute is limited | Maximum performance on one task |
| Training cost | Low (only task head trains) | Medium (all params update) |
| Data needed | Very little (100s of examples) | Little (1000s of examples) |
| Risk | Under-adaptation | Catastrophic forgetting |
| Storage | One model, many heads | Full model copy per task |
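The storage row can be made concrete with the rough parameter counts quoted earlier in this section (110M for the full model, about 100K for a task head, about 2M for adapters). The sketch below assumes 32-bit weights and ten tasks; both are illustrative assumptions, and the function name is hypothetical.

```python
BYTES_PER_PARAM = 4            # assumption: float32 weights
ENCODER_PARAMS = 110_000_000   # "110M" for the full pre-trained model
HEAD_PARAMS = 100_000          # "~100K" task head
ADAPTER_PARAMS = 2_000_000     # "~2M" adapter layers

def storage_gb(strategy, num_tasks):
    """Approximate storage in GB needed to serve num_tasks tasks."""
    if strategy == "feature_extraction":
        params = ENCODER_PARAMS + num_tasks * HEAD_PARAMS
    elif strategy == "fine_tuning":
        params = num_tasks * (ENCODER_PARAMS + HEAD_PARAMS)
    elif strategy == "adapters":
        params = ENCODER_PARAMS + num_tasks * (ADAPTER_PARAMS + HEAD_PARAMS)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return params * BYTES_PER_PARAM / 1e9

for strategy in ("feature_extraction", "fine_tuning", "adapters"):
    print(f"{strategy}: {storage_gb(strategy, num_tasks=10):.2f} GB")
```

With ten tasks, full fine-tuning needs roughly ten times the storage of the shared-model strategies, and the gap grows linearly with the number of tasks.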
[Interactive figure: the two transfer learning strategies side by side. In feature extraction, the pre-trained weights are frozen (gray). In fine-tuning, all weights are updated (colored). Performance is plotted against training steps for each approach.]
Smith discusses probing tasks — simple classifiers trained on frozen representations to test what linguistic information they encode. If a linear classifier can predict POS tags from layer 1 of BERT, then POS information is linearly decodable from that layer.
Probing studies found that contextual representations encode a great deal of linguistic structure, organized roughly hierarchically by layer depth. The representations aren't just useful; they contain a systematic encoding of linguistic knowledge:
| Linguistic Property | Where Encoded | Probe Accuracy |
|---|---|---|
| Part-of-speech tags | Lower layers (1-4) | 97%+ |
| Dependency relations | Middle layers (4-8) | 90%+ |
| Semantic roles | Upper layers (6-10) | 85%+ |
| Coreference | Upper layers (8-12) | 80%+ |
| World knowledge | Distributed across layers | Varies widely |
The layer specialization finding is remarkable. Lower layers capture surface-level features (spelling, POS), middle layers capture syntactic structure (parse trees, dependencies), and upper layers capture task-relevant semantics (sentiment, entailment). This suggests that contextual representations build meaning progressively — from form to syntax to semantics — much like how linguistic theory organizes language into levels.
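```python
# Probing experiment: what does each BERT layer know?
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

def probe_layer(layer_idx, task_data, task_labels):
    """Train a linear classifier on frozen BERT representations"""
    features = []
    for text in task_data:
        tokens = tokenizer(text, return_tensors='pt')
        with torch.no_grad():
            outputs = model(**tokens)
        # Extract representations from specific layer
        h = outputs.hidden_states[layer_idx]  # [1, seq_len, 768]
        # Average pool to one vector per sentence (fits sentence-level labels;
        # token-level tasks such as POS tagging would probe each token vector)
        features.append(h.mean(dim=1))
    X = torch.cat(features).numpy()  # [num_examples, 768]
    # Train linear probe: features → task_labels
    probe = LogisticRegression(max_iter=1000).fit(X, task_labels)
    # Accuracy tells us how much info layer_idx encodes
    # (training accuracy here; evaluate on held-out data in practice)
    return probe.score(X, task_labels)

# Result: POS tagging peaks at layer 1-2
# Result: Dependency parsing peaks at layer 4-6
# Result: Sentiment peaks at layer 10-12
```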
Smith also discusses the probing controversy. Some researchers argue that high probing accuracy doesn't prove the model "knows" a linguistic property — it might just be an artifact of the probe's capacity. A sufficiently complex probe could extract any information from any representation, even random ones. The community addressed this by using minimal probes (linear classifiers), control tasks (random baselines), and information-theoretic analyses to validate probing results.
Let's bring the whole progression together in one interactive simulation. Here you can see how the same word gets represented differently across the three eras: static (Word2Vec/GloVe), shallow contextual (ELMo), and deep contextual (BERT).
This is the key experiment that validates the entire paradigm shift. If contextual representations really capture meaning-in-context, then the same word in different contexts should have different representations — and the degree of difference should correlate with how different the meanings are. Static embeddings give similarity of 1.0 (identical vectors, by definition). ELMo reduces this to ~0.6 (some differentiation). BERT reduces it further to ~0.4 (substantial differentiation). The progression is clear and quantifiable.
Select a polysemous word and two different contexts. Watch how each paradigm represents the word: static puts it in one place regardless of context; ELMo gives different vectors but from independent left/right processing; BERT produces deeply contextual vectors where all context interacts. The similarity scores show how well each paradigm distinguishes meanings.
The key quantitative finding: cosine similarity between the same word in different contexts:
| Model | "bank" (finance vs river) | "bat" (animal vs sport) | "spring" (season vs coil) |
|---|---|---|---|
| Static (GloVe) | 1.00 (identical) | 1.00 (identical) | 1.00 (identical) |
| ELMo | 0.65 | 0.58 | 0.72 |
| BERT layer 12 | 0.42 | 0.35 | 0.48 |
Ethayarajh (2019) studied the geometry of contextual representations and found something surprising: as you go deeper into BERT, the representations become more anisotropic — they occupy a narrow cone in the high-dimensional space rather than being uniformly distributed. This means contextual representations aren't using the full capacity of the vector space. Later work (whitening, isotropy correction) addressed this geometric inefficiency.
The anisotropy finding has practical implications. If all representations cluster in a narrow cone, cosine similarity between any two words tends to be high (~0.5-0.7) regardless of meaning. This makes it harder to distinguish similar from dissimilar words using raw cosine similarity. Post-processing techniques like mean centering and whitening can mitigate this, pushing the representations to use more of the available space.
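The effect of mean centering can be illustrated on synthetic data. The sketch below does not use real BERT vectors: it builds random vectors that share a large common offset (a stand-in for the "narrow cone"), then compares the average pairwise cosine similarity before and after subtracting the mean. The dimensions, offset size, and function name are assumptions chosen for illustration.

```python
import numpy as np

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    off_diagonal = sims[~np.eye(len(vectors), dtype=bool)]
    return float(off_diagonal.mean())

rng = np.random.default_rng(0)
# Synthetic "anisotropic" vectors: random noise plus a shared offset direction.
shared_offset = 2.0 * np.ones(64)
vectors = rng.normal(size=(200, 64)) + shared_offset

before = mean_pairwise_cosine(vectors)

# Mean centering: subtract the average vector so the shared offset disappears.
centered = vectors - vectors.mean(axis=0)
after = mean_pairwise_cosine(centered)

print(f"before centering: {before:.2f}, after centering: {after:.2f}")
```

Before centering, almost every pair looks similar because the shared offset dominates; after centering, the average similarity drops to near zero, so the remaining differences between vectors carry more of the signal. Whitening goes one step further by also rescaling the directions, which this sketch leaves out.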
```python
# Comparing representations across eras
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model.eval()

def get_word_vec(sentence, word_idx, layer):
    # word_idx counts wordpiece tokens, with [CLS] at index 0
    tokens = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**tokens)
    return outputs.hidden_states[layer][0, word_idx]

# "bank" in financial vs river context (token index 6 in both sentences)
v1 = get_word_vec("I deposited money at the bank", 6, 12)
v2 = get_word_vec("I sat on the river bank", 6, 12)
sim = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
print(f"BERT layer 12 similarity: {sim:.3f}")  # the article reports ~0.42

# Layer 0 is the embedding output (token + position), computed before any
# self-attention, so it is static: the same token at the same position
# yields identical vectors and a similarity of 1.000
v1_0 = get_word_vec("I deposited money at the bank", 6, 0)
v2_0 = get_word_vec("I sat on the river bank", 6, 0)
sim_0 = torch.nn.functional.cosine_similarity(v1_0, v2_0, dim=0)
print(f"BERT layer 0 similarity: {sim_0:.3f}")
```
Smith's survey captured a moment of fundamental transition in NLP. The shift from static to contextual representations was not just a technical improvement — it changed what we mean by "word meaning" in computational linguistics.
The shift from static to contextual representations happened astonishingly fast — just five years from Word2Vec (2013) to BERT (2018). In that time, the field went from "words as fixed points" to "words as functions of context," fundamentally changing how NLP systems represent meaning.
| Era | Model | Key Idea | Representation |
|---|---|---|---|
| 2013 | Word2Vec | Predict context from words | One vector per word type |
| 2014 | GloVe | Factorize co-occurrence matrix | One vector per word type |
| 2018 | ELMo | Bidirectional LSTM LM | One vector per word token (shallow bidir) |
| 2018 | GPT | Transformer LM + fine-tune | One vector per word token (unidirectional) |
| 2018 | BERT | Masked LM + fine-tune | One vector per word token (deep bidir) |
Smith's paper was written in early 2019, just months after BERT's release. The field was already moving fast. Here's how the landscape evolved in the years that followed:
| Development | Contribution |
|---|---|
| RoBERTa (2019) | Showed BERT was undertrained — more data, longer training, no NSP improved results |
| GPT-2/3 (2019/2020) | Showed that scale (1.5B → 175B params) makes unidirectional models competitive with bidirectional |
| T5 (2020) | Unified all NLP as text-to-text with an encoder-decoder Transformer |
| BERT → ChatGPT (2022) | The contextual representation paradigm, scaled to 175B+ parameters + RLHF, enables conversational AI |
| Llama 3 (2024) | Open-weight models at 8-405B scale with 15T tokens of training data |
The most surprising development after Smith's paper was the emergence of in-context learning in GPT-3 (Brown et al., 2020). It turned out that sufficiently large unidirectional models (175B parameters) could perform tasks without any fine-tuning at all — just by seeing a few examples in the prompt. This is contextual representation taken to the extreme: the entire prompt (including task examples) becomes the "context" that shapes the model's representations and outputs.
```python
# The evolution of "how to use pre-trained representations"

# 2018 (ELMo): Freeze model, extract features
# embed = elmo(text)  # fixed features
# classifier = train_new_model(embed, labels)

# 2018 (BERT): Fine-tune entire model
# model = BERT + linear_head
# model = train(model, task_data, lr=2e-5)

# 2020 (GPT-3): No training at all, just in-context learning
# prompt = "Classify sentiment:\n"
#        + "Great movie! → positive\n"
#        + "Terrible plot → negative\n"
#        + "Amazing cast → "
# answer = gpt3.generate(prompt)  # "positive"

# 2023 (ChatGPT/Claude): Natural conversation
# "Is 'The Godfather' a good movie?" → detailed review
```
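Smith was prescient about several limitations that remain relevant even in 2024: compute cost, interpretability, the gap between benchmark scores and real-world performance, and bias.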
Each of these limitations has spawned entire research subfields. Efficient Transformers (Tay et al., 2020) address the compute cost. BERTology (Rogers et al., 2020) addresses interpretability. Dynamic benchmarks (e.g., Dynabench) address the evaluation gap. Fairness-aware pre-training addresses bias. Smith's paper correctly identified the frontier of research that would occupy the field for years to come.
Perhaps the deepest contribution of Smith's paper is articulating the philosophical shift. In the static embedding era, meaning was treated as a property of word types — "bank" has meaning X, regardless of context. In the contextual era, meaning is treated as a property of word tokens — "bank" in "river bank" has meaning Y, while "bank" in "money bank" has meaning Z. This is a fundamental shift in how computational linguistics models language, and it aligns with how linguists have always understood meaning: you can't define a word's meaning without considering its context of use.
The field has since moved far beyond the models Smith surveyed, but the fundamental insight of his paper remains: meaning is contextual. A word's meaning is not fixed; it's constructed in real-time from the words around it. This shift from type-level to token-level representations was the conceptual foundation for everything that followed — GPT-3, ChatGPT, Claude, and the entire large language model revolution.
Looking back from 2024, it's clear that Smith identified the right paradigm shift at the right moment. The field's trajectory since then — ever-larger pre-trained models, ever more sophisticated fine-tuning techniques, and the emergence of in-context learning — all build on the foundation of contextual representations. GPT-3's ability to do "few-shot learning" (give it a few examples in the prompt) is essentially the contextual representation paradigm taken to its extreme: the model's representation of each word depends on the entire prompt, including the task examples.
The paper also presciently warned about the environmental and social costs of large-scale pre-training, a concern that would become much more prominent in subsequent years with the work of Bender et al. (2021) and others. The tension between model capability and compute cost remains one of the central challenges in modern NLP.
```python
# The full progression: from one-hot to contextual

# 1990s: one-hot encoding
# "cat" = [0, 0, 0, ..., 1, ..., 0]   # V-dimensional sparse
# similarity("cat", "dog") = 0        # no notion of similarity

# 2013: Word2Vec/GloVe (static embeddings)
# "cat" = [0.23, -0.11, 0.87, ...]    # 300-dimensional dense
# similarity("cat", "dog") = 0.76     # captures semantics
# but: "bank" always the same vector, regardless of context

# 2018: ELMo (shallow contextual)
# "bank" in "river bank" ≠ "bank" in "money bank"
# similarity between them: 0.65 (partially differentiated)

# 2018: BERT (deep contextual)
# "bank" representations fully context-dependent
# similarity between meanings: 0.42 (well differentiated)
# Each layer adds more context → more differentiation
```
"You shall know a word by the company it keeps."
— John Rupert Firth, 1957