Noah A. Smith (University of Washington / Allen Institute for AI) — ArXiv 2019

Contextual Word Representations

A Contextual Introduction — tracing the paradigm shift from static word vectors (Word2Vec, GloVe) to context-dependent representations (ELMo, GPT, BERT). Why one vector per word isn't enough.

Prerequisites: Vectors / dot products + Neural network basics. That's it.
8 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: The Polysemy Problem

Consider the word "bank." In English, it has at least these meanings:

Sentence | Meaning of "bank"
--- | ---
"I deposited money at the bank." | Financial institution
"She sat on the river bank." | Edge of a river
"The pilot had to bank the plane." | Tilt during a turn
"Don't bank on it." | Rely on / count on

Same spelling, same pronunciation, completely different meanings. This phenomenon is called polysemy — a single word form carrying multiple related (or unrelated) meanings. Polysemy is not rare; it's the norm. The average English word has 2-3 dictionary senses. Common words have far more: "run" has over 600 senses in the OED.

Humans handle polysemy effortlessly. When you read "I deposited money at the bank," you instantly know "bank" means a financial institution — you don't even consider the river meaning. How? Because the surrounding words ("deposited," "money") activate the financial sense and suppress the others. Your brain constructs the meaning of "bank" in context, not in isolation.

But computers don't have this ability by default. Traditional NLP systems represent each word as a fixed vector or a dictionary entry. They can look up "bank" and find a list of possible meanings, but they can't automatically choose the right one for a given sentence without additional disambiguation logic. The contextual representation revolution — the subject of Smith's paper — gives computers the same ability humans have: the meaning of a word is computed dynamically from its context.

Now here's the problem for NLP: if you represent each word as a single fixed vector (as Word2Vec and GloVe do), where does "bank" go in vector space? Near "money" and "finance"? Near "river" and "shore"? Near "tilt" and "angle"? It can't be near all of them simultaneously — those regions of vector space are far apart. The static embedding must compromise, placing "bank" somewhere in between, equally bad at representing all its senses.
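The compromise can be made concrete with a toy calculation. The sketch below is only an illustration: the three "sense directions" are hypothetical 3-dimensional vectors, and the static vector is assumed to be their plain average (in a trained model the mix would be weighted by how often each sense occurs).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical sense directions, one axis per meaning of "bank".
senses = {
    "finance": [1.0, 0.0, 0.0],
    "river": [0.0, 1.0, 0.0],
    "aviation": [0.0, 0.0, 1.0],
}

# A static embedding must pick ONE point; here, the average of the senses.
static_bank = [sum(axis) / len(senses) for axis in zip(*senses.values())]

for name, vec in senses.items():
    print(name, round(cosine(static_bank, vec), 3))
```

Under these assumptions the single vector has the same mediocre similarity (about 0.577) to every sense and matches none of them well.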

This isn't just a theoretical concern. Consider building a question-answering system. A user asks: "What did the pilot do at the bank?" With static embeddings, "bank" has one representation that mixes financial and river senses. The system might retrieve answers about banking transactions when the user meant a river bank. Without context, there's no way to disambiguate. The word alone isn't enough — you need the sentence.

The scale of the problem is staggering. Zipf's meaning-frequency law observes that more frequent words tend to have more senses. The top 100 most common English words average 10+ senses each. Words like "run," "set," "get," "take," and "make" each have dozens of distinct meanings. These are the words NLP systems encounter most often, and they are exactly the words that static embeddings handle worst.

[Interactive figure: The Polysemy Problem. In a static embedding space, "bank" gets one fixed position (orange dot) that compromises between its multiple meanings; a contextual model gives "bank" a different position for each example sentence.]

The fundamental limitation of static embeddings: One word = one vector. But one word = many meanings. No single point in vector space can faithfully represent a polysemous word. You need the word's vector to change based on the surrounding context. This is the core motivation for contextual word representations.

Smith's paper surveys the field's progression from static to contextual representations — a shift that he argues is "the most significant empirical advance in NLP in the past decade." This isn't just an incremental improvement. It's a fundamental change in how we think about meaning: meaning is not a property of words, but of words in context.

The paper is particularly valuable because Smith writes as a linguist-turned-ML-researcher. He doesn't just describe the models — he explains why the shift was necessary, grounding it in linguistic theory. The distributional hypothesis ("you shall know a word by the company it keeps" — Firth, 1957) underlies both static and contextual embeddings. But static embeddings implement it incompletely: they capture what company a word typically keeps, while contextual embeddings capture what company it keeps right now.

The progression from Word2Vec (2013) through ELMo (2018) to BERT (2018) happened in just five years — a remarkably fast paradigm shift for a field. Each step was motivated by a clear limitation of the previous approach, and each step produced immediate, measurable improvements on downstream tasks.

2013-2014: Static Embeddings. Word2Vec, GloVe: one vector per word type, fixed regardless of context.
2018 Feb: ELMo. Bidirectional LSTM: one vector per word token. Context from both sides, but processed independently.
2018 Jun: GPT. Transformer decoder: contextual but left-to-right only. Fine-tuning paradigm.
2018 Oct: BERT. Transformer encoder + MLM: deep bidirectional, fine-tuned. State of the art on eleven NLP tasks at release.

Let's trace this progression, starting from the static world.

Why can't a static word embedding (like Word2Vec) adequately represent the word "bank"?

Chapter 1: Static Embeddings

Before we can appreciate contextual representations, we need to understand what they replaced. The static embedding era (2013-2018) was built on a deceptively simple idea: words that appear in similar contexts should have similar meanings. This is the distributional hypothesis (Harris 1954, Firth 1957).

Word2Vec (Mikolov et al., 2013)

Word2Vec learns a d-dimensional vector for each word by training a shallow neural network to predict context words from target words (Skip-gram) or target words from context (CBOW). The key insight: the hidden weights of this network, after training, are the word embeddings.

$$p(w_{\text{context}} \mid w_{\text{target}}) = \mathrm{softmax}\left(v_{\text{context}}^{\top} v_{\text{target}}\right)$$
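As a rough illustration of this formula, the sketch below scores every candidate context word by its dot product with the target vector and normalizes with a softmax. The 2-dimensional vectors and the three-word vocabulary are hypothetical values chosen for readability, not trained embeddings.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical target and context vectors (Skip-gram keeps two tables).
target_vecs = {"bank": [1.0, 0.5]}
context_vecs = {
    "money": [1.2, 0.1],
    "river": [0.2, 1.0],
    "tilt": [-0.5, -0.3],
}

def context_distribution(target):
    """p(context | target) over the toy vocabulary."""
    v_t = target_vecs[target]
    words = list(context_vecs)
    scores = [sum(a * b for a, b in zip(context_vecs[w], v_t)) for w in words]
    return dict(zip(words, softmax(scores)))

probs = context_distribution("bank")
print(probs)
```

With a real vocabulary the normalization runs over hundreds of thousands of words, which is why practical implementations approximate it (for example with negative sampling).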

After training on billions of words, the resulting vectors exhibit remarkable algebraic properties:

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$

This means the vector space encodes semantic relationships as directions: the "gender" direction (man→woman) can be applied to "king" to get "queen." Similar arithmetic works for country-capital, tense, plural, and many other relationships.

Let's be precise about how this works. The "gender" direction in the vector space is approximately $v_{\text{woman}} - v_{\text{man}}$. When you add this direction to $v_{\text{king}}$, you move from the "male ruler" region to the "female ruler" region, arriving near $v_{\text{queen}}$. The fact that such regular semantic structures emerge from a simple prediction task, without any explicit semantic supervision, was the breakthrough that launched the embedding era.

python
# Word analogies: king - man + woman ≈ queen
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors.bin', binary=True)

# Analogy: king : man :: ? : woman
result = wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
# [('queen', 0.71), ('monarch', 0.62), ('princess', 0.59)]

# This works because relationships are encoded as DIRECTIONS:
# v_king - v_man ≈ v_queen - v_woman  (gender direction)
# v_paris - v_france ≈ v_tokyo - v_japan  (capital direction)
# v_bigger - v_big ≈ v_faster - v_fast  (comparative direction)

GloVe (Pennington et al., 2014)

GloVe approaches the same goal from a different angle: instead of prediction, it directly factorizes the global word co-occurrence matrix. If words i and j co-occur frequently, their dot product should be high:

$$v_i^{\top} v_j + b_i + b_j \approx \log X_{ij}$$

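Here $X_{ij}$ counts how often word i appears near word j in the training corpus. GloVe produces embeddings of similar quality to Word2Vec. Its objective is a weighted least-squares fit to the log counts. Because the prediction is a product of two learned vectors, the objective is not convex, so training still relies on stochastic gradient methods (the original paper uses AdaGrad).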

The co-occurrence matrix X is typically built from a large corpus (Wikipedia + Gigaword, ~6 billion tokens). Each entry $X_{ij}$ counts how many times word i appears within a context window (usually 10 words) of word j. Words that frequently co-occur (like "ice" and "cream") get high values. The GloVe training procedure then finds vectors whose dot products approximate the log of these counts.
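The two steps described above (count co-occurrences, then fit dot products to log counts) can be sketched as follows. This is a minimal illustration, not the reference implementation: the toy corpus and window size are assumptions, the counts are unweighted (the original GloVe code down-weights distant pairs), and the weighting function with `x_max=100` and `alpha=0.75` follows the values reported in the GloVe paper.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered (word, neighbor) pair falls within the window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def glove_pair_loss(v_i, v_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error for one nonzero co-occurrence count."""
    weight = min(1.0, (x_ij / x_max) ** alpha)  # damp rare pairs, cap frequent ones
    prediction = sum(a * b for a, b in zip(v_i, v_j)) + b_i + b_j
    return weight * (prediction - math.log(x_ij)) ** 2

tokens = "ice cream is cold and ice cream is sweet".split()
X = cooccurrence_counts(tokens, window=2)
print(X[("ice", "cream")])

# Loss for one pair with hypothetical (untrained) parameters.
print(glove_pair_loss([0.1, 0.2], [0.3, 0.1], 0.0, 0.0, X[("ice", "cream")]))
```

Training sums this loss over all nonzero entries of X and adjusts the vectors and biases by gradient descent.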

A concrete example helps build intuition. The word "France" co-occurs frequently with "Paris," "wine," "Eiffel," and "language." The word "Japan" co-occurs frequently with "Tokyo," "sushi," "emperor," and "language." Because "France" and "Japan" share many co-occurrence patterns (both co-occur with "language," "culture," "country," etc.), their vectors end up close together in the embedding space — both in the "country" region. But because they also have distinct co-occurrence patterns (France with "wine," Japan with "sushi"), they're close but not identical.

Both Word2Vec and GloVe produce excellent vectors for unambiguous words. "Paris" is reliably close to "France" and "city." "Running" is close to "jogging" and "exercise." The problem only emerges with polysemous words — which happen to be the most common and important words in the language.

There were attempts to address polysemy within the static embedding framework. Multi-sense embeddings (Reisinger and Mooney, 2010; Neelakantan et al., 2014) learned multiple vectors per word, one for each sense. But most of these methods required pre-specifying the number of senses (how many senses does "run" have? 3? 10? 600?), and even those that inferred the number per word still treated senses as a discrete set, missing the smooth continuum of meaning that words exhibit in practice. A word's meaning doesn't switch discretely between a fixed set of senses; it varies continuously with context.

Another limitation of static embeddings that Smith highlights is the out-of-vocabulary (OOV) problem. Static embeddings have a fixed vocabulary learned during training. Any word not in the vocabulary (misspellings, rare technical terms, neologisms) gets no representation at all. Contextual models handle this through subword tokenization — breaking unknown words into known pieces — and through the contextual nature of the representations themselves.

The three fatal limitations

Smith identifies three fundamental limitations of static embeddings that motivated the move to contextual representations. Each limitation corresponds to an aspect of meaning that static embeddings fundamentally cannot capture:

Limitation | Example | Consequence
--- | --- | ---
No polysemy handling | "bank" gets one vector regardless of context | Ambiguous words are poorly represented
No compositionality | "hot dog" ≠ $v_{\text{hot}} + v_{\text{dog}}$ | Multi-word expressions lose meaning
No syntactic sensitivity | "dog bites man" and "man bites dog" use the same word vectors | Word order information is lost

[Interactive figure: Static Embedding Space. A simplified 2D embedding space in which each word is a fixed point; "bank" sits uncomfortably between the financial and nature clusters.]

python
# Static embedding: "bank" gets ONE vector
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-300")

# Same vector regardless of context
bank_vec = model["bank"]  # shape: (300,) — always the same

# Check similarity to different meanings
print(model.similarity("bank", "money"))   # 0.43 — some financial
print(model.similarity("bank", "river"))   # 0.36 — some nature
print(model.similarity("bank", "tilt"))    # 0.12 — almost nothing
# The static embedding compromises: moderate similarity to
# financial terms, moderate to nature, low to aviation
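The word-order limitation can be checked in a few lines. The sketch below assumes the common baseline of averaging static word vectors to represent a sentence; the 2-dimensional vectors are hypothetical values, not taken from any trained model.

```python
def average_vector(words, vectors):
    """Represent a sentence as the mean of its static word vectors."""
    dims = len(next(iter(vectors.values())))
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dims)]

# Hypothetical static vectors: one fixed vector per word type.
vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

a = average_vector(["dog", "bites", "man"], vectors)
b = average_vector(["man", "bites", "dog"], vectors)
print(a, b, a == b)
```

Both sentences map to the same point, so any model that sees only this average cannot tell who bit whom.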

Despite these limitations, static embeddings were transformative. Before Word2Vec, NLP systems used one-hot encodings — 50,000-dimensional sparse vectors with no notion of similarity. Static embeddings compressed this to 300 dense dimensions where similarity was meaningful. They just couldn't handle the fact that meaning depends on context.

To appreciate the magnitude of this shift: before Word2Vec, the standard approach to representing "cat" and "dog" gave them zero similarity (different one-hot indices). After Word2Vec, "cat" and "dog" had cosine similarity of ~0.76 — correctly reflecting their semantic relatedness. This was revolutionary for every NLP task: sentiment analysis, machine translation, question answering, and more.

The specific training procedure for Word2Vec Skip-gram works as follows: for each word in the training corpus, take a window of surrounding words (typically 5 on each side). The model learns to predict these context words from the target word. After training on billions of word pairs, the hidden weights of the prediction network become the word vectors.
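The windowing step described above can be sketched in plain Python. This is a simplified illustration: it uses a fixed window, whereas the original Word2Vec implementation also samples a smaller window at random for each position and subsamples very frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) training pairs for a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
cat_pairs = [p for p in pairs if p[0] == "cat"]
print(cat_pairs)
```

Each pair becomes one prediction problem: given the target word's vector, assign high probability to the observed context word.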

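python
# Word2Vec: training mechanics
# Input: "The cat sat on the mat"
# For target "cat" with window=2:
#   Training pairs: (cat, The), (cat, sat), (cat, on)
# With negative sampling, each pair is scored as
#   P(pair is real | target, context) = sigmoid(v_context · v_target)

from gensim.models import Word2Vec

# Toy corpus so the snippet runs; real training uses billions of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["she", "sat", "on", "the", "river", "bank"],
    ["he", "deposited", "money", "at", "the", "bank"],
]
# sg=1 selects Skip-gram; min_count=1 keeps every word of the tiny corpus
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)

# Every occurrence of "bank" updates the same row of the embedding table
bank_vec = model.wv["bank"]  # shape: (300,)

# On a large corpus, model.wv.most_similar("king", topn=5) returns
# neighbors such as "queen", "prince", "monarch" (scores depend on the corpus)

# But ALL occurrences of "bank" contribute to ONE vector:
# "money bank" + "river bank" + "bank shot" → compromised average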

The key insight from Smith: Static embeddings solve the "similarity problem" (words that mean similar things should have similar vectors) but not the "identity problem" (the same word can mean different things). Contextual embeddings solve both. This distinction is fundamental because the identity problem affects the most frequent (and therefore most important) words in the language.

What is the distributional hypothesis, and how do Word2Vec and GloVe use it?

Chapter 2: ELMo — The First Contextual Representations

In February 2018, Peters et al. introduced ELMo (Embeddings from Language Models), and contextual word representations were born. The idea was elegant: instead of learning one vector per word type, learn a function that produces a vector for each word token — a different vector depending on the surrounding sentence.

The core insight was hiding in plain sight: language models already produce contextual representations. An LSTM language model, as it processes a sentence word by word, builds up a hidden state that depends on all previous words. This hidden state is already a context-dependent representation of the current position. Why not use it as a word embedding?

The answer to "why didn't anyone think of this sooner?" is partly computational. In 2013-2017, language models were small and trained on limited data. Their hidden states weren't rich enough to be useful as general-purpose embeddings. By 2018, language models had grown large enough (ELMo has roughly 94M parameters and was trained on about 1 billion words) that their internal representations had become genuinely useful, encoding syntax, semantics, and world knowledge.

Architecture: Bidirectional LSTM

The name ELMo — Embeddings from Language Models — captures the key idea: use a pre-trained language model as an embedding function. The language model is trained to predict the next word in a sequence (the same objective used in GPT, but with an LSTM instead of a Transformer). After training, the model's internal hidden states become the contextual embeddings.

ELMo uses a two-layer bidirectional LSTM trained as a language model:

Forward LSTM: reads left → right. At position t, predicts word t+1 from words 1...t. Captures left context.
Backward LSTM: reads right → left. At position t, predicts word t-1 from words T...t. Captures right context.
Combine: concatenate forward and backward hidden states at each position. Each token now has a context-dependent representation.

The training objective is the joint log-likelihood of both directions:

$$L = \sum_{t=1}^{T} \left[ \log p(w_t \mid w_1, \dots, w_{t-1}) + \log p(w_t \mid w_{t+1}, \dots, w_T) \right]$$

Each direction is a standard language model. The forward LSTM produces hidden states $\overrightarrow{h}_t$ capturing left context; the backward LSTM produces $\overleftarrow{h}_t$ capturing right context. The concatenation $[\overrightarrow{h}_t; \overleftarrow{h}_t]$ captures both sides.

Let's walk through the dimensions concretely. ELMo uses a 2-layer LSTM with 4096 hidden units per direction, projected down to 512 dimensions. At each token position t, the forward LSTM produces a 512-dimensional vector $\overrightarrow{h}_t$, and the backward LSTM produces a 512-dimensional vector $\overleftarrow{h}_t$. Concatenation gives a 1024-dimensional vector at each LSTM layer; the context-independent character-based token embedding (512 dimensions) is duplicated to reach the same width. This yields 3 levels (character embedding + 2 LSTM layers), and the final ELMo representation is a learned weighted sum across all 3.

python
# ELMo internals: dimensions at each level
# Level 0 (character CNN): projects characters → 512-dim
#   Input: character ids for each word
#   Output: x ∈ R^512 per token, duplicated to h_0 = [x; x] ∈ R^1024
#           so it matches the width of the LSTM layers

# Level 1 (LSTM layer 1):
#   Forward:  h_1_fwd ∈ R^512 per token (left context)
#   Backward: h_1_bwd ∈ R^512 per token (right context)
#   Concat:   h_1 = [h_1_fwd; h_1_bwd] ∈ R^1024

# Level 2 (LSTM layer 2):
#   Forward:  h_2_fwd ∈ R^512 per token
#   Backward: h_2_bwd ∈ R^512 per token
#   Concat:   h_2 = [h_2_fwd; h_2_bwd] ∈ R^1024

# Final ELMo: γ * (s_0*h_0 + s_1*h_1 + s_2*h_2)
# where s_j are softmax-normalized per-task weights
# and γ is a per-task scalar

ELMo's key innovation: using ALL layers. Previous work used only the top-layer hidden states. ELMo showed that different layers capture different information: layer 1 captures syntax (POS tags, parse structure), layer 2 captures semantics (word sense, NER). The final ELMo representation is a learned weighted sum across all layers, letting each downstream task choose which layer to emphasize.

$$\mathrm{ELMo}_t = \gamma \sum_{j=0}^{L} s_j \, h_{t,j}$$

Here $s_j$ are softmax-normalized layer weights (learned together with the downstream task model, not during pre-training; the language model's own weights stay frozen), $\gamma$ is a task-specific scalar, and $h_{t,j}$ is the concatenated hidden state at layer j. This gives each task a different view of the representations.
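The weighted sum can be written out directly. The sketch below is a plain-Python illustration for a single token; the 4-dimensional layer vectors, the raw weights, and the function name `scalar_mix` are hypothetical (real ELMo layers are 1024-dimensional and the weights are learned by the task model).

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layers, raw_weights, gamma):
    """gamma * sum_j softmax(raw_weights)[j] * layers[j], for one token."""
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(s_j * h[d] for s_j, h in zip(s, layers)) for d in range(dim)]

# Toy layer outputs for one token: character layer, LSTM layer 1, LSTM layer 2.
h0 = [1.0, 0.0, 0.0, 0.0]
h1 = [0.0, 1.0, 0.0, 0.0]
h2 = [0.0, 0.0, 1.0, 0.0]

# Equal raw weights give each layer a weight of 1/3.
elmo_t = scalar_mix([h0, h1, h2], raw_weights=[0.0, 0.0, 0.0], gamma=1.0)
print(elmo_t)

# A task that prefers syntax could learn a larger weight for layer 1.
syntax_heavy = scalar_mix([h0, h1, h2], raw_weights=[0.0, 2.0, 0.0], gamma=1.0)
print(syntax_heavy)
```

Because only the three raw weights and gamma are task-specific, each downstream model adds just four parameters on top of the frozen language model.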

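python
# ELMo: different vectors for "bank" in different contexts
import torch.nn.functional as F
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholders: point these at a downloaded ELMo options/weights pair
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sent1 = ["I", "went", "to", "the", "bank", "to", "deposit", "money"]
sent2 = ["I", "sat", "by", "the", "river", "bank"]

character_ids = batch_to_ids([sent1, sent2])
reps = elmo(character_ids)["elmo_representations"][0]  # (2, max_len, 1024)

bank_fin = reps[0, sent1.index("bank")]    # shape (1024,), financial context
bank_river = reps[1, sent2.index("bank")]  # shape (1024,), river context
print(F.cosine_similarity(bank_fin, bank_river, dim=0))
# These are DIFFERENT vectors: the similarity is below 1.0
# (static embeddings give exactly 1.0, since the two vectors are identical)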

Why "shallow" bidirectionality?

ELMo's two LSTMs run independently: the forward LSTM never sees the backward LSTM's hidden states, and vice versa. (The two directions share the token embedding and softmax parameters, but each keeps its own LSTM weights.) Their outputs are simply concatenated after the fact. This means that within each direction, the representation of a word is based on either left context or right context, never both simultaneously. Information from the two directions meets only at the concatenation point.

This is fundamentally different from BERT, where self-attention lets each token attend to all other tokens at every layer. BERT's bidirectionality is deep — left and right context interact at every layer. ELMo's is shallow — they interact only at the final concatenation.

To see why this matters, consider the sentence: "She put the book on the bank of the river." The word "bank" needs both "book" (left context) and "river" (right context) to be properly understood. In ELMo, the forward LSTM sees "She put the book on the bank" and doesn't yet know about "river." The backward LSTM sees "river the of bank" (processing right to left) and doesn't know about "book." Each direction has partial information. Only at the concatenation do both signals combine, but they never interact during processing. In BERT, when computing the representation of "bank," the self-attention mechanism simultaneously considers both "book" and "river" (and every other word), allowing these context signals to interact and refine each other through multiple layers.
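The difference in what each mechanism can see at the position of "bank" can be listed explicitly. The helper below is a hypothetical illustration (the function name and dictionary keys are not from any library); it only enumerates visible tokens and does not run a model.

```python
def visible_context(tokens, position):
    """Tokens each mechanism has access to when representing tokens[position]."""
    return {
        # Forward LSTM: everything up to and including the position.
        "forward_lstm": tokens[: position + 1],
        # Backward LSTM: reads from the end of the sentence back to the position.
        "backward_lstm": tokens[position:][::-1],
        # Unrestricted self-attention: the whole sentence at every layer.
        "self_attention": list(tokens),
    }

tokens = "She put the book on the bank of the river".split()
ctx = visible_context(tokens, tokens.index("bank"))
for name, seen in ctx.items():
    print(f"{name:15s} {' '.join(seen)}")
```

Neither LSTM direction ever holds "book" and "river" at the same time, while self-attention has both available in a single step.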

[Interactive figure: ELMo Architecture: Bidirectional LSTM. A sentence is processed by a forward (teal) and a backward (orange) LSTM; at each position the two directions are concatenated to form the contextual representation.]

Impact

ELMo improved the state of the art on six NLP benchmarks (question answering, textual entailment, semantic role labeling, coreference resolution, named entity recognition, and sentiment analysis), with relative error reductions of roughly 6-25% over strong baselines. Separate probing experiments in the same paper showed that the representations also encode word sense and syntactic information. This confirmed that contextual representations capture information that static embeddings fundamentally cannot.

The most direct evidence came from word sense disambiguation. A static embedding gives "bank" one vector, so any distinction between the financial and river senses has to come from other features. In Peters et al.'s intrinsic evaluation, a simple nearest-neighbor classifier over ELMo's second-layer vectors reached an F1 of about 69, competitive with specialized WSD systems of the time. Because ELMo gives "bank" different vectors in different contexts, much of the disambiguation is built into the representation itself.

python
# ELMo vs static embeddings: practical difference
# Task: classify "bank" as financial or river

# With static embeddings (GloVe):
# Input: v("bank"), the same vector regardless of context
# → Model must rely entirely on OTHER features (nearby words)
# → The "bank" vector itself carries no sense information

# With ELMo:
# Input: ELMo("bank" in "deposited money at the bank")
#        vs ELMo("bank" in "sat by the river bank")
# → These are DIFFERENT vectors: context is baked in
# → Even a nearest-neighbor or simple linear classifier can separate them

ELMo also revealed something profound about layer specialization in deep networks. The first LSTM layer primarily captures syntax — its representations are best for POS tagging and parsing. The second layer primarily captures semantics — its representations are best for word sense disambiguation and sentiment analysis. The weighted sum across layers lets each task pick its own emphasis, which is why ELMo improves diverse tasks simultaneously.

The "just concatenate" era: ELMo was typically used by simply concatenating its output vectors with existing task-specific features. This plug-and-play simplicity was a key driver of adoption — you didn't need to redesign your entire model. Just replace GloVe vectors with ELMo vectors and watch accuracy jump. The simplicity was both a strength (easy adoption) and a limitation (the task model couldn't influence the representations).
What does ELMo's "shallow bidirectionality" mean, and how is it different from BERT's approach?

Chapter 3: GPT — Transformers Enter the Game

Just a few months after ELMo, Radford et al. (2018) at OpenAI took a different approach. Instead of LSTMs, they used the Transformer decoder — and instead of feature extraction, they proposed fine-tuning as the primary transfer mechanism. This was GPT (Generative Pre-trained Transformer).

GPT's two key innovations relative to ELMo were architectural (Transformers instead of LSTMs) and methodological (fine-tuning instead of feature extraction). The Transformer architecture, introduced by Vaswani et al. in 2017, replaced the sequential processing of LSTMs with parallel self-attention. This was a game-changer for two reasons: (1) self-attention captures long-range dependencies more effectively than LSTM memory, and (2) the entire sequence can be processed in parallel during training, making it dramatically faster to train on large datasets.

Architecture: Transformer decoder stack

GPT uses 12 layers of Transformer decoder blocks. Each block contains masked self-attention (where each token can only attend to positions to its left) and a feed-forward network. The masking is crucial: it allows GPT to be trained as a language model (predict the next token) while using the Transformer's parallel training efficiency.

Each Transformer block processes the entire sequence in parallel. At each position, the self-attention mechanism computes a weighted average over the current and all previous positions (masked to prevent looking ahead), and the feed-forward network transforms the result. The key advantage over LSTMs: position 100 can directly attend to position 1 in a single step, whereas an LSTM must propagate information through 99 sequential hidden states, suffering from the vanishing gradient problem along the way.

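$$h_0 = x W_e + W_p$$
$$h_l = \mathrm{TransformerBlock}(h_{l-1}) \quad \text{for } l = 1, \dots, 12$$
$$p(w_{t+1} \mid w_1, \dots, w_t) = \mathrm{softmax}\left(h_{12,t} W_e^{\top}\right)$$

Here x is the one-hot token (row) vector, $W_e$ is the token embedding matrix (shared with the output layer), and $W_p$ is a learned position embedding. The final hidden state at position t is used to predict the token at position t+1.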
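The input and output ends of this stack can be sketched without the Transformer blocks in between. In the plain-Python illustration below, the vocabulary size, model width, and random matrices are hypothetical stand-ins, and the 12 blocks are skipped entirely; the point is only to show the embedding lookup, the added position embedding, and the tied output projection through the same matrix `W_e`.

```python
import math
import random

rng = random.Random(0)
vocab_size, max_len, d_model = 10, 8, 4  # toy sizes (GPT uses d_model=768)

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

W_e = random_matrix(vocab_size, d_model)  # token embeddings, reused for output
W_p = random_matrix(max_len, d_model)     # learned position embeddings

def embed(token_ids):
    """h_0: token embedding plus position embedding at each position."""
    return [
        [e + p for e, p in zip(W_e[tok], W_p[pos])]
        for pos, tok in enumerate(token_ids)
    ]

def next_token_distribution(h_t):
    """softmax(h_t W_e^T): score every vocabulary item with the tied matrix."""
    logits = [sum(a * b for a, b in zip(h_t, row)) for row in W_e]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

token_ids = [3, 1, 4, 1]
h0 = embed(token_ids)
# Stand-in for the Transformer blocks: use h0 directly as the final state.
probs = next_token_distribution(h0[-1])
print(len(h0), len(h0[0]), len(probs))
```

Note that token 1 appears at two positions and receives two different input vectors, because the position embedding differs.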

The self-attention mechanism in each layer computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

Where Q (queries), K (keys), and V (values) are linear projections of the input. In GPT, the attention scores are masked so that position t can only attend to positions 1 through t. This mask is a lower-triangular matrix applied before the softmax, ensuring that future tokens have zero attention weight.

python
# GPT's causal (unidirectional) attention
import torch
import torch.nn.functional as F

def causal_attention(Q, K, V, d_k):
    """Masked self-attention: each position sees only itself and past positions."""
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
    # Causal mask: True above the diagonal marks future positions
    seq_len = scores.size(-1)
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device),
        diagonal=1,
    )
    scores = scores.masked_fill(mask, float('-inf'))
    # After the softmax, masked (future) positions get exactly zero weight:
    # position 5 can attend to positions 1-5 but CANNOT attend to positions 6-T
    weights = F.softmax(scores, dim=-1)
    return weights @ V

GPT was pre-trained on the BooksCorpus dataset (~7,000 unpublished books, ~1 billion words). This is roughly the same size as ELMo's training corpus (the 1B Word Benchmark), but that benchmark is shuffled at the sentence level, which destroys long-range structure. The books provide long stretches of contiguous text that teach the model about narrative structure, logical reasoning, and extended discourse, capabilities that isolated sentences don't develop.

GPT's key contribution was the pre-train → fine-tune paradigm. ELMo froze the language model and used its outputs as features. GPT fine-tuned the entire model for each downstream task, modifying all 117M parameters. This was riskier (you could catastrophically forget the pre-trained knowledge) but more powerful — the representations could adapt specifically to the task.

The fine-tuning procedure is conceptually simple: take the pre-trained GPT, add a single linear classification layer on top, and train the entire stack (pre-trained weights + new classification head) end-to-end with a small learning rate (6.25e-5 in the original paper, for about 3 epochs). The small learning rate is crucial: it ensures the pre-trained weights change only slightly, preserving the general language understanding while adapting to the specific task.

python
# GPT fine-tuning for sentiment classification
# 1. Pre-trained GPT: 12 Transformer layers, 117M params
# 2. Add linear head: W × h_last → num_classes
# 3. Fine-tune ALL weights with lr=6.25e-5 for 3 epochs

# Forward pass during fine-tuning:
# tokens = ["The", "movie", "was", "great", "<cls>"]
# h = GPT(tokens)         # [5, 768], contextual representations
# logits = W @ h[-1]      # use last token's representation
# loss = cross_entropy(logits, label)  # "positive"
# loss.backward()         # gradients flow through ALL 117M params

Smith emphasizes a subtle point: the fine-tuning paradigm is a form of transfer learning that was already well-established in computer vision (ImageNet pre-training → task-specific fine-tuning). GPT brought this paradigm to NLP, showing that language models pre-trained on raw text learn representations that transfer to diverse downstream tasks — even tasks very different from language modeling.

The unidirectional limitation

GPT's representations are contextual (they depend on surrounding words), but only on the left context. When processing position t, GPT can only attend to positions 1 through t. It cannot see the future — the tokens to the right are masked out.

This is a design choice driven by the training objective: standard left-to-right language modeling requires this constraint. If the model could see future tokens, predicting the next token would be trivial (just look at it). The mask prevents "cheating" during training.

But this unidirectionality is a limitation for understanding tasks. When classifying the sentiment of "The movie was not very good, but the acting was incredible," a left-to-right model processing "not" doesn't yet know that "good" is coming — let alone "but the acting was incredible." A bidirectional model sees the whole sentence at once.

Smith provides a compelling thought experiment: consider the sentence "The old man the ships." When you read left-to-right, "old" seems like an adjective modifying "man." But "man" is actually a verb (meaning "to staff or crew"), and "old" is a noun (the elderly). You need the end of the sentence ("the ships") to realize this, but a left-to-right model has already committed to its representations of "old" and "man" before seeing "ships." A bidirectional model can use the full sentence to parse such garden-path sentences correctly.

The generation-understanding tradeoff: GPT's unidirectionality is not a bug — it's a necessary design choice for a model that generates text. To generate the next word, you can't see future words (they don't exist yet). ELMo's bidirectionality is natural because ELMo only produces representations, never generates text. This tension between generation and understanding capabilities would become one of the central design debates in NLP.

Aspect | ELMo | GPT
--- | --- | ---
Architecture | 2-layer biLSTM | 12-layer Transformer decoder
Direction | Shallow bidirectional | Unidirectional (left-to-right)
Parameters | 94M | 117M
Transfer method | Feature extraction (frozen) | Fine-tuning (all params)
Training data | ~1B words (1B Word Benchmark) | ~1B words (BooksCorpus)
Context window | Entire sentence (LSTM memory) | 512 tokens (fixed window)

[Interactive figure: GPT Unidirectional Attention. For a selected token, shows which positions GPT can attend to: the causal mask lets each token see itself and previous tokens, never future ones, in contrast to the bidirectional view.]

What is the key difference between how ELMo and GPT transfer pre-trained knowledge to downstream tasks?

Chapter 4: BERT — Deep Bidirectionality

In October 2018, Devlin et al. combined the best of both worlds: the Transformer architecture from GPT with the bidirectionality from ELMo, but made the bidirectionality deep instead of shallow. The result was BERT, which set new state-of-the-art results on eleven NLP tasks.

Smith frames BERT as the logical culmination of two independent threads: ELMo's bidirectionality and GPT's Transformer architecture + fine-tuning. Each had a strength the other lacked. ELMo saw both directions but used an LSTM (sequential, limited memory) and froze weights during transfer. GPT used the powerful Transformer architecture and fine-tuned all weights but could only see left context. BERT combined both: Transformer encoder (parallel, deep attention) + bidirectional context + fine-tuning.

But combining them required solving a fundamental problem: how do you train a bidirectional Transformer? You can't use standard language modeling (predict the next word) because the model would simply look ahead through the unrestricted attention and read the answer. This is the "information leak" problem.

The masking trick

BERT's key innovation is the Masked Language Model (MLM) objective. Instead of predicting the next word (which requires left-to-right masking), BERT randomly masks 15% of tokens and predicts them from the full bidirectional context. This allows the Transformer encoder to use unrestricted self-attention — every token can attend to every other token at every layer.

The masking procedure is more nuanced than simply replacing words with [MASK]. Of the 15% selected tokens: 80% are replaced with [MASK], 10% are replaced with a random word, and 10% are left unchanged. Because [MASK] never appears in downstream inputs, the 10% random replacement keeps the model from learning to make predictions only at [MASK] positions. The 10% unchanged tokens force the model to maintain good representations even for tokens that aren't masked: it never knows which tokens it might be asked to predict.
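The 80/10/10 rule can be sketched as a small corruption function. This is a simplified illustration: each token is selected independently with probability 0.15 (the released BERT code instead picks a fixed number of positions per sequence), and the function name, toy vocabulary, and seed are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted tokens, {position: original token}) for MLM training."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # not selected: nothing to predict here
        targets[i] = token                    # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80% of selected: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original token in place
    return corrupted, targets

sentence = "the cat sat on the mat".split()
vocab = sorted(set(sentence))
original = sentence * 200                     # repeat to get stable proportions
corrupted, targets = mask_tokens(original, vocab)

n_masked = sum(1 for i in targets if corrupted[i] == MASK)
print(len(targets) / len(original), n_masked / len(targets))
```

The loss is computed only at the positions recorded in `targets`; every other position serves purely as context.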

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \text{masked}} \log p\left(x_i \mid x_{\setminus \text{masked}}\right)$$

The model sees: "The [MASK] sat on the mat" and must predict "cat." It can use both left context ("The") and right context ("sat on the mat") simultaneously, at every layer. This is deep bidirectionality — fundamentally more powerful than ELMo's concatenation of independent directions.
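As a worked instance of the loss above, the sketch below scores the single masked position in "The [MASK] sat on the mat". The probability values are hypothetical model outputs chosen for illustration, and `mlm_loss` is a name introduced here, not a library function.

```python
import math

def mlm_loss(predicted_probs, labels):
    """Sum of -log p(x_i | context) over positions that carry a label."""
    loss = 0.0
    for probs, label in zip(predicted_probs, labels):
        if label is None:
            continue  # unmasked position: contributes no loss term
        loss -= math.log(probs[label])
    return loss

# "The [MASK] sat on the mat": only position 1 is scored.
labels = [None, "cat", None, None, None, None]
# Hypothetical model distribution at position 1 (other positions are ignored).
predicted_probs = [{}, {"cat": 0.7, "dog": 0.2, "mat": 0.1}, {}, {}, {}, {}]

print(round(mlm_loss(predicted_probs, labels), 4))  # 0.3567, i.e. -ln(0.7)
```

A model that put probability 1.0 on "cat" would reach a loss of 0; the more mass it spreads over wrong tokens, the larger the loss.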

Smith's key observation: The progression from ELMo to BERT is not just about better architectures — it's about overcoming the "information leak" problem. In standard language modeling, bidirectionality causes the answer to leak through attention. ELMo sidestepped this by keeping directions separate. BERT solved it directly: mask the answer, then use full bidirectional attention. The masking is the innovation, not the architecture.

Let's be precise about the "information leak" problem. Consider a bidirectional model trying to predict word t. If every token can attend to every other token (unrestricted attention), then the model at position t can simply look at position t and read the answer: the word itself. The training signal would be trivial (just copy the input), and the model would learn nothing useful. ELMo avoids this by processing each direction independently: when predicting word t, the forward LSTM has seen only the words before t and the backward LSTM only the words after t, so neither has seen word t itself. BERT avoids this by replacing the target word with [MASK]: even though attention is unrestricted, the answer has been removed from the input.
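The difference between the two attention patterns can be made concrete by printing the masks. This is a small sketch in plain Python; the helper names are illustrative, and real implementations store these masks as tensors.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    """Unrestricted attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

def show(mask):
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)

n = 4
print("causal (GPT-style):")
print(show(causal_mask(n)))
print("unrestricted (BERT-style):")
print(show(full_mask(n)))
```

In a left-to-right language model the output at position i is used to predict token i+1, so the causal mask never exposes the target. With the unrestricted mask every row is all ones, which is why BERT has to remove the target from the input rather than hide it with the mask.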

BERT also introduced a second pre-training objective: Next Sentence Prediction (NSP). Given two sentences, predict whether the second actually follows the first in the original text. This was intended to teach the model about inter-sentence relationships. Interestingly, later work (RoBERTa, 2019) showed that NSP doesn't help — and may actually hurt performance. The MLM objective alone is sufficient for learning powerful contextual representations. Smith mentions this as an area of active investigation.
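A minimal sketch of how NSP training pairs might be built is shown below. The 50/50 split between true and random continuations follows the BERT paper; drawing the negative sentence from a different document, the toy data, and the function name `make_nsp_pairs` are simplifying assumptions for this example.

```python
import random

def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) examples.

    For each adjacent sentence pair, keep the true next sentence with
    probability 0.5; otherwise swap in a sentence from another document.
    """
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                examples.append((doc[i], doc[i + 1], True))
            else:
                other_idx = rng.choice(
                    [j for j in range(len(documents)) if j != doc_idx]
                )
                examples.append((doc[i], rng.choice(documents[other_idx]), False))
    return examples

rng = random.Random(0)
documents = [
    ["a1", "a2", "a3"],  # toy "sentences" from document A
    ["b1", "b2", "b3"],  # toy "sentences" from document B
]
for sent_a, sent_b, is_next in make_nsp_pairs(documents, rng):
    print(sent_a, sent_b, is_next)
```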

BERT vs ELMo vs GPT

| Aspect | ELMo | GPT | BERT |
| --- | --- | --- | --- |
| Architecture | biLSTM | Transformer decoder | Transformer encoder |
| Direction | Shallow bidir | Unidirectional | Deep bidir |
| Objective | Bidir LM (concat) | Left-to-right LM | MLM + NSP |
| Transfer | Feature extraction | Fine-tuning | Fine-tuning |
| Parameters | 94M | 117M | 110M / 340M |
| Context at each layer | Left OR right | Left only | Left AND right |

Three Paradigms: Static → Shallow Bidir → Deep Bidir

Compare how the representation of "bank" changes across the three paradigms. Static: one fixed vector. ELMo: different per context, but left/right don't interact during encoding. BERT: fully context-dependent at every layer.

BERT configurations

BERT comes in two sizes, allowing researchers to trade off between performance and compute:

| Config | Layers | Hidden | Attention Heads | Parameters |
| --- | --- | --- | --- | --- |
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |

BERT-Base was deliberately sized to match GPT (12 layers, 768 hidden units, roughly 110M parameters), so comparing the two mostly isolates the training objective (MLM vs left-to-right LM) rather than model size.
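The parameter counts in the table can be reproduced with a little arithmetic. The sketch below assumes the details of the released `bert-base-uncased` and `bert-large-uncased` checkpoints that the table does not list: a 30,522-token vocabulary, 512 positions, 2 segment types, a feed-forward width of 4x the hidden size, and a final pooler layer. The function name is hypothetical.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab=2, ffn_mult=4):
    """Approximate weight count of a BERT encoder (assumed layout, see above)."""
    layer_norm = 2 * hidden                              # scale + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)           # Q, K, V, output proj
    ffn_hidden = ffn_mult * hidden
    feed_forward = (hidden * ffn_hidden + ffn_hidden) + (ffn_hidden * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden                    # dense layer on [CLS]
    return embeddings + layers * per_layer + pooler

print(f"BERT-Base:  {bert_param_count(12, 768):,}")    # 109,482,240
print(f"BERT-Large: {bert_param_count(24, 1024):,}")   # 335,141,888
```

The number of attention heads does not appear in the formula: heads split the hidden dimension rather than adding weights. The results round to the commonly quoted 110M and 340M figures.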

Pre-training data is also larger than GPT's: BERT uses BooksCorpus (800M words) + English Wikipedia (2,500M words) = ~3.3 billion words total. GPT was trained on BooksCorpus alone, so this is roughly 4x GPT's training data, giving BERT a data advantage on top of its architectural advantage.

The results

BERT's improvement over GPT and ELMo was not marginal — it was decisive:

| Benchmark | ELMo | GPT | BERT-Large |
| --- | --- | --- | --- |
| GLUE score | 68.6 | 72.8 | 80.5 |
| SQuAD 1.1 (F1) | 85.8 | 89.0 | 93.2 |
| MNLI (acc) | 76.4 | 82.1 | 86.7 |

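On GLUE and SQuAD, the gap between BERT and GPT (same architecture family, different training objective) is larger than the gap between GPT and ELMo (different architecture, different training objective); on MNLI the two gaps are comparable (4.6 vs 5.7 points). This suggests that the bidirectional training objective matters at least as much as the specific architecture choice, with the caveat that BERT-Large is also about three times the size of GPT and was trained on more data.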
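The gaps can be read straight off the results table. The short script below only re-does that subtraction, using the table's own numbers:

```python
scores = {
    "GLUE":      {"ELMo": 68.6, "GPT": 72.8, "BERT-Large": 80.5},
    "SQuAD 1.1": {"ELMo": 85.8, "GPT": 89.0, "BERT-Large": 93.2},
    "MNLI":      {"ELMo": 76.4, "GPT": 82.1, "BERT-Large": 86.7},
}

gaps = {}
for benchmark, s in scores.items():
    bert_vs_gpt = round(s["BERT-Large"] - s["GPT"], 1)
    gpt_vs_elmo = round(s["GPT"] - s["ELMo"], 1)
    gaps[benchmark] = (bert_vs_gpt, gpt_vs_elmo)
    print(f"{benchmark}: BERT-GPT = {bert_vs_gpt}, GPT-ELMo = {gpt_vs_elmo}")
```

The BERT-GPT gap is the larger one on GLUE (7.7 vs 4.2) and SQuAD (4.2 vs 3.2), while on MNLI it is slightly smaller (4.6 vs 5.7).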

Smith makes a nuanced observation here: BERT's success doesn't prove that bidirectionality is always better. For understanding tasks (classification, QA, NLI), bidirectionality is clearly superior — the model needs to see the whole input. But for generation tasks (text completion, translation, summarization), left-to-right models have an inherent advantage because they can generate text autoregressively. This distinction between "understanding models" and "generation models" would persist for years, until GPT-3 showed that sufficiently large unidirectional models could do both.

python
# BERT fine-tuning: adding a task-specific head
from transformers import BertForSequenceClassification

# Sentiment classification example
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # positive / negative
)
# Architecture:
# Input → BERT encoder (12 layers) → [CLS] token → Linear(768, 2)
# Fine-tune ALL weights for 3 epochs with lr=2e-5

# The [CLS] token's representation captures the entire sentence
# because self-attention lets it attend to ALL other tokens
# bidirectionally — this is why BERT excels at classification
The BERT moment in NLP: BERT's release in October 2018 is often compared to the ImageNet moment in computer vision (2012). Before BERT, NLP researchers trained specialized models for each task. After BERT, they fine-tuned a single pre-trained model. This dramatically lowered the barrier to entry: a graduate student with a single GPU could achieve state-of-the-art results on most NLP benchmarks by fine-tuning BERT, whereas before it required extensive feature engineering and domain expertise.
Why did BERT outperform both ELMo and GPT despite having a similar number of parameters?

Chapter 5: Fine-Tuning vs Feature Extraction

Smith's paper carefully distinguishes two ways to use pre-trained contextual representations. This distinction, seemingly minor, has major practical implications.

Feature extraction (ELMo-style)

Run the pre-trained model on your input and extract the hidden states. Use these as fixed features — feed them into your own downstream model without modifying the pre-trained weights.

Pre-trained model
Frozen — weights don't change. Produces contextual vectors for each token.
↓ fixed features
Task-specific model
Trained from scratch. Receives pre-trained features as input.

Fine-tuning (GPT/BERT-style)

Initialize with pre-trained weights, add a thin task-specific layer, and train the entire model end-to-end with a small learning rate.

Pre-trained model
Initialized with pre-trained weights. All weights are updated during training.
↓ all gradients flow through
Task-specific head
Usually just a linear layer. Trained jointly with the pre-trained model.
The practical tradeoff: Feature extraction is cheaper (compute features once, cache them, train lightweight downstream models). Fine-tuning is more powerful (representations adapt to the task) but requires more compute and risks catastrophic forgetting. In practice, fine-tuning almost always wins — the performance gap is 1-5% on most benchmarks — which is why the field converged on it.

Let's make this concrete with numbers. For a sentiment classification task with 10K training examples:

python
# Feature extraction approach
# 1. Run BERT once on all 10K examples (fixed cost)
# 2. Cache the [CLS] vectors: 10K × 768 floats = 30 MB
# 3. Train a linear classifier on cached features
# Total GPU time: ~5 min (BERT inference) + ~1 min (classifier)
# Accuracy: ~91%

# Fine-tuning approach
# 1. Initialize BERT + linear head
# 2. Train end-to-end for 3 epochs
# Total GPU time: ~30 min (full backprop through BERT)
# Accuracy: ~93%

# The 2% gap seems small, but on leaderboards it's the
# difference between state-of-the-art and "also participated"

Smith also discusses an intermediate approach: gradual unfreezing. Instead of fine-tuning all layers at once, you start by training only the task head, then progressively unfreeze the pre-trained layers, beginning with the top layer (closest to the output) and working down toward the input. This reduces the risk of catastrophic forgetting while still allowing the representations to adapt. Howard and Ruder (2018) showed this approach works well in practice, though it adds hyperparameter complexity.
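One way to picture gradual unfreezing is as a schedule that says which layers receive gradient updates at each epoch. The sketch below assumes a simple "one more layer per epoch, top first" rule; the function name and the exact schedule are illustrative choices, not a prescription from the paper.

```python
def trainable_layers(epoch, num_layers=12):
    """Indices of encoder layers that receive gradient updates at `epoch`.

    Epoch 0 trains only the task head. Each later epoch unfreezes one more
    layer, starting from the top (index num_layers - 1) and moving down.
    """
    unfrozen = min(epoch, num_layers)
    return list(range(num_layers - unfrozen, num_layers))

for epoch in range(4):
    names = ["head"] + [f"layer_{i}" for i in trainable_layers(epoch)]
    print(f"epoch {epoch}: {names}")
```

In a real training loop, the returned indices would decide which parameter groups have gradients enabled; every layer not listed stays frozen for that epoch.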

A later approach that became widely used is adapter layers (Houlsby et al., 2019). Instead of fine-tuning all parameters, you insert small trainable "adapter" modules into each layer of the frozen pre-trained model. Only the adapters (typically 1-5% of total parameters) are trained. This gives close to fine-tuning-level performance with feature-extraction-level storage efficiency (one frozen model + many small adapters, instead of many full model copies).
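The storage argument is easy to check with arithmetic. The sketch below uses round illustrative figures (a 110M-parameter base model, about 2M adapter parameters and about 100K head parameters per task); these are assumptions for the example, not measurements.

```python
def stored_params(num_tasks, base=110_000_000, adapter=2_000_000, head=100_000):
    """Total parameters that must be stored to serve `num_tasks` tasks."""
    return {
        "feature extraction": base + num_tasks * head,        # shared model + heads
        "fine-tuning": num_tasks * (base + head),             # full copy per task
        "adapters": base + num_tasks * (adapter + head),      # shared model + adapters
    }

for name, count in stored_params(num_tasks=10).items():
    print(f"{name}: {count:,}")
```

With ten tasks, full fine-tuning stores about 1.1 billion parameters, while the adapter setup stores about 131 million, only modestly more than feature extraction's 111 million.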

python
# Three transfer learning approaches compared

# 1. Feature extraction (ELMo-style)
# Trainable params: ~100K (task head only)
# Storage: 1 shared model + 1 head per task
# Accuracy: 91%

# 2. Fine-tuning (BERT-style)
# Trainable params: 110M (all BERT + head)
# Storage: 1 full model copy per task
# Accuracy: 93%

# 3. Adapters (post-Smith, but natural evolution)
# Trainable params: ~2M (adapter layers only)
# Storage: 1 shared model + adapters per task
# Accuracy: 92.5%

# The field converged on fine-tuning for maximum performance,
# but adapters/LoRA became standard when storage is a concern

The distinction between feature extraction and fine-tuning also connects to a deeper question Smith raises: what do pre-trained representations actually encode? If fine-tuning works better, it suggests the pre-trained representations are good but not perfect for any specific task — they need to be "adjusted" to the target domain. If feature extraction works well, it suggests the representations are already task-ready. In practice, the answer is somewhere in between: pre-trained representations encode general linguistic knowledge that is useful for many tasks, but fine-tuning can specialize them for maximum performance on any specific task.

Smith provides a helpful analogy: feature extraction is like asking an expert for advice but not letting them see the specific problem. Fine-tuning is like hiring the expert to work on your specific problem full-time. The expert's general knowledge is the same, but the latter produces better results because the expert can adapt their approach to the specific situation.

When to use each

| Factor | Feature Extraction | Fine-Tuning |
| --- | --- | --- |
| Best when | Many tasks share one model, compute is limited | Maximum performance on one task |
| Training cost | Low (only task head trains) | Medium (all params update) |
| Data needed | Very little (100s of examples) | Little (1000s of examples) |
| Risk | Under-adaptation | Catastrophic forgetting |
| Storage | One model, many heads | Full model copy per task |

Fine-Tuning vs Feature Extraction

Compare the two transfer learning strategies. In feature extraction, the pre-trained weights are frozen (gray). In fine-tuning, all weights are updated (colored). Watch how performance changes with training steps for each approach.


Probing tasks: what do contextual representations know?

Smith discusses probing tasks — simple classifiers trained on frozen representations to test what linguistic information they encode. If a linear classifier can predict POS tags from layer 1 of BERT, then POS information is linearly decodable from that layer.

Probing studies revealed a striking pattern: contextual representations encode a great deal of linguistic structure, and that information is organized roughly hierarchically by layer depth. The representations aren't just useful; they contain a systematic encoding of linguistic knowledge:

| Linguistic Property | Where Encoded | Probe Accuracy |
| --- | --- | --- |
| Part-of-speech tags | Lower layers (1-4) | 97%+ |
| Dependency relations | Middle layers (4-8) | 90%+ |
| Semantic roles | Upper layers (6-10) | 85%+ |
| Coreference | Upper layers (8-12) | 80%+ |
| World knowledge | Distributed across layers | Varies widely |

The layer specialization finding is remarkable. Lower layers capture surface-level features (spelling, POS), middle layers capture syntactic structure (parse trees, dependencies), and upper layers capture task-relevant semantics (sentiment, entailment). This suggests that contextual representations build meaning progressively — from form to syntax to semantics — much like how linguistic theory organizes language into levels.

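python
# Probing experiment: what does each BERT layer know?
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()  # inference mode: dropout off

def probe_layer(layer_idx, task_data, task_labels):
    """Train a linear classifier on frozen BERT representations.

    Sentence-level sketch: one mean-pooled vector and one label per text.
    Token-level tasks (e.g. POS tagging) would probe each token vector instead.
    """
    features = []
    for text in task_data:
        tokens = tokenizer(text, return_tensors='pt')
        with torch.no_grad():
            outputs = model(**tokens)
        # Extract representations from specific layer (0 = input embeddings)
        h = outputs.hidden_states[layer_idx]  # [1, seq_len, 768]
        features.append(h.mean(dim=1).squeeze(0))  # average pool -> [768]
    X = torch.stack(features).numpy()
    # Train linear probe: features → task_labels
    probe = LogisticRegression(max_iter=1000).fit(X, task_labels)
    # Accuracy tells us how much info layer_idx encodes
    # (use held-out data rather than the training set in a real experiment)
    return probe.score(X, task_labels)

# Result: POS tagging peaks at layer 1-2
# Result: Dependency parsing peaks at layer 4-6
# Result: Sentiment peaks at layer 10-12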

Smith also discusses the probing controversy. Some researchers argue that high probing accuracy doesn't prove the model "knows" a linguistic property — it might just be an artifact of the probe's capacity. A sufficiently complex probe could extract any information from any representation, even random ones. The community addressed this by using minimal probes (linear classifiers), control tasks (random baselines), and information-theoretic analyses to validate probing results.
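The control-task idea can be sketched as follows, in the spirit of Hewitt and Liang (2019): assign every word type a fixed random label, train the same probe on those labels, and compare. If the probe does almost as well on the random labels, its accuracy says more about the probe's capacity to memorize than about the representation. The helper names and the accuracy values below are hypothetical.

```python
import random

def control_task_labels(words, num_labels, rng):
    """Give every word type one fixed random label, regardless of context."""
    label_of = {}
    labels = []
    for w in words:
        if w not in label_of:
            label_of[w] = rng.randrange(num_labels)
        labels.append(label_of[w])
    return labels

def selectivity(task_accuracy, control_accuracy):
    """Gap between real-task and control-task probe accuracy."""
    return task_accuracy - control_accuracy

words = ["the", "bank", "by", "the", "river", "bank"]
labels = control_task_labels(words, num_labels=5, rng=random.Random(0))
print(labels)  # repeated words ("the", "bank") always share a label

# Hypothetical probe accuracies, for illustration only.
print(round(selectivity(0.97, 0.90), 2))  # 0.07
```

A large gap suggests the probe is reading structure that is really in the representation; a small gap suggests the probe could have memorized the labels on its own.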

What is a "probing task" in the context of analyzing pre-trained representations?

Chapter 6: Representation Explorer

Let's bring the whole progression together in one interactive simulation. Here you can see how the same word gets represented differently across the three eras: static (Word2Vec/GloVe), shallow contextual (ELMo), and deep contextual (BERT).

This is the key experiment that validates the entire paradigm shift. If contextual representations really capture meaning-in-context, then the same word in different contexts should have different representations — and the degree of difference should correlate with how different the meanings are. Static embeddings give similarity of 1.0 (identical vectors, by definition). ELMo reduces this to ~0.6 (some differentiation). BERT reduces it further to ~0.4 (substantial differentiation). The progression is clear and quantifiable.

From Static to Contextual: Evolution Explorer

Select a polysemous word and two different contexts. Watch how each paradigm represents the word: static puts it in one place regardless of context; ELMo gives different vectors but from independent left/right processing; BERT produces deeply contextual vectors where all context interacts. The similarity scores show how well each paradigm distinguishes meanings.

The key quantitative finding: cosine similarity between the same word in different contexts:

| Model | "bank" (finance vs river) | "bat" (animal vs sport) | "spring" (season vs coil) |
| --- | --- | --- | --- |
| Static (GloVe) | 1.00 (identical) | 1.00 (identical) | 1.00 (identical) |
| ELMo | 0.65 | 0.58 | 0.72 |
| BERT layer 12 | 0.42 | 0.35 | 0.48 |

The trend is clear: As we move from static to shallow contextual to deep contextual, the same word in different contexts becomes more differentiated (lower cosine similarity). BERT layer 12 gives "bank-financial" and "bank-river" a similarity of only 0.42, so they're effectively different words at this point. This is exactly what we want: the representation captures meaning, not just form.
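For reference, cosine similarity itself is a short computation. The sketch below uses made-up 3-dimensional vectors purely to show the mechanics; they are not real GloVe or BERT vectors.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Static embedding: one vector per word type, so both contexts share it.
static_bank = [0.2, -0.5, 0.7]
print(cosine_similarity(static_bank, static_bank))  # 1.0 up to floating-point rounding

# Hypothetical contextual vectors for the two senses (made-up numbers).
bank_finance = [0.9, 0.1, 0.3]
bank_river = [0.1, 0.8, 0.4]
print(round(cosine_similarity(bank_finance, bank_river), 2))  # 0.34
```

The first result is 1.0 by construction, which is why every entry in the static row of the table is 1.00: there is only one vector to compare with itself.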

The geometry of contextual representations

Ethayarajh (2019) studied the geometry of contextual representations and found something surprising: as you go deeper into BERT, the representations become more anisotropic — they occupy a narrow cone in the high-dimensional space rather than being uniformly distributed. This means contextual representations aren't using the full capacity of the vector space. Later work (whitening, isotropy correction) addressed this geometric inefficiency.

The anisotropy finding has practical implications. If all representations cluster in a narrow cone, cosine similarity between any two words tends to be high (~0.5-0.7) regardless of meaning. This makes it harder to distinguish similar from dissimilar words using raw cosine similarity. Post-processing techniques like mean centering and whitening can mitigate this, pushing the representations to use more of the available space.
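The effect of mean centering can be seen on toy data. The sketch below builds vectors that share a large common offset (a crude stand-in for the "narrow cone") and compares the average pairwise cosine similarity before and after subtracting the mean vector. The data is synthetic and the helper names are illustrative.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pairwise_cosine(vectors):
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)

def mean_center(vectors):
    """Subtract the average vector from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in vectors]

rng = random.Random(0)
# Toy "anisotropic" vectors: a large shared offset plus small per-vector noise.
vectors = [[5.0 + rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]

before = mean_pairwise_cosine(vectors)
after = mean_pairwise_cosine(mean_center(vectors))
print(f"mean cosine before centering: {before:.2f}")
print(f"mean cosine after centering:  {after:.2f}")
```

Before centering, the shared offset dominates and every pair looks similar; after centering, only the per-vector differences remain, so the average similarity falls to around zero.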

Layer-by-layer contextualization: BERT's layer 0 representations (the input embeddings) are essentially static. They combine only token, position, and segment embeddings, so the same word in different contexts has very high similarity (~0.95 or more, and exactly 1.0 when the word sits at the same position in both sentences). Each successive layer adds more context, and by layer 12, the same word in different contexts can have very different representations (similarity as low as 0.3). This progressive contextualization is one of the most important findings from the probing literature, and it confirms that contextual representations aren't just "better embeddings": they're a fundamentally different kind of representation that builds meaning incrementally through layers.
python
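# Comparing representations across eras
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model.eval()  # inference mode: dropout off

def get_word_vec(sentence, word, layer):
    """Return the hidden state of `word` at `layer` (0 = input embeddings)."""
    tokens = tokenizer(sentence, return_tensors='pt')
    # Locate the word among the WordPiece tokens ([CLS] is index 0);
    # assumes `word` is a single token in the vocabulary, as "bank" is
    pieces = tokenizer.convert_ids_to_tokens(tokens['input_ids'][0].tolist())
    word_idx = pieces.index(word)
    with torch.no_grad():
        outputs = model(**tokens)
    return outputs.hidden_states[layer][0, word_idx]

# "bank" in financial vs river context
v1 = get_word_vec("I deposited money at the bank", "bank", 12)
v2 = get_word_vec("I sat on the river bank", "bank", 12)
sim = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
print(f"BERT layer 12 similarity: {sim:.3f}")  # below 1.0: the two senses get different vectors

# Layer 0 (input embeddings: token + position, no context mixed in yet)
v1_0 = get_word_vec("I deposited money at the bank", "bank", 0)
v2_0 = get_word_vec("I sat on the river bank", "bank", 0)
sim_0 = torch.nn.functional.cosine_similarity(v1_0, v2_0, dim=0)
print(f"BERT layer 0 similarity: {sim_0:.3f}")  # ≈1.0 when "bank" sits at the same position in both sentences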
As you go from static embeddings to BERT's upper layers, what happens to the cosine similarity between the same word in different contexts (e.g., "bank" in financial vs river contexts)?

Chapter 7: Connections

Smith's survey captured a moment of fundamental transition in NLP. The shift from static to contextual representations was not just a technical improvement — it changed what we mean by "word meaning" in computational linguistics.

The five-year revolution

The shift from static to contextual representations happened astonishingly fast — just five years from Word2Vec (2013) to BERT (2018). In that time, the field went from "words as fixed points" to "words as functions of context," fundamentally changing how NLP systems represent meaning.

The progression

| Era | Model | Key Idea | Representation |
| --- | --- | --- | --- |
| 2013 | Word2Vec | Predict context from words | One vector per word type |
| 2014 | GloVe | Factorize co-occurrence matrix | One vector per word type |
| 2018 | ELMo | Bidirectional LSTM LM | One vector per word token (shallow bidir) |
| 2018 | GPT | Transformer LM + fine-tune | One vector per word token (unidirectional) |
| 2018 | BERT | Masked LM + fine-tune | One vector per word token (deep bidir) |

What came next (beyond Smith's survey)

Smith's paper was written in early 2019, just months after BERT's release. The field was already moving fast. Here's how the landscape evolved in the years that followed:

| Development | Contribution |
| --- | --- |
| RoBERTa (2019) | Showed BERT was undertrained: more data, longer training, no NSP improved results |
| GPT-2/3 (2019/2020) | Showed that scale (1.5B → 175B params) makes unidirectional models competitive with bidirectional |
| T5 (2020) | Unified all NLP as text-to-text with an encoder-decoder Transformer |
| Llama 3 (2024) | Open-weight models at 8-405B scale with 15T tokens of training data |
| BERT → ChatGPT (2022) | The contextual representation paradigm, scaled to 175B+ parameters + RLHF, enables conversational AI |

The most surprising development after Smith's paper was the emergence of in-context learning in GPT-3 (Brown et al., 2020). It turned out that sufficiently large unidirectional models (175B parameters) could perform tasks without any fine-tuning at all — just by seeing a few examples in the prompt. This is contextual representation taken to the extreme: the entire prompt (including task examples) becomes the "context" that shapes the model's representations and outputs.

python
# The evolution of "how to use pre-trained representations"
# 2018 (ELMo): Freeze model, extract features
#   embed = elmo(text)  # fixed features
#   classifier = train_new_model(embed, labels)

# 2018 (BERT): Fine-tune entire model
#   model = BERT + linear_head
#   model = train(model, task_data, lr=2e-5)

# 2020 (GPT-3): No training at all — in-context learning
#   prompt = "Classify sentiment:\n"
#          + "Great movie! → positive\n"
#          + "Terrible plot → negative\n"
#          + "Amazing cast → "
#   answer = gpt3.generate(prompt)  # "positive"

# 2023 (ChatGPT/Claude): Natural conversation
#   "Is 'The Godfather' a good movie?" → detailed review

Limitations Smith identified

Smith was prescient about several limitations that remain relevant even in 2024:

1. Computational cost. Contextual representations require running a large neural network on every input, making them orders of magnitude more expensive than static lookups. BERT-Base requires ~110M multiply-adds per token vs a single 300-dimensional lookup for GloVe.
2. Interpretability. Static embeddings are relatively interpretable — "king - man + woman = queen" is intuitive. Contextual representations in a 768-dimensional space at layer 8 of a 12-layer Transformer are much harder to understand, despite probing studies making progress.
3. The evaluation gap. Pre-training datasets are enormous and diverse, while evaluation benchmarks are small and narrow. We don't really know what contextual representations know — we can only test the few things we think to ask.
4. Bias and fairness. Pre-trained representations inherit biases from their training data. If "doctor" appears more often with "he" and "nurse" with "she" in the training corpus, the embeddings encode this gender bias. Static embeddings had this problem (Bolukbasi et al., 2016), and contextual embeddings may amplify it because they learn from even more data. Smith notes that debiasing contextual representations is an open challenge — and it remains so today.

Each of these limitations has spawned entire research subfields. Efficient Transformers (Tay et al., 2020) address the compute cost. BERTology (Rogers et al., 2020) addresses interpretability. Dynamic benchmarks (e.g., Dynabench) address the evaluation gap. Fairness-aware pre-training addresses bias. Smith's paper correctly identified the frontier of research that would occupy the field for years to come.

The philosophical shift

Perhaps the deepest contribution of Smith's paper is articulating the philosophical shift. In the static embedding era, meaning was treated as a property of word types — "bank" has meaning X, regardless of context. In the contextual era, meaning is treated as a property of word tokens — "bank" in "river bank" has meaning Y, while "bank" in "money bank" has meaning Z. This is a fundamental shift in how computational linguistics models language, and it aligns with how linguists have always understood meaning: you can't define a word's meaning without considering its context of use.
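The type/token distinction can be shown with a deliberately tiny example. The "contextual" function below just averages static vectors in a one-word window; it is a toy stand-in to show that the output depends on the sentence, not a model of how ELMo or BERT compute their representations. All vectors and names are made up.

```python
# Toy 2-dimensional static embeddings (made-up values).
static = {
    "river": [0.0, 1.0],
    "money": [1.0, 0.0],
    "bank":  [0.5, 0.5],
    "the":   [0.0, 0.0],
}

def type_level(sentence, i):
    """Type-level: look the word up and ignore everything around it."""
    return static[sentence[i]]

def token_level(sentence, i, window=1):
    """Toy token-level vector: average the static vectors in a small window."""
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    neighbours = [static[w] for w in sentence[lo:hi]]
    dim = len(neighbours[0])
    return [sum(v[d] for v in neighbours) / len(neighbours) for d in range(dim)]

s1 = ["the", "river", "bank"]
s2 = ["the", "money", "bank"]

print(type_level(s1, 2) == type_level(s2, 2))   # True: one vector per word type
print(token_level(s1, 2))                       # [0.25, 0.75]: pulled toward "river"
print(token_level(s2, 2))                       # [0.75, 0.25]: pulled toward "money"
```

The type-level lookup is a function of the word alone, while the token-level function takes the whole sentence and a position, which is the change of signature the paragraph above describes.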

The field has since moved far beyond the models Smith surveyed, but the fundamental insight of his paper remains: meaning is contextual. A word's meaning is not fixed; it's constructed in real-time from the words around it. This shift from type-level to token-level representations was the conceptual foundation for everything that followed — GPT-3, ChatGPT, Claude, and the entire large language model revolution.

Looking back from 2024, it's clear that Smith identified the right paradigm shift at the right moment. The field's trajectory since then — ever-larger pre-trained models, ever more sophisticated fine-tuning techniques, and the emergence of in-context learning — all build on the foundation of contextual representations. GPT-3's ability to do "few-shot learning" (give it a few examples in the prompt) is essentially the contextual representation paradigm taken to its extreme: the model's representation of each word depends on the entire prompt, including the task examples.

The paper also presciently warned about the environmental and social costs of large-scale pre-training, a concern that would become much more prominent in subsequent years with the work of Bender et al. (2021) and others. The tension between model capability and compute cost remains one of the central challenges in modern NLP.

python
# The full progression: from one-hot to contextual
# 1990s: one-hot encoding
#   "cat" = [0, 0, 0, ..., 1, ..., 0]  # V-dimensional sparse
#   similarity("cat", "dog") = 0  # no notion of similarity

# 2013: Word2Vec/GloVe (static embeddings)
#   "cat" = [0.23, -0.11, 0.87, ...]  # 300-dimensional dense
#   similarity("cat", "dog") = 0.76  # captures semantics
#   but: "bank" always the same vector, regardless of context

# 2018: ELMo (shallow contextual)
#   "bank" in "river bank" ≠ "bank" in "money bank"
#   similarity between them: 0.65 (partially differentiated)

# 2018: BERT (deep contextual)
#   "bank" representations fully context-dependent
#   similarity between meanings: 0.42 (well differentiated)
#   Each layer adds more context → more differentiation
The ultimate irony: Smith's paper made the case for encoder-based bidirectional models (BERT) over decoder-based unidirectional models (GPT). Within two years, GPT-3 showed that scaling unidirectional models to 175B parameters made them competitive with BERT on understanding tasks — while also being able to generate text. The field eventually converged on decoder-only architectures (GPT-4, Claude, Llama) that, despite being "unidirectional," are powerful enough to compensate through sheer scale. The bidirectionality advantage that Smith emphasized turned out to be real but surmountable.

"You shall know a word by the company it keeps."
— John Rupert Firth, 1957

What is the single most important conceptual shift that Smith's paper documents?