From counting word pairs to networks with memory — and why they break on long sentences.
Your phone keyboard just predicted "you" after "thank." It assigned probabilities to every word in its vocabulary and picked the highest ones to show you. That prediction engine is a language model.
A language model answers one question: given the words so far, what comes next? More precisely, it computes a probability distribution over the entire vocabulary for the next word. "Thank" is followed by "you" with probability 0.62, "goodness" with probability 0.08, "god" with probability 0.05, and so on through every word in the dictionary.
Why does this matter? Because nearly every NLP application you've ever used — translation, summarization, chatbots, code completion, search autocomplete — is, at its core, a language model. Google Translate generates translations one word at a time, each time asking "what's the most probable next word in the target language given the source sentence and the words generated so far?" ChatGPT does the same thing: it produces one token at a time, sampling from a probability distribution.
Even evaluating text quality uses language models. A well-formed English sentence should have high probability under a good language model. "The cat sat on the mat" should score higher than "The cat sat on the the the the." This is how spam filters, grammar checkers, and speech recognition systems rank candidate sentences.
Formally, a language model assigns a probability to a sequence of words:
By the chain rule of probability, we can decompose this into a product of conditional probabilities:
This is exact — no approximation. The entire challenge of language modeling is computing those conditional probabilities well. How do you estimate P("mat" | "the cat sat on the")? That's the question every model in this lesson tries to answer.
Let's compute the probability of a tiny sentence: "the cat sat." Using the chain rule:
If P("the") = 0.07 (the most common English word), P("cat" | "the") = 0.002, and P("sat" | "the cat") = 0.05, then:
That's tiny — and this is only three words. For a 20-word sentence, you're multiplying 20 small numbers together. The result is astronomically small. This is why we work with log probabilities in practice: log(0.000007) = −11.87. Sums of logs are much more numerically stable than products of tiny numbers.
The full conditional P(wi | w1, ..., wi−1) depends on all previous words. For a sentence with a vocabulary of 50,000 words, at position 10, we'd need to estimate a distribution conditioned on all possible 9-word histories. That's 50,0009 possible histories — far more than any corpus could cover. We need approximations.
Let's see this in action. Below is a simulated phone keyboard. A partial sentence is shown at the top, and a bar chart displays the probabilities of the top candidate next words. Click a word to extend the sentence and watch the distribution shift.
Click a word to extend the sentence. The probability bars update based on context. Click "Reset" to start over.
Notice what happened. Every time you picked a word, the entire distribution shifted. After "I" the model strongly favored "am," "think," "want." After "I want" it shifted to "to," "a," "the." The context determines the distribution. A language model with more context and better modeling of that context will make better predictions.
This lesson walks through three generations of language model: counting (n-grams), neural networks with fixed windows, and neural networks with memory (RNNs). Each solves a problem the previous one couldn't handle. By the end, you'll understand why RNNs were revolutionary — and why they were replaced.
The simplest language model just counts. How often does "the" follow "in"? Count it up, divide by how many times "in" appears total. That ratio is your probability estimate. This is a bigram model — it predicts each word from the one word before it.
The key assumption is the Markov assumption: the probability of a word depends only on the previous n−1 words, not the entire history. For a bigram (n=2), the next word depends only on the current word. For a trigram (n=3), it depends on the previous two words.
For a bigram model, this simplifies to:
This is just the fraction of times wi follows wi−1 in the training corpus. If "the" appears 10,000 times in your data, and "the cat" appears 200 times, then P(cat | the) = 200/10,000 = 0.02. Simple counting. No neural networks, no parameters to learn — just a big lookup table.
The full sentence probability under a bigram model decomposes as:
For a trigram model (n=3), each word depends on the two previous words: P(wi | wi−2, wi−1). The context grows, predictions improve, but the sparsity problem grows much faster.
Let's make this concrete. Suppose our entire corpus is four sentences:
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog ate the bone
We count every pair of adjacent words. "the cat" appears 2 times. "the dog" appears 2 times. "the mat" appears 2 times. "the fish" appears 1 time. "the bone" appears 1 time. "the" appears 10 times total (as the first word of a pair). So P(cat | the) = 2/10 = 0.20, P(dog | the) = 2/10 = 0.20, P(mat | the) = 2/10 = 0.20, and so on.
Let's verify the probabilities sum to 1. P(cat|the) + P(dog|the) + P(mat|the) + P(fish|the) + P(bone|the) = 2/10 + 2/10 + 2/10 + 1/10 + 1/10 = 8/10 = 0.80. Wait — that's not 1.0. What's missing? The remaining 0.20 is spread across words that follow "the" in other positions: "sat" never directly follows "the" in this corpus, but "the" also starts the sentences, giving us "the cat" and "the dog." Looking more carefully, "the" appears as the first word of a bigram 10 times: "the cat" (2), "the mat" (2), "the dog" (2), "the fish" (1), "the bone" (1), and we need one more. In fact "on the" produces the remaining bigrams starting with "the." The exact accounting depends on sentence boundaries, but the principle holds: count, divide, get probabilities.
Now let's compute a sentence probability. Using our bigram model:
For "the fish ate": P(the|<s>) × P(fish|the) × P(ate|fish) = 1.0 × 0.10 × 0.50 = 0.05. Lower probability — "fish" follows "the" less often than "cat" does. The model captured a real (if crude) pattern from the data.
The widget below builds this bigram table step by step. Click "Next" to advance through the corpus word by word and watch the count table grow. Then click "Generate" to sample a sentence from the learned distribution.
Click "Next" to advance through the corpus. Watch the bigram count table build. Click "Generate" to sample a new sentence from the learned probabilities.
N-gram models have a fatal flaw: sparsity. If your training corpus never contains the phrase "quantum entanglement," then P(entanglement | quantum) = 0. Zero. And because we multiply probabilities across the sentence, one zero kills the entire sentence probability — it collapses to zero.
This isn't just a theoretical problem. Consider estimating P(food | "the chef prepared the") using a 5-gram model. You'd need to find the exact sequence "the chef prepared the" in your corpus. Even a billion-word corpus might not contain that exact 4-word prefix. As n grows, the number of possible n-grams grows as |V|n, where |V| is vocabulary size. With a 50,000-word vocabulary, the number of possible 5-grams is 50,0005 = 3.125 × 1023. No corpus can cover that space.
Partial fixes exist — smoothing (add a small count to every n-gram, including unseen ones) and backoff (if the 5-gram count is zero, fall back to the 4-gram, then 3-gram, etc.). But these are patches. They don't solve the fundamental issue: similar words get no shared credit. If the model has seen "the dog ran fast" a thousand times, it learns nothing about "the puppy sprinted quickly." The words are completely different strings, and counting treats them as unrelated.
There's a practical problem too. An n-gram model stores explicit counts for every observed n-gram. Google's n-gram dataset from 2006 contained 1 trillion words. The number of unique 5-grams in that dataset: 1.18 billion. Storing all of them with counts requires gigabytes of memory. And this is still sparse — the vast majority of possible 5-grams were never observed.
| n | Name | Context Window | Unique n-grams (Google 1T) |
|---|---|---|---|
| 1 | Unigram | 0 words | 13 million |
| 2 | Bigram | 1 word | 314 million |
| 3 | Trigram | 2 words | 977 million |
| 4 | 4-gram | 3 words | 1.31 billion |
| 5 | 5-gram | 4 words | 1.18 billion |
And even with trillions of words, most reasonable 5-grams are still unseen. The fundamental issue is that language is compositional — a small vocabulary combines into an astronomical number of valid sequences. Counting can never cover that space. We need a model that can generalize.
The simplest smoothing method is Laplace smoothing (add-1): add 1 to every count, including unseen n-grams. This guarantees no probability is ever zero.
But this is too aggressive — it steals far too much probability mass from observed n-grams and spreads it over the astronomically large number of unseen ones. A seen bigram that appeared 100 times might have its probability cut in half.
Kneser-Ney smoothing is much better. Instead of adding a fixed count, it subtracts a small discount d (typically 0.75) from each observed count and distributes that mass using a clever continuation probability. Kneser-Ney was the gold standard for n-gram models from 1995 until neural models took over in the 2010s.
But no smoothing method can solve the fundamental problem: if the model has never seen any context similar to "the quantum computer calculated," it has no useful signal. All smoothed probabilities for unseen contexts are essentially uniform — the model is guessing randomly. We need representations that encode similarity.
python from collections import Counter, defaultdict class BigramLM: def __init__(self): self.bigram_counts = defaultdict(Counter) self.unigram_counts = Counter() def train(self, corpus): # corpus: list of sentences, each a list of words for sentence in corpus: for i in range(len(sentence) - 1): w1, w2 = sentence[i], sentence[i+1] self.bigram_counts[w1][w2] += 1 self.unigram_counts[w1] += 1 def prob(self, w2, w1, alpha=0.01): # P(w2 | w1) with add-alpha smoothing count = self.bigram_counts[w1][w2] total = self.unigram_counts[w1] vocab_size = len(self.unigram_counts) return (count + alpha) / (total + alpha * vocab_size)
What if "cat" and "kitten" being similar words could help predictions? If the model has seen "the cat sat on the mat," it should give some probability to "the kitten sat on the mat" even without ever seeing that exact phrase. N-gram models can't do this — "cat" and "kitten" are just different strings. But a neural network can map words to continuous vectors where similar words live nearby, and then operate on those vectors instead of discrete strings.
This is exactly what Bengio et al. proposed in 2003 in a landmark paper titled "A Neural Probabilistic Language Model." The idea: replace the giant lookup table of n-gram counts with a neural network that takes a fixed window of previous words, looks up their embeddings (dense vectors of ~100-300 dimensions), and predicts the next word.
Here's how the fixed-window neural language model works, step by step:
The magic is in step 1. The embedding matrix E is a [|V| × d] matrix where each row is a word's vector. "Cat" and "kitten" end up with similar vectors because they appear in similar contexts during training. So any pattern learned about "cat" automatically applies to "kitten" — the network has generalized through the embedding space.
Below is an interactive diagram of this architecture. Hover over each component to see its tensor shape and purpose. The 4-word window feeds into embedding lookup, concatenation, a hidden layer, and finally softmax output.
Hover over or tap each component to see its tensor shape and role. The input is a 4-word window; the output is a probability distribution over the vocabulary.
| Property | N-gram Model | Neural LM (Bengio 2003) |
|---|---|---|
| Representation | Words are discrete symbols | Words are continuous vectors |
| Generalization | None — each n-gram independent | Similar words share predictions |
| Unseen n-grams | Zero probability (without smoothing) | Non-zero — embeddings interpolate |
| Context size | n−1 words (fixed, typically 3-5) | n−1 words (fixed, typically 4-8) |
| Parameters | Counts table (sparse, huge) | Dense matrices (~1-10M params) |
| Training | Just counting | Backpropagation + gradient descent |
But there's a catch. The neural LM still uses a fixed window. If your window is 4 words, you can't see the 5th word back. Want more context? Make the window bigger. But each additional word adds d parameters to the concatenated vector, which means a larger weight matrix for the hidden layer. A 10-word window with 300-dimensional embeddings creates a concatenated vector of 3,000 dimensions. The hidden layer weight matrix is [3000 × h] — that's a lot of parameters and a lot of computation.
Worse, the model treats position 1 and position 4 completely differently. The weights applied to the first word in the window are entirely separate from the weights applied to the fourth word. There's no parameter sharing across positions. The model learns "what does word-in-position-1 mean for the prediction" and "what does word-in-position-4 mean for the prediction" as separate things, even though both are just "a word in the context."
Let's trace the exact dimensions for a concrete model. Suppose vocab_size = 10,000, embed_dim = 128, context_len = 4, hidden_dim = 256.
| Component | Input Shape | Weight Shape | Output Shape |
|---|---|---|---|
| Embedding lookup | [batch, 4] (integers) | E: [10000, 128] | [batch, 4, 128] |
| Concatenation | [batch, 4, 128] | — | [batch, 512] |
| Hidden layer | [batch, 512] | W1: [512, 256] | [batch, 256] |
| Output + softmax | [batch, 256] | W2: [256, 10000] | [batch, 10000] |
Total trainable parameters: 10,000 × 128 (embeddings) + 512 × 256 (hidden) + 256 × 10,000 (output) ≈ 3.97M parameters. The output layer alone is 2.56M — the most expensive part, because it maps from hidden_dim to the full vocabulary. This "softmax bottleneck" is a recurring theme: the output layer dominates parameter count for language models with large vocabularies.
Later, weight tying (sharing the embedding matrix with the output matrix) and adaptive softmax (hierarchical output for rare words) became standard tricks to reduce this cost. We'll see weight tying again in the LSTM chapter.
python import torch import torch.nn as nn class NeuralLM(nn.Module): def __init__(self, vocab_size, embed_dim, hidden_dim, context_len): super().__init__() self.embeddings = nn.Embedding(vocab_size, embed_dim) # [|V|, d] self.hidden = nn.Linear(context_len * embed_dim, hidden_dim) # [n*d, h] self.output = nn.Linear(hidden_dim, vocab_size) # [h, |V|] def forward(self, x): # x: [batch, context_len] — integer word indices e = self.embeddings(x) # [batch, context_len, embed_dim] e = e.view(e.size(0), -1) # [batch, context_len * embed_dim] — concat h = torch.tanh(self.hidden(e)) # [batch, hidden_dim] logits = self.output(h) # [batch, vocab_size] return logits # apply softmax for probabilities
Notice how context_len is baked into the hidden layer's input dimension. Changing the context length means rebuilding the entire model. We need an architecture where context length is variable.
Bengio's 2003 paper was a watershed moment. Before it, NLP was entirely count-based. The paper showed three things:
First, that word embeddings — dense vectors learned from data — could capture semantic similarity automatically. Words appearing in similar contexts got similar vectors, without anyone telling the model that "cat" and "kitten" are related.
Second, that a neural network could outperform n-gram models on perplexity, the standard language model metric. This was controversial at the time — many researchers believed that decades of engineering on smoothing and backoff had brought count-based models to near-optimal performance. Bengio showed there was a lot more room for improvement.
Third, and most importantly, it proved that neural language models could generalize to unseen word combinations. If the model learned that "the dog ran in the park" is probable, it automatically assigned some probability to "the cat ran in the park" because "dog" and "cat" have similar embeddings. N-gram models could never do this.
What if the network had memory? Instead of cramming the last n words into a fixed-size window, it reads one word at a time and maintains an internal state — a running summary of everything it has seen so far. That's the core idea of a recurrent neural network (RNN).
At each timestep t, the RNN takes two inputs: the current word xt (as an embedding vector) and the previous hidden state ht−1 (the memory). It combines them to produce a new hidden state ht, which serves as the memory for the next timestep. The hidden state is initialized to zeros at the start of a sequence.
This is the Elman network, proposed by Jeff Elman in 1990. It's the simplest possible recurrent architecture, and it remained the default for 7 years until LSTMs replaced it in 1997. The key insight: by feeding the hidden state back into itself, the network can theoretically remember information from arbitrarily far in the past. The word "theoretically" is doing a lot of work here — we'll see why in Chapter 5.
Let's unpack each piece:
Wh is the hidden-to-hidden weight matrix [dh × dh]. It determines how the previous memory is transformed. Think of it as the network's "memory filter" — which parts of the old memory matter for the new state.
Wx is the input-to-hidden weight matrix [dh × dx]. It determines how the current word influences the new state. Think of it as the "new information integrator."
b is a bias vector [dh], and tanh squashes the result to [−1, 1], keeping values bounded.
To predict the next word, we pass ht through a linear layer and softmax:
Here ŷt is a probability distribution over the entire vocabulary. We pick the word with the highest probability (or sample from the distribution).
An RNN is often drawn as a loop, but it's easier to understand when unrolled across time. Each timestep is a copy of the same network with the same weights, connected by the hidden state flowing left to right.
Below, click "Next Word" to advance through a sentence one word at a time. Watch the hidden state (shown as a heatmap strip) update at each step. The arrow from the previous h to the current h shows the recurrence — that's what gives the RNN its memory.
Click "Next Word" to advance. The hidden state heatmap shows how the internal representation evolves. The recurrent arrow carries memory from each step to the next.
The hidden state ht is the RNN's compressed representation of the entire sequence so far. After processing "the cat sat on the," h5 contains some encoding of all five words — enough information (we hope) to predict that "mat" is likely next.
But it's a fixed-size vector. Whether the sentence is 5 words or 50 words, ht is always the same dimensionality (typically 128–1024). The RNN must compress everything into this bottleneck. Short sentences? Easy. Long sentences? The early words get overwritten as new words push in. We'll see exactly why this is a problem in Chapter 5.
Think of the hidden state as a whiteboard. At each timestep, the RNN reads the current word and updates the whiteboard. But the whiteboard has a fixed size. Early on, there's plenty of room. After 50 words, the whiteboard is full and every new word requires erasing something old. The question is: does the RNN erase the right things? Vanilla RNNs are bad at this — they tend to overwrite important early information with less important recent information. LSTMs fix this with explicit gates that control what gets erased.
To use an RNN as a language model, we feed it one word at a time and ask it to predict the next word at each step. The loss is computed at every timestep:
Where ht is the hidden state after processing w1, ..., wt, and P(wt+1 | ht) comes from the softmax output layer. This is the standard language model training objective — minimize cross-entropy between predicted and true next words.
python import torch import torch.nn as nn class SimpleRNN(nn.Module): def __init__(self, vocab_size, embed_dim, hidden_dim): super().__init__() self.embed = nn.Embedding(vocab_size, embed_dim) # [|V|, d_x] self.W_h = nn.Linear(hidden_dim, hidden_dim) # h_{t-1} → h_t self.W_x = nn.Linear(embed_dim, hidden_dim) # x_t → h_t self.W_y = nn.Linear(hidden_dim, vocab_size) # h_t → logits self.hidden_dim = hidden_dim def forward(self, x, h=None): # x: [batch, seq_len] — integer indices batch = x.size(0) if h is None: h = torch.zeros(batch, self.hidden_dim) # h_0 = 0 outputs = [] for t in range(x.size(1)): # loop over timesteps x_t = self.embed(x[:, t]) # [batch, d_x] h = torch.tanh(self.W_h(h) + self.W_x(x_t)) # [batch, d_h] outputs.append(h) h_stack = torch.stack(outputs, dim=1) # [batch, seq_len, d_h] logits = self.W_y(h_stack) # [batch, seq_len, |V|] return logits, h
Key observation: the for loop over timesteps is sequential. Step t depends on step t−1. Unlike a feedforward network where all computations can be parallelized, the RNN must process the sequence in order. This sequential bottleneck is one reason RNNs were eventually replaced by Transformers (which process all positions in parallel).
Let's trace an RNN with hidden_dim = 3 and embed_dim = 2 through two timesteps. This is artificially tiny, but it makes the math concrete.
Suppose our word embeddings are: "the" = [0.5, −0.3], "cat" = [0.8, 0.2]. And our weight matrices are:
Step 1 (x = "the"): h0 = [0, 0, 0] (initialized to zeros).
Step 2 (x = "cat"): Now h1 feeds back in through Wh.
Notice how h2 is different from h1 — it encodes information from both "the" and "cat." The contribution of "the" is indirect, filtered through Wh. After many more timesteps, the influence of "the" would be filtered through Wh many times — which is exactly why gradients vanish.
A single-layer RNN computes one hidden state per timestep. A stacked RNN feeds the output of layer 1 into layer 2, and so on. Each layer learns increasingly abstract representations.
Layer 1 might learn surface-level patterns (word boundaries, part of speech). Layer 2 might learn syntactic patterns (phrase structure, subject-verb agreement). In practice, 2–3 layers work best for RNN language models. More layers make vanishing gradients worse (gradients must flow through both time AND depth), so residual connections are often added between layers.
The best RNN language model (AWD-LSTM, Merity et al. 2018) used 3 LSTM layers with 1150 hidden units each, plus dropout between layers, weight tying (sharing the embedding and output weight matrices), and weight dropout within the LSTM. These regularization tricks were essential — without them, deep RNNs overfit badly.
| Advantage | Limitation |
|---|---|
| Can handle any input length | Sequential processing — can't parallelize across timesteps |
| Parameter sharing across time | Fixed-size hidden state is a bottleneck for long sequences |
| Captures word order (unlike bag-of-words) | Vanishing gradients limit effective memory to ~5-10 steps |
| Simple architecture, easy to implement | Slow training on long sequences (no GPU parallelism in time) |
How do you train a network that recycles the same weights at every step? You unroll it across time and apply backpropagation through the entire chain. This is called Backpropagation Through Time (BPTT).
At each timestep t, the RNN produces a prediction ŷt and incurs a loss (typically cross-entropy between the predicted distribution and the true next word). The total loss is the sum (or average) across all timesteps:
This is the negative log-likelihood of the training data under the model — the standard language model training objective. Lower is better. A perfect model that assigns probability 1.0 to every true next word achieves loss 0.
Step 4 is the crucial insight. Because Wh appears at every timestep, we compute ∂L/∂Wh by summing the contributions from every single timestep:
Each term ∂Lt/∂Wh requires backpropagating through all timesteps from t back to 1. The gradient from the loss at step T must travel backward through T matrix multiplications to reach step 1. For a 100-word sentence, that's 99 multiplications by the Jacobian ∂ht/∂ht−1.
Compare this to a feedforward network, where each weight matrix appears exactly once. In a feedforward net with L layers, the gradient passes through L matrix multiplications. In an RNN processing T timesteps, the gradient passes through T multiplications — and T can be much larger than L. A typical feedforward net has 3–10 layers. An RNN might process 100+ timesteps. That's an order-of-magnitude more multiplications for the gradient to survive.
Below, watch the forward and backward passes animate through an unrolled RNN. Click "Forward" to see activations flow left-to-right (teal). Click "Backward" to see gradients flow right-to-left (orange). Notice how the gradient intensity fades as it travels further back in time.
Click "Forward" to animate the forward pass (teal, left-to-right). Click "Backward" to animate the backward pass (orange, right-to-left). Watch how gradient strength fades at earlier timesteps.
In practice, documents can be thousands of words long. Full BPTT through the entire document would require storing all hidden states (memory) and multiplying gradients through thousands of steps (vanishing/exploding gradients). The solution: truncated BPTT. Break the document into chunks of ~35 words. Run forward through a chunk. Backpropagate only through that chunk. But carry the hidden state h forward to the next chunk (without backpropagating through it). This means the hidden state "sees" the full history, but gradients only flow back ~35 steps.
Why 35? It's a tradeoff. Shorter chunks mean less memory usage and faster backward passes, but the model can only learn dependencies within the chunk length. Longer chunks allow learning longer dependencies but use more memory (you must store all intermediate hidden states for backprop). The value 35 was found empirically to work well for word-level language models on standard benchmarks.
This is a compromise: the model's forward pass has unlimited context (the hidden state carries information from the beginning of the document), but the backward pass is limited to chunk-length dependencies. The model can use long-range information but has difficulty learning long-range dependencies from scratch.
Full BPTT requires storing the hidden state at every timestep for the backward pass. For a sequence of length T with hidden_dim d and batch_size B:
For T=1000, B=64, d=512: that's 1000 × 64 × 512 × 4 = 131 MB just for hidden states. Plus the embeddings, gradients, and optimizer states. Truncated BPTT with chunk size 35 reduces this by ~30x to a manageable 4.6 MB. This is why truncated BPTT is not optional for real training — it's a practical necessity.
python def train_step(model, sequence, optimizer): # sequence: [batch, seq_len+1] — input is seq[:-1], target is seq[1:] inputs = sequence[:, :-1] # [batch, T] targets = sequence[:, 1:] # [batch, T] optimizer.zero_grad() # Forward: RNN processes all T timesteps logits, _ = model(inputs) # [batch, T, |V|] # Loss: cross-entropy at each position, averaged loss = F.cross_entropy( logits.view(-1, logits.size(-1)), # [batch*T, |V|] targets.view(-1) # [batch*T] ) # Backward: PyTorch unrolls the graph and backprops through time loss.backward() # Clip gradients to prevent explosions torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0) optimizer.step() return loss.item()
The loss.backward() call does all the heavy lifting. PyTorch recorded every operation during the forward pass. Now it unrolls that computation graph backward, computing gradients at every timestep and accumulating them for the shared weights Wh and Wx. This is BPTT — the for loop over timesteps during the forward pass becomes a reversed for loop during backward.
How do we know if a language model is good? The standard metric is perplexity (PPL):
Perplexity is the exponentiated average cross-entropy loss. Intuitively, it measures how "surprised" the model is by the test data. A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 words at each position. Lower is better. A perfect model would have perplexity 1 (no surprise at all). State-of-the-art RNN language models in 2018 achieved perplexity ~60 on the Penn Treebank benchmark. Modern Transformers achieve ~15–20.
Suppose our model processes the sentence "the cat sat" and assigns these probabilities to each word:
| Position | True word | Model probability | −log(p) |
|---|---|---|---|
| 1 | "the" | 0.07 | 2.66 |
| 2 | "cat" | 0.002 | 6.22 |
| 3 | "sat" | 0.05 | 3.00 |
The model is as uncertain as if it were choosing uniformly among ~53 words. Not terrible for a small model, but far from the human level of ~10 (humans are good at predicting the next word). A well-trained LSTM achieves perplexity ~60 on PTB, meaning it effectively narrows down each word to ~60 candidates. GPT-4 likely achieves perplexity <10 on most English text.
An important note: perplexity is only comparable across models evaluated on the same test set with the same vocabulary. A model with a 10K vocabulary will have lower perplexity than one with 50K simply because it has fewer options at each step. Always report vocabulary size alongside perplexity numbers.
Another common metric is bits per character (BPC), used for character-level models. BPC = cross-entropy loss / ln(2). A BPC of 1.2 means the model needs on average 1.2 bits to encode each character. The theoretical minimum for English text is about 1.0–1.3 BPC (Shannon's estimate). State-of-the-art character models achieve ~1.08 BPC.
Multiply 0.9 by itself 100 times: 0.000027. Multiply 1.1 by itself 100 times: 13,781. That's the vanishing and exploding gradient problem in two numbers.
In BPTT, the gradient of the loss at timestep T with respect to the hidden state at timestep t involves a product of Jacobians:
Each factor ∂hk/∂hk−1 is the Jacobian of the recurrence hk = tanh(Wh hk−1 + ...). For a simplified 1D case, this is just Wh · tanh'(z). Since tanh'(z) ≤ 1, if |Wh| < 1, each factor has magnitude < 1. Multiply T − t of these factors and the product decays exponentially.
More precisely, the product's magnitude is governed by the largest eigenvalue of the weight matrix Wh. If the largest eigenvalue λmax < 1, gradients vanish. If λmax > 1, gradients explode. The sweet spot λmax = 1 is unstable — the slightest perturbation pushes you toward vanishing or exploding.
Consider the sentence: "The cat, which the family adopted from the shelter last summer because their children had been begging for a pet for years, was happy." The verb "was" must agree with the subject "cat" — 15 words earlier. For the RNN to learn this agreement, the gradient from the loss at "was" must travel 15 timesteps back to influence how "cat" is processed. If λmax = 0.9, the gradient at "cat" is 0.915 = 0.206 — already reduced by 80%. At 0.8, it's 0.815 = 0.035. At 0.7, it's 0.005. The learning signal evaporates.
Below, explore this interactively. The left panel shows how a single value changes under repeated multiplication — adjust the eigenvalue slider to see exponential growth, stability, or decay. The right panel shows an unrolled RNN with gradient arrows whose thickness represents gradient magnitude at each timestep.
Drag the eigenvalue slider. Left panel: repeated multiplication curve. Right panel: gradient thickness flowing back through the RNN. At 0.9, early gradients vanish. At 1.1, they explode.
Exploding gradients have a simple fix: gradient clipping. If the gradient norm exceeds a threshold, scale it down.
This caps the step size without changing the direction. Typical threshold: 1.0 or 5.0. It's a standard practice for any RNN training and only adds one line of code:
python # After loss.backward(), before optimizer.step() torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
But vanishing gradients have no such simple fix. You can't amplify a gradient that's already zero — you've lost the signal entirely. The information about what happened 50 timesteps ago is gone. The network has "forgotten." Solving this requires a fundamentally different architecture — one where information can flow through time without being multiplied by Wh at every step.
Let's be precise about why the gradient vanishes. The recurrence is ht = tanh(Wh ht−1 + Wx xt + b). The Jacobian of ht with respect to ht−1 is:
The diag(1 − ht2) term comes from the derivative of tanh. Since tanh outputs values in (−1, 1), the diagonal entries are all in (0, 1). This means each Jacobian has norm less than ||Wh||. Over T timesteps, the product of T such Jacobians has norm bounded by ||Wh||T (approximately).
If ||Wh|| < 1, the product goes to zero. If ||Wh|| > 1, the product goes to infinity. There's no middle ground — the boundary between vanishing and exploding is a knife edge.
Bengio et al. (1994) proved this theoretically and confirmed it experimentally. They showed that a vanilla RNN trained on a simple sequence memorization task could reliably learn dependencies up to about 10 timesteps. Beyond 20 timesteps, performance dropped to chance. The network literally could not propagate information that far.
This has practical consequences. An RNN processing a document can't connect a pronoun to its referent if they're more than ~10 words apart. It can't learn that the subject of a long sentence determines the verb conjugation at the end. It can't track that a question mark at position 1 means the sentence is a question at position 30. Short-range patterns: yes. Long-range dependencies: no.
Let's trace the gradient through 3 timesteps to make this concrete. Consider a simple 1D RNN: ht = tanh(wh ht−1 + wx xt), with wh = 0.8 and all hidden states near 0 (so tanh' ≈ 1).
The gradient of h3 with respect to h0 is:
After 3 steps, the gradient has already lost half its magnitude. Extend to 10 steps:
After 10 steps: 89% of the gradient signal is lost. After 50 steps: 0.850 = 0.000014. The learning signal for long-range dependencies is 70,000 times weaker than for short-range dependencies. The network will learn local patterns (3–5 words apart) quickly, and never learn distant patterns at all.
Now consider wh = 1.1 (slightly greater than 1):
The gradient explodes. Weight updates become enormous, weights fly to infinity, and training diverges. You'll see NaN in your loss and the model is destroyed. Gradient clipping prevents this, but the vanishing problem remains unsolved by clipping.
What if the network could choose what to remember? Instead of blindly multiplying by Wh at every step, what if it had gates — learned valves that open and close to control information flow? That's the Long Short-Term Memory (LSTM) cell, invented by Hochreiter and Schmidhuber in 1997.
The key innovation is the cell state ct — a separate memory track that runs alongside the hidden state ht. Information can be added to or removed from the cell state via gates, but the default operation is addition, not multiplication. This is the crucial difference.
An LSTM has three gates, each a sigmoid layer that outputs values between 0 (fully closed) and 1 (fully open):
Forget gate ft: decides what to erase from the cell state. "Should I forget the old subject now that a new clause has started?"
Input gate it: decides what new information to write. "Is this word important enough to store?"
Output gate ot: decides what to read out from the cell state. "What part of my memory is relevant right now?"
First, compute a candidate new value to potentially write:
Then update the cell state by forgetting some old information and adding some new:
And compute the hidden state (the output of this cell):
The symbol ⊙ means element-wise multiplication. Each gate independently controls each dimension of the cell state. This is important: gate values are not scalars. Each dimension of the cell state has its own forget/input/output gate value. The network can choose to forget information in dimension 3 while retaining it in dimension 47.
Let's trace one step with a 2D cell state (tiny, but reveals the mechanics). Suppose:
ht−1 = [0.3, −0.5] ct−1 = [0.8, 0.2] xt = embedding of "sat"
After applying the gate formulas (with learned weights), suppose we get:
Read this: "Keep 90% of cell dimension 0 (where we stored the subject 'cat'). Forget 70% of dimension 1 (less important info). Write new info mostly to dimension 1." Now compute:
Dimension 0 went from 0.8 to 0.77 — almost unchanged. The forget gate (0.9) preserved the stored subject information. Dimension 1 went from 0.2 to −0.26 — mostly overwritten with new information about "sat." This selective memory is what makes LSTMs powerful.
Below, step through words one at a time and watch the three gates activate. The gate "fill levels" show their activations (0 = closed/empty, 1 = open/full). The cell state bar tracks long-term memory. Compare to a vanilla RNN: the LSTM selectively remembers and forgets.
Click "Next Word" to advance through a sentence. Watch the forget, input, and output gates open and close. The cell state bar shows long-term memory strength — compare with a vanilla RNN that forgets rapidly.
The Gated Recurrent Unit (GRU), proposed by Cho et al. in 2014, simplifies the LSTM to two gates and no separate cell state. It's cheaper to compute and often performs comparably.
| Feature | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| Cell state | Separate ct track | No separate track — uses ht only |
| Parameters | 4 × (dh2 + dh × dx) | 3 × (dh2 + dh × dx) |
| Typical use | Default for sequences | When compute is limited |
| Performance | Slight edge on long sequences | Comparable on most tasks |
GRU equations for reference:
The update gate zt plays a dual role: it controls both how much old state to keep (1 − zt) and how much new state to write (zt). When zt = 0, the hidden state passes through unchanged. When zt = 1, the old state is completely replaced. The reset gate rt controls how much of the old hidden state to use when computing the candidate.
Let's trace the gradient through the LSTM cell state update. Recall ct = ft ⊙ ct−1 + it ⊙ c̃t. The gradient of cT with respect to ct (many steps earlier) is:
Compare this to the vanilla RNN gradient ∏ Wh · diag(tanh'). The LSTM gradient is just a product of scalar forget gate values, each between 0 and 1. If the forget gates are close to 1 (meaning "keep everything"), the gradient flows almost without decay. No matrix multiplication, no eigenvalue issues.
This is why the forget gate bias is often initialized to a positive value (1.0 or 2.0) — to ensure the gates start open and gradients flow. Gers et al. (2000) showed this initialization trick dramatically improved LSTM training. The forget gate bias is one of those small details that makes a huge practical difference.
In the standard LSTM, gates only see ht−1 and xt. They don't see the cell state ct−1 directly. Peephole connections (Gers & Schmidhuber, 2000) add ct−1 as an input to the gates:
This lets the gates "peek" at the cell state when deciding what to forget or remember. Peephole connections help on tasks requiring precise timing (e.g., counting timesteps), but they add parameters and are not used in most modern implementations. PyTorch's nn.LSTM does not include peephole connections by default.
An LSTM has 4 sets of weight matrices (one for each of the 3 gates + the candidate), while a vanilla RNN has just 1. For hidden_dim = 512 and embed_dim = 256:
| Model | Weight Matrices | Total Parameters (approx) |
|---|---|---|
| Vanilla RNN | Wh [512×512] + Wx [512×256] | 393K |
| LSTM | 4 × (W [512×768]) = 4 sets | 1.57M (4x more) |
| GRU | 3 × (W [512×768]) = 3 sets | 1.18M (3x more) |
LSTMs are ~4x more expensive per step than vanilla RNNs. But they work on sequences where vanilla RNNs fail. In practice, a 2-layer LSTM with 650 hidden units was the state-of-the-art language model architecture from 2016–2018 (Zaremba et al., Merity et al.).
The best LSTM language models use several regularization and architectural tricks that are worth knowing:
Weight tying (Press & Wolf, 2017): The embedding matrix E [|V| × d] and the output weight matrix Wy [d × |V|] serve symmetric roles — one maps word IDs to vectors, the other maps vectors back to word IDs. Sharing (tying) these weights halves the model's parameter count and acts as regularization. In code: model.output.weight = model.embed.weight.
Variational dropout (Gal & Ghahramani, 2016): Standard dropout applies a different random mask at every timestep. This breaks the temporal structure — the hidden state is corrupted unpredictably at each step. Variational dropout uses the same mask at every timestep within a sequence. This preserves temporal coherence while still regularizing.
Weight dropout (Merity et al., 2018): Instead of dropping activations, drop random weights in Wh before the forward pass. This is equivalent to a random perturbation of the recurrence matrix and provides strong regularization without disrupting hidden state dynamics.
ASGD (averaged SGD): After initial training with SGD, switch to averaged SGD, which maintains a running average of the weights over training steps. The averaged weights often generalize better than the final weights. This trick gave a 2–3 point perplexity improvement in the AWD-LSTM paper.
python class LSTMLanguageModel(nn.Module): def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2): super().__init__() self.embed = nn.Embedding(vocab_size, embed_dim) self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, dropout=0.3, batch_first=True) self.proj = nn.Linear(hidden_dim, vocab_size) self.drop = nn.Dropout(0.3) def forward(self, x, hidden=None): # x: [batch, seq_len] e = self.drop(self.embed(x)) # [batch, seq_len, embed_dim] out, hidden = self.lstm(e, hidden) # out: [batch, seq_len, hidden_dim] logits = self.proj(self.drop(out)) # [batch, seq_len, vocab_size] return logits, hidden
Time to see it all come together. Below is a character-level RNN language model running live in your browser. It reads one character at a time, updates its hidden state, and outputs a probability distribution over the next character. You can watch the hidden state evolve, see how probabilities shift, and control the "creativity" of generation with a temperature slider.
This is the exact architecture Andrej Karpathy demonstrated in "The Unreasonable Effectiveness of Recurrent Neural Networks" — a simple character-level LSTM that, given enough training data, can generate surprisingly coherent text. Our version uses pre-computed weights trained on a small English corpus.
Before sampling, we divide the logits by a temperature parameter T:
T → 0: The distribution becomes a spike on the most likely character. Output is deterministic but repetitive. The model picks the safest option every time.
T = 1.0: The original distribution. The model samples according to its learned probabilities. This is the "true" model.
T → ∞: The distribution becomes uniform. Every character is equally likely. Output is random noise.
The canvas below has three sections. The top shows the seed text and generated output. The middle displays the hidden state as a colored heatmap strip (each cell is one hidden unit). The bottom shows probability bars for the top-10 next characters. Click "Generate" to start auto-generating. Use the temperature slider to control creativity. Pause and resume at any time.
Type a seed, then click "Generate" to watch the RNN extend it character by character. The hidden state heatmap and probability bars update in real time. Adjust temperature to control creativity.
Try different temperatures. At 0.2, the model repeats common patterns ("the the the..."). At 1.0, it produces mostly-English-like text with occasional errors. At 2.0, it becomes creative but makes frequent mistakes. This temperature-quality tradeoff is universal across all language models.
Watch the heatmap as the model generates. Some hidden units "light up" consistently after spaces (word boundary detectors). Others activate inside quotes. Others track whether we're in the middle of a word or at the start. The network discovered these features automatically through training — nobody programmed them.
Karpathy found individual hidden units that tracked parenthesis depth, quote state, line position, and even whether the current text was a comment in code. The RNN learned useful representations of text structure from raw characters alone.
Word-level language models have a fixed vocabulary. If the model has never seen "TensorFlow" during training, it can't generate or process it. Character-level models work with just ~100 characters (letters, digits, punctuation). They can generate any word — including novel ones — character by character.
The tradeoff: character-level models need much longer sequences to capture the same context. "The cat sat on the mat" is 6 tokens at the word level but 26 characters. The RNN must carry information through 4x more timesteps. This makes the vanishing gradient problem even worse — which is why character-level models really need LSTMs, not vanilla RNNs.
| Aspect | Word-Level | Character-Level |
|---|---|---|
| Vocabulary | 50,000+ words | ~100 characters |
| OOV problem | Yes — unseen words fail | No — can generate any string |
| Sequence length | Short (20-50 steps) | Long (100-500 steps) |
| Output layer | Huge softmax (50K classes) | Tiny softmax (~100 classes) |
| Training | Faster per epoch | Slower (longer sequences) |
| Modern replacement | Subword tokenization (BPE) — the best of both worlds | |
The model is trained with cross-entropy loss at each timestep: given the true next character, maximize its log probability. In code:
python # Training loop for char-level RNN for batch in data_loader: inputs, targets = batch # [batch, seq_len] character indices logits, hidden = model(inputs, hidden) # Reshape for cross-entropy: flatten time dimension loss = F.cross_entropy( logits.view(-1, vocab_size), # [batch*seq_len, vocab_size] targets.view(-1) # [batch*seq_len] ) loss.backward() torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0) optimizer.step() hidden = hidden.detach() # truncated BPTT: stop gradient at chunk boundary
The hidden.detach() call is crucial. It carries the hidden state forward (so the model sees long-range context) but cuts the gradient (so backprop only flows through the current chunk). This is truncated BPTT in action.
The quality of generated text depends heavily on three factors:
1. Training data quality and quantity. Karpathy trained on 1MB of Shakespeare (~500K characters). The model learned iambic pentameter, character names, stage directions. With Linux source code, it learned to open and close brackets, write comments, and use proper indentation. The model mirrors its training data's style with surprising fidelity.
2. Hidden state size. A hidden state of 64 units can learn basic word patterns but not long-range structure. 256 units can track sentence-level structure. 512–1024 units (the typical range for serious char-level models) can learn paragraph-level coherence and even some semantic patterns.
3. Architecture depth. A single-layer LSTM generates locally coherent text. A 2–3 layer stacked LSTM generates text with better long-range structure, proper paragraph transitions, and more consistent style. Beyond 3 layers, gains diminish and training becomes difficult.
Translate: "The agreement on the European Economic Area was signed in August 1992." The LSTM encoder reads all 13 words, updating its hidden state at each step. After the last word, it passes the final hidden state to the decoder. All 13 words of meaning, syntax, entities, and structure — crammed into one vector of 512 numbers.
This is the encoder-decoder architecture (also called sequence-to-sequence or seq2seq) for tasks where input and output are both sequences but of different lengths. Sutskever et al. (2014) showed this works for machine translation, dialogue, and summarization. The encoder reads the source sentence and produces a fixed-size vector. The decoder generates the target sentence one word at a time, conditioned on that vector.
The encoder is an LSTM that reads the input left-to-right. Its final hidden state hT is called the context vector or thought vector — a compressed representation of the entire input. The decoder is a separate LSTM initialized with this context vector. At each step, it predicts the next output word given the context and its own previous outputs.
python class Seq2Seq(nn.Module): def __init__(self, src_vocab, tgt_vocab, d_embed, d_hidden): super().__init__() self.encoder = nn.LSTM(d_embed, d_hidden, batch_first=True) self.decoder = nn.LSTM(d_embed, d_hidden, batch_first=True) self.src_embed = nn.Embedding(src_vocab, d_embed) self.tgt_embed = nn.Embedding(tgt_vocab, d_embed) self.output = nn.Linear(d_hidden, tgt_vocab) def forward(self, src, tgt): # Encode: [batch, src_len] → final hidden state _, (h, c) = self.encoder(self.src_embed(src)) # Decode: use encoder's final state as initial state out, _ = self.decoder(self.tgt_embed(tgt), (h, c)) return self.output(out) # [batch, tgt_len, tgt_vocab]
The bottleneck is obvious: no matter how long the input is, the decoder only gets one vector (h, c). A 5-word sentence? One vector. A 50-word sentence? Same vector. Information gets crushed. In practice, encoder-decoder models with LSTMs work well for short sentences (~20 words) but degrade rapidly on longer inputs. Cho et al. (2014) showed BLEU scores dropping dramatically as source sentence length increased beyond 20 words — the model just couldn't compress that much information.
Sutskever et al. discovered a useful trick: reversing the input sentence. Instead of feeding "the cat sat" as [the, cat, sat], they fed it as [sat, cat, the]. This way, the first words of the input (which are most important for generating the first words of the output) are closest in time to the decoder. This reduced the average distance between corresponding input and output words and improved BLEU scores by several points. But it's a hack — it doesn't solve the fundamental bottleneck.
How much information can a 512-dimensional vector hold? In floating-point representation, each number carries about 23 bits of precision. So a 512-dim vector holds roughly 512 × 23 = 11,776 bits. A typical English sentence has about 5 bits per character (Shannon's estimate), or about 50 bits per word. That means a 512-dim vector can theoretically encode about 235 words of information — enough for short sentences.
But the LSTM doesn't achieve optimal compression. The hidden state must encode not just the meaning of the words but their syntactic structure, word order, entity relationships, and discourse context. In practice, Cho et al. (2014) found performance degrading severely beyond 20 words. The effective capacity is far less than the theoretical maximum.
Attention solves this by providing a "shortcut" — the decoder doesn't need to extract everything from one vector because it can access the encoder states directly. The fixed-size bottleneck becomes an optional summary rather than the sole channel of information.
Below, drag the sentence length slider to see how the "compression pressure" changes. At 5 words, the bottleneck is comfortable. At 30 words, it's glowing red — too much information for one vector. Then click "Toggle Attention" to see what happens when the decoder can look directly at every encoder position instead of relying on a single summary vector.
Slide to change sentence length. Watch the bottleneck pressure indicator. Click "Toggle Attention" to add direct connections from decoder to all encoder positions.
The idea (Bahdanau et al., 2015): instead of compressing the entire input into one vector, let the decoder look back at every encoder hidden state. At each decoding step, compute a weighted sum of all encoder hidden states, where the weights represent "how much should I focus on each input word right now?"
At timestep t, the decoder computes a score between its current state htdec and every encoder state hsenc. These scores are softmaxed into attention weights α. The context vector is a weighted average of encoder states. The decoder uses this context vector (plus its own state) to predict the next word.
This changes everything. When translating "Area" into French, the decoder can attend directly to "Area" and "Economic" in the source, ignoring "August" and "1992." Each output word attends to exactly the input words that are relevant to it. No more bottleneck.
The progression is clean:
| Architecture | Year | Context Mechanism | Limitation |
|---|---|---|---|
| N-gram | 1990s | Count previous n−1 words | Sparsity, no generalization |
| Neural LM | 2003 | Fixed window of embeddings | Window size is fixed |
| RNN | 2010 | Hidden state carries memory | Vanishing gradients, sequential |
| LSTM/GRU | 1997/2014 | Gated cell state + hidden state | Still sequential, bottleneck at long range |
| RNN + Attention | 2015 | Look back at all encoder states | Still sequential computation |
| Transformer | 2017 | Self-attention — attention only, no RNN | O(n²) in sequence length |
The Transformer took the attention mechanism and said: "Why do we need the RNN at all? What if attention IS the architecture?" By removing the sequential dependency, Transformers can process all positions in parallel on GPUs. Training time dropped by an order of magnitude. That's what unlocked GPT, BERT, and every large language model since.
The core attention computation is surprisingly simple:
python def attention(decoder_hidden, encoder_outputs): # decoder_hidden: [batch, d_h] — current decoder state # encoder_outputs: [batch, src_len, d_h] — all encoder hidden states # Score: dot product between decoder state and each encoder state scores = torch.bmm( encoder_outputs, # [batch, src_len, d_h] decoder_hidden.unsqueeze(2) # [batch, d_h, 1] ).squeeze(2) # [batch, src_len] # Normalize to attention weights attn_weights = F.softmax(scores, dim=1) # [batch, src_len] # Weighted sum of encoder outputs context = torch.bmm( attn_weights.unsqueeze(1), # [batch, 1, src_len] encoder_outputs # [batch, src_len, d_h] ).squeeze(1) # [batch, d_h] return context, attn_weights
Three lines of math: score, softmax, weighted sum. That's it. This simple mechanism solved the bottleneck problem and opened the door to Transformers. In the next lecture (L05), we'll see how self-attention — where the query, key, and value all come from the same sequence — replaces recurrence entirely.
The dot product score above is just one option. The three main scoring functions are:
| Name | Formula | Notes |
|---|---|---|
| Dot product | hdecT henc | Simplest, fastest. Requires same dimensionality. |
| Multiplicative | hdecT Wa henc | Learnable Wa adds flexibility. Luong (2015). |
| Additive | vaT tanh(W1hdec + W2henc) | Most flexible. Bahdanau (2015). Two matrices + a vector. |
The Transformer uses scaled dot product attention: divide the dot product by √dk to prevent the softmax from becoming too peaked in high dimensions. This subtle scaling is critical for training stability.
Where Q (queries), K (keys), and V (values) are all derived from the input. In self-attention, Q, K, and V all come from the same sequence, transformed by different learned weight matrices. This allows each position to attend to every other position — and all positions are processed in parallel. No sequential bottleneck, no vanishing gradients across distance, just direct connections. That's the revolution that the next lecture covers.
We've traced the evolution from the simplest possible language model (counting word pairs) through neural language models, recurrent neural networks, and gated architectures. Each solved a problem the previous one couldn't — and each introduced a new limitation that drove the next innovation.
These papers provide the theoretical and experimental foundation for this lesson:
Learning Long-Term Dependencies with Gradient Descent is Difficult (Bengio et al., 1994) — The paper that formally proved why vanilla RNNs can't learn long-range dependencies. Showed that the gradient signal decays (or explodes) exponentially with the number of timesteps, making it theoretically impossible for gradient-based methods to connect distant events in a sequence.
On the Difficulty of Training Recurrent Neural Networks (Pascanu et al., 2013) — A more detailed analysis of the gradient flow problem, with practical solutions including gradient clipping and the echo state property. Showed that the problem is fundamentally about the spectral radius of the recurrence weight matrix.
Attention Is All You Need (Vaswani et al., 2017) — The paper that replaced RNNs entirely with self-attention. Showed that attention alone (no recurrence, no convolution) achieves state-of-the-art on machine translation while training orders of magnitude faster. The birth of the Transformer.
RNNs have been largely replaced by Transformers for language tasks, but they're not dead:
| Domain | Why RNNs/LSTMs persist |
|---|---|
| Edge devices | LSTM models are tiny (1-10M params) and can run on microcontrollers |
| Time series | Real-time sensor data, financial forecasting — streaming, sequential nature fits RNNs |
| Reinforcement learning | LSTM policies are common in robotics and game agents (smaller state spaces) |
| State space models | Mamba (Gu & Dao, 2023) recasts recurrence as selective SSMs — linear-time attention alternative |
| Topic | Why It Matters | Where to Learn More |
|---|---|---|
| Bidirectional RNNs | Read left-to-right AND right-to-left, concat both hidden states. Better representations but can't be used for language generation. | CS224N L05 (ELMo, BERT) |
| Subword tokenization | BPE and WordPiece solve the OOV problem without going to character-level. Used by all modern models. | Tokenization deep-dive |
| Dropout for RNNs | Naive dropout breaks temporal coherence. Variational dropout (same mask at every timestep) is required. | Gal & Ghahramani, 2016 |
| State space models | Mamba, S4, and linear attention bring back the recurrent paradigm with O(n) complexity and no vanishing gradients. | SSM & Mamba lesson |
← L03: Backpropagation and Neural Networks — The training algorithm that makes all of this possible.
L05: Transformers and Self-Attention → — Removing recurrence entirely. Attention is all you need. (Coming soon)
Penn Treebank (PTB) perplexity scores show the progression clearly:
| Year | Model | Test Perplexity |
|---|---|---|
| 2003 | KN 5-gram (count-based) | 141 |
| 2012 | RNN-LM (Mikolov) | 124 |
| 2014 | LSTM (Zaremba, large) | 82 |
| 2017 | AWD-LSTM (Merity) | 58 |
| 2018 | Transformer-XL | 24 |
| 2019 | GPT-2 (1.5B) | ~18 |
Each generation — n-grams to RNNs to LSTMs to Transformers — brought a dramatic improvement. The jump from LSTM to Transformer was particularly striking because it came not from a better learning algorithm, but from a better architecture: one that could be parallelized on GPUs and scale to billions of parameters.
Notice the acceleration: count-based models dominated for ~15 years (1990–2005). Neural LMs for ~7 years (2003–2010). Vanilla RNNs for ~4 years (2010–2014). LSTMs for ~3 years (2014–2017). Transformers have dominated since 2017 — but even within Transformers, the pace of change is extraordinary: BERT (2018), GPT-2 (2019), GPT-3 (2020), PaLM/Chinchilla (2022), GPT-4 (2023), Llama 3 (2024). Each brought qualitative leaps in capability.
| Era | Key Insight |
|---|---|
| N-grams | Language has statistical regularities that can be captured by counting. |
| Neural LMs | Words have continuous representations; similar words should share predictions. |
| RNNs | Networks can have memory; the same computation applied at every step can process variable-length sequences. |
| LSTMs | Additive memory updates (instead of multiplicative) allow information to persist over long distances. |
| Attention | Direct connections between distant positions are better than passing information through a chain. |
| Transformers | Attention alone is sufficient. Remove recurrence and everything can be parallelized. |
Each of these insights seems obvious in retrospect. That's the mark of a great idea — it's hard to see before someone shows you, and impossible to unsee after.
Even though Transformers have replaced RNNs for most language tasks, understanding RNNs is essential for three reasons:
First, RNNs are the simplest model that handles variable-length sequences. They strip away the complexity of multi-head attention, positional encodings, and layer normalization to reveal the core challenge: how do you carry information through a sequence?
Second, the vanishing gradient problem is not specific to RNNs. It appears in any deep computation graph. Understanding it in the RNN context — where it's most dramatic and easiest to visualize — builds intuition that transfers to understanding residual connections, gradient highways, and normalization in Transformers.
Third, the gate mechanism invented for LSTMs reappears everywhere in modern deep learning. Mixture-of-experts routing uses gating. Highway networks use gates. The GRU-like update in state space models (Mamba) is a direct descendant of the LSTM update gate. Understanding gates here means understanding them everywhere.
The history of NLP is not a sequence of dead ends. It's a sequence of insights, each building on the last. N-grams taught us that statistics matter. Neural LMs taught us that representations matter. RNNs taught us that memory matters. LSTMs taught us that selective memory matters. Attention taught us that direct access beats sequential propagation. And Transformers taught us that scale matters. Every lesson in that chain is load-bearing.