CS224N Lecture 4 — Language Models and RNNs

Chapter 0: Why Predict the Next Word?

Your phone keyboard just predicted "you" after "thank." It assigned probabilities to every word in its vocabulary and picked the highest ones to show you. That prediction engine is a language model.

A language model answers one question: given the words so far, what comes next? More precisely, it computes a probability distribution over the entire vocabulary for the next word. "Thank" is followed by "you" with probability 0.62, "goodness" with probability 0.08, "god" with probability 0.05, and so on through every word in the dictionary.

Why does this matter? Because nearly every NLP application you've ever used — translation, summarization, chatbots, code completion, search autocomplete — is, at its core, a language model. Google Translate generates translations one word at a time, each time asking "what's the most probable next word in the target language given the source sentence and the words generated so far?" ChatGPT does the same thing: it produces one token at a time, sampling from a probability distribution.

Even evaluating text quality uses language models. A well-formed English sentence should have high probability under a good language model. "The cat sat on the mat" should score higher than "The cat sat on the the the the." This is how spam filters, grammar checkers, and speech recognition systems rank candidate sentences.

Formally, a language model assigns a probability to a sequence of words:

P(w₁, w₂, ..., w_n)

By the chain rule of probability, we can decompose this into a product of conditional probabilities:

P(w₁, ..., w_n) = ∏_i=1ⁿ P(w_i | w₁, ..., w_i−1)

This is exact — no approximation. The entire challenge of language modeling is computing those conditional probabilities well. How do you estimate P("mat" | "the cat sat on the")? That's the question every model in this lesson tries to answer.

A Worked Example

Let's compute the probability of a tiny sentence: "the cat sat." Using the chain rule:

P("the cat sat") = P("the") × P("cat" | "the") × P("sat" | "the cat")

If P("the") = 0.07 (the most common English word), P("cat" | "the") = 0.002, and P("sat" | "the cat") = 0.05, then:

P("the cat sat") = 0.07 × 0.002 × 0.05 = 0.000007

That's tiny — and this is only three words. For a 20-word sentence, you're multiplying 20 small numbers together. The result is astronomically small. This is why we work with log probabilities in practice: log(0.000007) = −11.87. Sums of logs are much more numerically stable than products of tiny numbers.

Why This Is Hard

The full conditional P(w_i | w₁, ..., w_i−1) depends on all previous words. For a sentence with a vocabulary of 50,000 words, at position 10, we'd need to estimate a distribution conditioned on all possible 9-word histories. That's 50,000⁹ possible histories — far more than any corpus could cover. We need approximations.

Every chatbot, translator, and code assistant is, at its core, a language model — a machine that assigns probabilities to sequences of words. The entire history of NLP, from n-grams in the 1990s to GPT-4 in 2024, is the story of better ways to compute P(next word | context).

Let's see this in action. Below is a simulated phone keyboard. A partial sentence is shown at the top, and a bar chart displays the probabilities of the top candidate next words. Click a word to extend the sentence and watch the distribution shift.

Phone Keyboard Autocomplete

Click a word to extend the sentence. The probability bars update based on context. Click "Reset" to start over.

Click a word bar to extend the sentence.

Notice what happened. Every time you picked a word, the entire distribution shifted. After "I" the model strongly favored "am," "think," "want." After "I want" it shifted to "to," "a," "the." The context determines the distribution. A language model with more context and better modeling of that context will make better predictions.

This lesson walks through three generations of language model: counting (n-grams), neural networks with fixed windows, and neural networks with memory (RNNs). Each solves a problem the previous one couldn't handle. By the end, you'll understand why RNNs were revolutionary — and why they were replaced.

What this lesson covers: N-gram counting models and their sparsity problem. Bengio's 2003 neural language model and why embeddings were a breakthrough. Recurrent neural networks and the hidden state. Training with backpropagation through time. Why vanishing gradients kill long-range memory. LSTM and GRU gates. A live character-level RNN generating text. The attention mechanism that eventually replaced RNNs entirely.

What does a language model output?

A single word — the best prediction A probability distribution over all possible next words A grammaticality score for the input sentence

Chapter 1: N-gram Models

The simplest language model just counts. How often does "the" follow "in"? Count it up, divide by how many times "in" appears total. That ratio is your probability estimate. This is a bigram model — it predicts each word from the one word before it.

The key assumption is the Markov assumption: the probability of a word depends only on the previous n−1 words, not the entire history. For a bigram (n=2), the next word depends only on the current word. For a trigram (n=3), it depends on the previous two words.

P(w_i | w₁, ..., w_i−1) ≈ P(w_i | w_i−n+1, ..., w_i−1)

For a bigram model, this simplifies to:

P(w_i | w_i−1) = count(w_i−1, w_i) / count(w_i−1)

This is just the fraction of times w_i follows w_i−1 in the training corpus. If "the" appears 10,000 times in your data, and "the cat" appears 200 times, then P(cat | the) = 200/10,000 = 0.02. Simple counting. No neural networks, no parameters to learn — just a big lookup table.

The full sentence probability under a bigram model decomposes as:

P(w₁, ..., w_n) = ∏_i=1ⁿ P(w_i | w_i−1)

For a trigram model (n=3), each word depends on the two previous words: P(w_i | w_i−2, w_i−1). The context grows, predictions improve, but the sparsity problem grows much faster.

Building a Bigram Model by Hand

Let's make this concrete. Suppose our entire corpus is four sentences:

the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog ate the bone

We count every pair of adjacent words. "the cat" appears 2 times. "the dog" appears 2 times. "the mat" appears 2 times. "the fish" appears 1 time. "the bone" appears 1 time. "the" appears 10 times total (as the first word of a pair). So P(cat | the) = 2/10 = 0.20, P(dog | the) = 2/10 = 0.20, P(mat | the) = 2/10 = 0.20, and so on.

Let's verify the probabilities sum to 1. P(cat|the) + P(dog|the) + P(mat|the) + P(fish|the) + P(bone|the) = 2/10 + 2/10 + 2/10 + 1/10 + 1/10 = 8/10 = 0.80. Wait — that's not 1.0. What's missing? The remaining 0.20 is spread across words that follow "the" in other positions: "sat" never directly follows "the" in this corpus, but "the" also starts the sentences, giving us "the cat" and "the dog." Looking more carefully, "the" appears as the first word of a bigram 10 times: "the cat" (2), "the mat" (2), "the dog" (2), "the fish" (1), "the bone" (1), and we need one more. In fact "on the" produces the remaining bigrams starting with "the." The exact accounting depends on sentence boundaries, but the principle holds: count, divide, get probabilities.

Now let's compute a sentence probability. Using our bigram model:

P("the cat sat") = P(the|<s>) × P(cat|the) × P(sat|cat) = 1.0 × 0.20 × 0.50 = 0.10

For "the fish ate": P(the|<s>) × P(fish|the) × P(ate|fish) = 1.0 × 0.10 × 0.50 = 0.05. Lower probability — "fish" follows "the" less often than "cat" does. The model captured a real (if crude) pattern from the data.

The widget below builds this bigram table step by step. Click "Next" to advance through the corpus word by word and watch the count table grow. Then click "Generate" to sample a sentence from the learned distribution.

Bigram Model Builder

Click "Next" to advance through the corpus. Watch the bigram count table build. Click "Generate" to sample a new sentence from the learned probabilities.

Step 0 — click Next to start scanning the corpus.

The widget animated the counting. Now you turn the raw counts into a probability table — the one line that IS the bigram model.

The Sparsity Problem

N-gram models have a fatal flaw: sparsity. If your training corpus never contains the phrase "quantum entanglement," then P(entanglement | quantum) = 0. Zero. And because we multiply probabilities across the sentence, one zero kills the entire sentence probability — it collapses to zero.

This isn't just a theoretical problem. Consider estimating P(food | "the chef prepared the") using a 5-gram model. You'd need to find the exact sequence "the chef prepared the" in your corpus. Even a billion-word corpus might not contain that exact 4-word prefix. As n grows, the number of possible n-grams grows as |V|ⁿ, where |V| is vocabulary size. With a 50,000-word vocabulary, the number of possible 5-grams is 50,000⁵ = 3.125 × 10²³. No corpus can cover that space.

Partial fixes exist — smoothing (add a small count to every n-gram, including unseen ones) and backoff (if the 5-gram count is zero, fall back to the 4-gram, then 3-gram, etc.). But these are patches. They don't solve the fundamental issue: similar words get no shared credit. If the model has seen "the dog ran fast" a thousand times, it learns nothing about "the puppy sprinted quickly." The words are completely different strings, and counting treats them as unrelated.

If your corpus never contains "quantum entanglement," P(entanglement | quantum) = 0. One missing pair and the entire sentence probability collapses to zero. N-gram models can't generalize to unseen word combinations — they're prisoners of their training data.

Storage Blows Up

There's a practical problem too. An n-gram model stores explicit counts for every observed n-gram. Google's n-gram dataset from 2006 contained 1 trillion words. The number of unique 5-grams in that dataset: 1.18 billion. Storing all of them with counts requires gigabytes of memory. And this is still sparse — the vast majority of possible 5-grams were never observed.

n	Name	Context Window	Unique n-grams (Google 1T)
1	Unigram	0 words	13 million
2	Bigram	1 word	314 million
3	Trigram	2 words	977 million
4	4-gram	3 words	1.31 billion
5	5-gram	4 words	1.18 billion

And even with trillions of words, most reasonable 5-grams are still unseen. The fundamental issue is that language is compositional — a small vocabulary combines into an astronomical number of valid sequences. Counting can never cover that space. We need a model that can generalize.

Smoothing: A Patch, Not a Fix

The simplest smoothing method is Laplace smoothing (add-1): add 1 to every count, including unseen n-grams. This guarantees no probability is ever zero.

P_Laplace(w_i | w_i−1) = (count(w_i−1, w_i) + 1) / (count(w_i−1) + |V|)

But this is too aggressive — it steals far too much probability mass from observed n-grams and spreads it over the astronomically large number of unseen ones. A seen bigram that appeared 100 times might have its probability cut in half.

Kneser-Ney smoothing is much better. Instead of adding a fixed count, it subtracts a small discount d (typically 0.75) from each observed count and distributes that mass using a clever continuation probability. Kneser-Ney was the gold standard for n-gram models from 1995 until neural models took over in the 2010s.

But no smoothing method can solve the fundamental problem: if the model has never seen any context similar to "the quantum computer calculated," it has no useful signal. All smoothed probabilities for unseen contexts are essentially uniform — the model is guessing randomly. We need representations that encode similarity.

N-grams in Code

python
from collections import Counter, defaultdict

class BigramLM:
    def __init__(self):
        self.bigram_counts = defaultdict(Counter)
        self.unigram_counts = Counter()

    def train(self, corpus):
        # corpus: list of sentences, each a list of words
        for sentence in corpus:
            for i in range(len(sentence) - 1):
                w1, w2 = sentence[i], sentence[i+1]
                self.bigram_counts[w1][w2] += 1
                self.unigram_counts[w1] += 1

    def prob(self, w2, w1, alpha=0.01):
        # P(w2 | w1) with add-alpha smoothing
        count = self.bigram_counts[w1][w2]
        total = self.unigram_counts[w1]
        vocab_size = len(self.unigram_counts)
        return (count + alpha) / (total + alpha * vocab_size)

Why do n-gram models fail on rare word combinations?

They need too much computation They can only handle sentences shorter than n words They estimate probability as zero for any n-gram not seen in training (sparsity)

Chapter 2: Neural Language Models

What if "cat" and "kitten" being similar words could help predictions? If the model has seen "the cat sat on the mat," it should give some probability to "the kitten sat on the mat" even without ever seeing that exact phrase. N-gram models can't do this — "cat" and "kitten" are just different strings. But a neural network can map words to continuous vectors where similar words live nearby, and then operate on those vectors instead of discrete strings.

This is exactly what Bengio et al. proposed in 2003 in a landmark paper titled "A Neural Probabilistic Language Model." The idea: replace the giant lookup table of n-gram counts with a neural network that takes a fixed window of previous words, looks up their embeddings (dense vectors of ~100-300 dimensions), and predicts the next word.

The Architecture

Here's how the fixed-window neural language model works, step by step:

1. Embedding Lookup

Each of the previous n words is mapped to a dense vector via a learned embedding matrix E. Shape: n words × d dimensions.

↓

2. Concatenate

The n embedding vectors are concatenated into one long vector. Shape: [n × d].

↓

3. Hidden Layer

A fully connected layer with tanh activation. Learns nonlinear combinations of features. Shape: [n × d] → [h].

↓

4. Output + Softmax

Another linear layer maps to vocabulary size, then softmax gives a probability distribution. Shape: [h] → [|V|].

The magic is in step 1. The embedding matrix E is a [|V| × d] matrix where each row is a word's vector. "Cat" and "kitten" end up with similar vectors because they appear in similar contexts during training. So any pattern learned about "cat" automatically applies to "kitten" — the network has generalized through the embedding space.

Below is an interactive diagram of this architecture. Hover over each component to see its tensor shape and purpose. The 4-word window feeds into embedding lookup, concatenation, a hidden layer, and finally softmax output.

Fixed-Window Neural Language Model

Hover over or tap each component to see its tensor shape and role. The input is a 4-word window; the output is a probability distribution over the vocabulary.

Neural LM vs. N-gram: A Comparison

Property	N-gram Model	Neural LM (Bengio 2003)
Representation	Words are discrete symbols	Words are continuous vectors
Generalization	None — each n-gram independent	Similar words share predictions
Unseen n-grams	Zero probability (without smoothing)	Non-zero — embeddings interpolate
Context size	n−1 words (fixed, typically 3-5)	n−1 words (fixed, typically 4-8)
Parameters	Counts table (sparse, huge)	Dense matrices (~1-10M params)
Training	Just counting	Backpropagation + gradient descent

The Window Problem

But there's a catch. The neural LM still uses a fixed window. If your window is 4 words, you can't see the 5th word back. Want more context? Make the window bigger. But each additional word adds d parameters to the concatenated vector, which means a larger weight matrix for the hidden layer. A 10-word window with 300-dimensional embeddings creates a concatenated vector of 3,000 dimensions. The hidden layer weight matrix is [3000 × h] — that's a lot of parameters and a lot of computation.

Worse, the model treats position 1 and position 4 completely differently. The weights applied to the first word in the window are entirely separate from the weights applied to the fourth word. There's no parameter sharing across positions. The model learns "what does word-in-position-1 mean for the prediction" and "what does word-in-position-4 mean for the prediction" as separate things, even though both are just "a word in the context."

The Tensor Shapes

Let's trace the exact dimensions for a concrete model. Suppose vocab_size = 10,000, embed_dim = 128, context_len = 4, hidden_dim = 256.

Component	Input Shape	Weight Shape	Output Shape
Embedding lookup	[batch, 4] (integers)	E: [10000, 128]	[batch, 4, 128]
Concatenation	[batch, 4, 128]	—	[batch, 512]
Hidden layer	[batch, 512]	W₁: [512, 256]	[batch, 256]
Output + softmax	[batch, 256]	W₂: [256, 10000]	[batch, 10000]

Total trainable parameters: 10,000 × 128 (embeddings) + 512 × 256 (hidden) + 256 × 10,000 (output) ≈ 3.97M parameters. The output layer alone is 2.56M — the most expensive part, because it maps from hidden_dim to the full vocabulary. This "softmax bottleneck" is a recurring theme: the output layer dominates parameter count for language models with large vocabularies.

Later, weight tying (sharing the embedding matrix with the output matrix) and adaptive softmax (hierarchical output for rare words) became standard tricks to reduce this cost. We'll see weight tying again in the LSTM chapter.

The window size is baked in. A 4-word window can't see the 5th word back. Want more context? You'd need a different architecture — one where the context size isn't hardwired into the number of parameters. That's exactly what RNNs provide.

python
import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_len):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)      # [|V|, d]
        self.hidden = nn.Linear(context_len * embed_dim, hidden_dim) # [n*d, h]
        self.output = nn.Linear(hidden_dim, vocab_size)              # [h, |V|]

    def forward(self, x):
        # x: [batch, context_len] — integer word indices
        e = self.embeddings(x)         # [batch, context_len, embed_dim]
        e = e.view(e.size(0), -1)      # [batch, context_len * embed_dim] — concat
        h = torch.tanh(self.hidden(e))  # [batch, hidden_dim]
        logits = self.output(h)         # [batch, vocab_size]
        return logits                    # apply softmax for probabilities

Notice how context_len is baked into the hidden layer's input dimension. Changing the context length means rebuilding the entire model. We need an architecture where context length is variable.

Why This Was Revolutionary

Bengio's 2003 paper was a watershed moment. Before it, NLP was entirely count-based. The paper showed three things:

First, that word embeddings — dense vectors learned from data — could capture semantic similarity automatically. Words appearing in similar contexts got similar vectors, without anyone telling the model that "cat" and "kitten" are related.

Second, that a neural network could outperform n-gram models on perplexity, the standard language model metric. This was controversial at the time — many researchers believed that decades of engineering on smoothing and backoff had brought count-based models to near-optimal performance. Bengio showed there was a lot more room for improvement.

Third, and most importantly, it proved that neural language models could generalize to unseen word combinations. If the model learned that "the dog ran in the park" is probable, it automatically assigned some probability to "the cat ran in the park" because "dog" and "cat" have similar embeddings. N-gram models could never do this.

The embedding matrix is the first neural NLP breakthrough. It maps discrete symbols (word IDs) to continuous vectors where similarity has meaning. This exact idea — learned embeddings — is still at the heart of every language model in 2025, including GPT-4 and Llama.

What problem do word embeddings solve that n-grams can't?

Generalization — similar words share predictions through continuous vector space Speed — neural models are always faster than counting Memory — neural models always use less storage

Chapter 3: Recurrent Neural Networks

What if the network had memory? Instead of cramming the last n words into a fixed-size window, it reads one word at a time and maintains an internal state — a running summary of everything it has seen so far. That's the core idea of a recurrent neural network (RNN).

At each timestep t, the RNN takes two inputs: the current word x_t (as an embedding vector) and the previous hidden state h_t−1 (the memory). It combines them to produce a new hidden state h_t, which serves as the memory for the next timestep. The hidden state is initialized to zeros at the start of a sequence.

This is the Elman network, proposed by Jeff Elman in 1990. It's the simplest possible recurrent architecture, and it remained the default for 7 years until LSTMs replaced it in 1997. The key insight: by feeding the hidden state back into itself, the network can theoretically remember information from arbitrarily far in the past. The word "theoretically" is doing a lot of work here — we'll see why in Chapter 5.

h_t = tanh(W_h h_t−1 + W_x x_t + b)

Let's unpack each piece:

W_h is the hidden-to-hidden weight matrix [d_h × d_h]. It determines how the previous memory is transformed. Think of it as the network's "memory filter" — which parts of the old memory matter for the new state.

W_x is the input-to-hidden weight matrix [d_h × d_x]. It determines how the current word influences the new state. Think of it as the "new information integrator."

b is a bias vector [d_h], and tanh squashes the result to [−1, 1], keeping values bounded.

To predict the next word, we pass h_t through a linear layer and softmax:

ŷ_t = softmax(W_y h_t + b_y)

Here ŷ_t is a probability distribution over the entire vocabulary. We pick the word with the highest probability (or sample from the distribution).

Parameter sharing: the SAME W_h and W_x are used at every timestep. The model doesn't grow with input length. Whether you feed it 5 words or 500 words, the number of parameters stays the same. This is the fundamental advantage over fixed-window models.

The Unrolled View

An RNN is often drawn as a loop, but it's easier to understand when unrolled across time. Each timestep is a copy of the same network with the same weights, connected by the hidden state flowing left to right.

Below, click "Next Word" to advance through a sentence one word at a time. Watch the hidden state (shown as a heatmap strip) update at each step. The arrow from the previous h to the current h shows the recurrence — that's what gives the RNN its memory.

RNN Unrolled Across Time

Click "Next Word" to advance. The hidden state heatmap shows how the internal representation evolves. The recurrent arrow carries memory from each step to the next.

Step 0 — hidden state is initialized to zeros.

What Lives in the Hidden State?

The hidden state h_t is the RNN's compressed representation of the entire sequence so far. After processing "the cat sat on the," h₅ contains some encoding of all five words — enough information (we hope) to predict that "mat" is likely next.

But it's a fixed-size vector. Whether the sentence is 5 words or 50 words, h_t is always the same dimensionality (typically 128–1024). The RNN must compress everything into this bottleneck. Short sentences? Easy. Long sentences? The early words get overwritten as new words push in. We'll see exactly why this is a problem in Chapter 5.

Think of the hidden state as a whiteboard. At each timestep, the RNN reads the current word and updates the whiteboard. But the whiteboard has a fixed size. Early on, there's plenty of room. After 50 words, the whiteboard is full and every new word requires erasing something old. The question is: does the RNN erase the right things? Vanilla RNNs are bad at this — they tend to overwrite important early information with less important recent information. LSTMs fix this with explicit gates that control what gets erased.

RNNs for Language Modeling

To use an RNN as a language model, we feed it one word at a time and ask it to predict the next word at each step. The loss is computed at every timestep:

L = −(1/T) ∑_t=1^T log P(w_t+1 | h_t)

Where h_t is the hidden state after processing w₁, ..., w_t, and P(w_t+1 | h_t) comes from the softmax output layer. This is the standard language model training objective — minimize cross-entropy between predicted and true next words.

An RNN From Scratch

python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # [|V|, d_x]
        self.W_h = nn.Linear(hidden_dim, hidden_dim)        # h_{t-1} → h_t
        self.W_x = nn.Linear(embed_dim, hidden_dim)         # x_t → h_t
        self.W_y = nn.Linear(hidden_dim, vocab_size)        # h_t → logits
        self.hidden_dim = hidden_dim

    def forward(self, x, h=None):
        # x: [batch, seq_len] — integer indices
        batch = x.size(0)
        if h is None:
            h = torch.zeros(batch, self.hidden_dim)  # h_0 = 0

        outputs = []
        for t in range(x.size(1)):              # loop over timesteps
            x_t = self.embed(x[:, t])               # [batch, d_x]
            h = torch.tanh(self.W_h(h) + self.W_x(x_t))  # [batch, d_h]
            outputs.append(h)

        h_stack = torch.stack(outputs, dim=1)    # [batch, seq_len, d_h]
        logits = self.W_y(h_stack)                  # [batch, seq_len, |V|]
        return logits, h

Key observation: the for loop over timesteps is sequential. Step t depends on step t−1. Unlike a feedforward network where all computations can be parallelized, the RNN must process the sequence in order. This sequential bottleneck is one reason RNNs were eventually replaced by Transformers (which process all positions in parallel).

A Concrete Numerical Example

Let's trace an RNN with hidden_dim = 3 and embed_dim = 2 through two timesteps. This is artificially tiny, but it makes the math concrete.

Suppose our word embeddings are: "the" = [0.5, −0.3], "cat" = [0.8, 0.2]. And our weight matrices are:

W_h = [[0.1, 0.2, −0.1], [0.3, −0.2, 0.4], [0.0, 0.5, −0.3]]
W_x = [[0.4, −0.2], [0.1, 0.6], [−0.3, 0.5]]
b = [0.0, 0.0, 0.0]

Step 1 (x = "the"): h₀ = [0, 0, 0] (initialized to zeros).

W_hh₀ + W_xx₁ = [0,0,0] + [0.4(0.5) + (−0.2)(−0.3), 0.1(0.5) + 0.6(−0.3), (−0.3)(0.5) + 0.5(−0.3)]
= [0, 0, 0] + [0.26, −0.13, −0.30] = [0.26, −0.13, −0.30]
h₁ = tanh([0.26, −0.13, −0.30]) = [0.254, −0.129, −0.291]

Step 2 (x = "cat"): Now h₁ feeds back in through W_h.

W_hh₁ = [0.1(0.254) + 0.2(−0.129) + (−0.1)(−0.291), ...] = [0.029, −0.014, 0.023]
W_xx₂ = [0.4(0.8) + (−0.2)(0.2), 0.1(0.8) + 0.6(0.2), ...] = [0.28, 0.20, −0.14]
h₂ = tanh([0.029 + 0.28, −0.014 + 0.20, 0.023 + (−0.14)]) = tanh([0.309, 0.186, −0.117])
h₂ = [0.299, 0.184, −0.117]

Notice how h₂ is different from h₁ — it encodes information from both "the" and "cat." The contribution of "the" is indirect, filtered through W_h. After many more timesteps, the influence of "the" would be filtered through W_h many times — which is exactly why gradients vanish.

The widget at the top of this chapter animated the hidden state evolving. Now build the recurrence yourself with the exact W_h, W_x, and embeddings from the worked example above — and watch what happens when you forget the memory term.

Multi-Layer (Stacked) RNNs

A single-layer RNN computes one hidden state per timestep. A stacked RNN feeds the output of layer 1 into layer 2, and so on. Each layer learns increasingly abstract representations.

h_t⁽¹⁾ = tanh(W_h⁽¹⁾ h_t−1⁽¹⁾ + W_x⁽¹⁾ x_t)
h_t⁽²⁾ = tanh(W_h⁽²⁾ h_t−1⁽²⁾ + W_x⁽²⁾ h_t⁽¹⁾)

Layer 1 might learn surface-level patterns (word boundaries, part of speech). Layer 2 might learn syntactic patterns (phrase structure, subject-verb agreement). In practice, 2–3 layers work best for RNN language models. More layers make vanishing gradients worse (gradients must flow through both time AND depth), so residual connections are often added between layers.

The best RNN language model (AWD-LSTM, Merity et al. 2018) used 3 LSTM layers with 1150 hidden units each, plus dropout between layers, weight tying (sharing the embedding and output weight matrices), and weight dropout within the LSTM. These regularization tricks were essential — without them, deep RNNs overfit badly.

Advantages and Limitations

Advantage	Limitation
Can handle any input length	Sequential processing — can't parallelize across timesteps
Parameter sharing across time	Fixed-size hidden state is a bottleneck for long sequences
Captures word order (unlike bag-of-words)	Vanishing gradients limit effective memory to ~5-10 steps
Simple architecture, easy to implement	Slow training on long sequences (no GPU parallelism in time)

What carries information from earlier words to later predictions in an RNN?

The embedding matrix — it stores all word relationships The hidden state — it's updated at each timestep and passed forward The softmax output — it feeds back into the next step

Chapter 4: Training RNNs: BPTT

How do you train a network that recycles the same weights at every step? You unroll it across time and apply backpropagation through the entire chain. This is called Backpropagation Through Time (BPTT).

At each timestep t, the RNN produces a prediction ŷ_t and incurs a loss (typically cross-entropy between the predicted distribution and the true next word). The total loss is the sum (or average) across all timesteps:

L = (1/T) ∑_t=1^T L_t = −(1/T) ∑_t=1^T log P(w_t | w₁, ..., w_t−1)

This is the negative log-likelihood of the training data under the model — the standard language model training objective. Lower is better. A perfect model that assigns probability 1.0 to every true next word achieves loss 0.

The BPTT Procedure

1. Forward pass

Process the entire sequence left-to-right, computing h_t and ŷ_t at each step. Store all intermediate values.

↓

2. Compute losses

Cross-entropy loss at each timestep t between ŷ_t and the true word w_t+1.

↓

3. Backward pass

Backpropagate gradients from the last timestep T to the first. Gradients flow backward through the unrolled chain.

↓

4. Accumulate gradients

Since the same W_h, W_x are used at every step, their gradients are SUMMED across all timesteps.

↓

5. Update weights

Apply one gradient descent step to update all parameters.

Step 4 is the crucial insight. Because W_h appears at every timestep, we compute ∂L/∂W_h by summing the contributions from every single timestep:

∂L/∂W_h = ∑_t=1^T ∂L_t/∂W_h

Each term ∂L_t/∂W_h requires backpropagating through all timesteps from t back to 1. The gradient from the loss at step T must travel backward through T matrix multiplications to reach step 1. For a 100-word sentence, that's 99 multiplications by the Jacobian ∂h_t/∂h_t−1.

Compare this to a feedforward network, where each weight matrix appears exactly once. In a feedforward net with L layers, the gradient passes through L matrix multiplications. In an RNN processing T timesteps, the gradient passes through T multiplications — and T can be much larger than L. A typical feedforward net has 3–10 layers. An RNN might process 100+ timesteps. That's an order-of-magnitude more multiplications for the gradient to survive.

Below, watch the forward and backward passes animate through an unrolled RNN. Click "Forward" to see activations flow left-to-right (teal). Click "Backward" to see gradients flow right-to-left (orange). Notice how the gradient intensity fades as it travels further back in time.

BPTT: Forward and Backward Passes

Click "Forward" to animate the forward pass (teal, left-to-right). Click "Backward" to animate the backward pass (orange, right-to-left). Watch how gradient strength fades at earlier timesteps.

Ready. Click "Forward" to start the forward pass.

Truncated BPTT

In practice, documents can be thousands of words long. Full BPTT through the entire document would require storing all hidden states (memory) and multiplying gradients through thousands of steps (vanishing/exploding gradients). The solution: truncated BPTT. Break the document into chunks of ~35 words. Run forward through a chunk. Backpropagate only through that chunk. But carry the hidden state h forward to the next chunk (without backpropagating through it). This means the hidden state "sees" the full history, but gradients only flow back ~35 steps.

Why 35? It's a tradeoff. Shorter chunks mean less memory usage and faster backward passes, but the model can only learn dependencies within the chunk length. Longer chunks allow learning longer dependencies but use more memory (you must store all intermediate hidden states for backprop). The value 35 was found empirically to work well for word-level language models on standard benchmarks.

Chunk 1 (words 1–35)

Forward pass computes h₁…h₃₅. Backward pass flows gradients back through all 35 steps. h₃₅ is saved.

↓ carry h₃₅ (no gradient)

Chunk 2 (words 36–70)

Forward starts from h₃₅ (detached). Backward flows gradients back through steps 36–70 only. Can't reach step 1.

↓ carry h₇₀ (no gradient)

Chunk 3 (words 71–105)

And so on. Hidden state carries long-range context; gradients are short-range only.

This is a compromise: the model's forward pass has unlimited context (the hidden state carries information from the beginning of the document), but the backward pass is limited to chunk-length dependencies. The model can use long-range information but has difficulty learning long-range dependencies from scratch.

Memory Requirements

Full BPTT requires storing the hidden state at every timestep for the backward pass. For a sequence of length T with hidden_dim d and batch_size B:

Memory = T × B × d × 4 bytes (float32)

For T=1000, B=64, d=512: that's 1000 × 64 × 512 × 4 = 131 MB just for hidden states. Plus the embeddings, gradients, and optimizer states. Truncated BPTT with chunk size 35 reduces this by ~30x to a manageable 4.6 MB. This is why truncated BPTT is not optional for real training — it's a practical necessity.

The gradient from the last word must pass through 99 matrix multiplications to influence processing of the first word. Each multiplication involves the Jacobian of tanh(W_h h + ...), which depends on the eigenvalues of W_h. If those eigenvalues are less than 1, the gradient shrinks exponentially. That's the vanishing gradient problem — and it's the topic of the next chapter.

BPTT in Code

python
def train_step(model, sequence, optimizer):
    # sequence: [batch, seq_len+1] — input is seq[:-1], target is seq[1:]
    inputs = sequence[:, :-1]   # [batch, T]
    targets = sequence[:, 1:]   # [batch, T]

    optimizer.zero_grad()

    # Forward: RNN processes all T timesteps
    logits, _ = model(inputs)    # [batch, T, |V|]

    # Loss: cross-entropy at each position, averaged
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # [batch*T, |V|]
        targets.view(-1)                     # [batch*T]
    )

    # Backward: PyTorch unrolls the graph and backprops through time
    loss.backward()

    # Clip gradients to prevent explosions
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)

    optimizer.step()
    return loss.item()

The loss.backward() call does all the heavy lifting. PyTorch recorded every operation during the forward pass. Now it unrolls that computation graph backward, computing gradients at every timestep and accumulating them for the shared weights W_h and W_x. This is BPTT — the for loop over timesteps during the forward pass becomes a reversed for loop during backward.

Perplexity: Evaluating Language Models

How do we know if a language model is good? The standard metric is perplexity (PPL):

PPL = exp(L) = exp(−(1/T) ∑_t log P(w_t | context))

Perplexity is the exponentiated average cross-entropy loss. Intuitively, it measures how "surprised" the model is by the test data. A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 words at each position. Lower is better. A perfect model would have perplexity 1 (no surprise at all). State-of-the-art RNN language models in 2018 achieved perplexity ~60 on the Penn Treebank benchmark. Modern Transformers achieve ~15–20.

Perplexity by Example

Suppose our model processes the sentence "the cat sat" and assigns these probabilities to each word:

Position	True word	Model probability	−log(p)
1	"the"	0.07	2.66
2	"cat"	0.002	6.22
3	"sat"	0.05	3.00

Cross-entropy = (2.66 + 6.22 + 3.00) / 3 = 3.96
Perplexity = exp(3.96) = 52.5

The model is as uncertain as if it were choosing uniformly among ~53 words. Not terrible for a small model, but far from the human level of ~10 (humans are good at predicting the next word). A well-trained LSTM achieves perplexity ~60 on PTB, meaning it effectively narrows down each word to ~60 candidates. GPT-4 likely achieves perplexity <10 on most English text.

An important note: perplexity is only comparable across models evaluated on the same test set with the same vocabulary. A model with a 10K vocabulary will have lower perplexity than one with 50K simply because it has fewer options at each step. Always report vocabulary size alongside perplexity numbers.

Another common metric is bits per character (BPC), used for character-level models. BPC = cross-entropy loss / ln(2). A BPC of 1.2 means the model needs on average 1.2 bits to encode each character. The theoretical minimum for English text is about 1.0–1.3 BPC (Shannon's estimate). State-of-the-art character models achieve ~1.08 BPC.

Why must gradients travel through every timestep during BPTT?

Because each timestep uses a different set of weights Because weight sharing means the same W_h affects all timesteps, so gradients must accumulate across time Because the loss is only computed at the last timestep

Chapter 5: The Vanishing Gradient Problem

Multiply 0.9 by itself 100 times: 0.000027. Multiply 1.1 by itself 100 times: 13,781. That's the vanishing and exploding gradient problem in two numbers.

In BPTT, the gradient of the loss at timestep T with respect to the hidden state at timestep t involves a product of Jacobians:

∂L_T/∂h_t = ∂L_T/∂h_T · ∏_k=t+1^T ∂h_k/∂h_k−1

Each factor ∂h_k/∂h_k−1 is the Jacobian of the recurrence h_k = tanh(W_h h_k−1 + ...). For a simplified 1D case, this is just W_h · tanh'(z). Since tanh'(z) ≤ 1, if |W_h| < 1, each factor has magnitude < 1. Multiply T − t of these factors and the product decays exponentially.

More precisely, the product's magnitude is governed by the largest eigenvalue of the weight matrix W_h. If the largest eigenvalue λ_max < 1, gradients vanish. If λ_max > 1, gradients explode. The sweet spot λ_max = 1 is unstable — the slightest perturbation pushes you toward vanishing or exploding.

||∂h_T/∂h_t|| ≤ (λ_max)^T−t

What This Means in Practice

Consider the sentence: "The cat, which the family adopted from the shelter last summer because their children had been begging for a pet for years, was happy." The verb "was" must agree with the subject "cat" — 15 words earlier. For the RNN to learn this agreement, the gradient from the loss at "was" must travel 15 timesteps back to influence how "cat" is processed. If λ_max = 0.9, the gradient at "cat" is 0.9¹⁵ = 0.206 — already reduced by 80%. At 0.8, it's 0.8¹⁵ = 0.035. At 0.7, it's 0.005. The learning signal evaporates.

Below, explore this interactively. The left panel shows how a single value changes under repeated multiplication — adjust the eigenvalue slider to see exponential growth, stability, or decay. The right panel shows an unrolled RNN with gradient arrows whose thickness represents gradient magnitude at each timestep.

Vanishing and Exploding Gradients

Drag the eigenvalue slider. Left panel: repeated multiplication curve. Right panel: gradient thickness flowing back through the RNN. At 0.9, early gradients vanish. At 1.1, they explode.

Eigenvalue λ0.90

Gradient Clipping: Fixing Explosions

Exploding gradients have a simple fix: gradient clipping. If the gradient norm exceeds a threshold, scale it down.

g ← (threshold / ||g||) · g if ||g|| > threshold

This caps the step size without changing the direction. Typical threshold: 1.0 or 5.0. It's a standard practice for any RNN training and only adds one line of code:

python
# After loss.backward(), before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)

But vanishing gradients have no such simple fix. You can't amplify a gradient that's already zero — you've lost the signal entirely. The information about what happened 50 timesteps ago is gone. The network has "forgotten." Solving this requires a fundamentally different architecture — one where information can flow through time without being multiplied by W_h at every step.

The Math Behind the Decay

Let's be precise about why the gradient vanishes. The recurrence is h_t = tanh(W_h h_t−1 + W_x x_t + b). The Jacobian of h_t with respect to h_t−1 is:

∂h_t/∂h_t−1 = diag(1 − h_t²) · W_h

The diag(1 − h_t²) term comes from the derivative of tanh. Since tanh outputs values in (−1, 1), the diagonal entries are all in (0, 1). This means each Jacobian has norm less than ||W_h||. Over T timesteps, the product of T such Jacobians has norm bounded by ||W_h||^T (approximately).

If ||W_h|| < 1, the product goes to zero. If ||W_h|| > 1, the product goes to infinity. There's no middle ground — the boundary between vanishing and exploding is a knife edge.

Evidence from Real Models

Bengio et al. (1994) proved this theoretically and confirmed it experimentally. They showed that a vanilla RNN trained on a simple sequence memorization task could reliably learn dependencies up to about 10 timesteps. Beyond 20 timesteps, performance dropped to chance. The network literally could not propagate information that far.

This has practical consequences. An RNN processing a document can't connect a pronoun to its referent if they're more than ~10 words apart. It can't learn that the subject of a long sentence determines the verb conjugation at the end. It can't track that a question mark at position 1 means the sentence is a question at position 30. Short-range patterns: yes. Long-range dependencies: no.

A Worked Gradient Calculation

Let's trace the gradient through 3 timesteps to make this concrete. Consider a simple 1D RNN: h_t = tanh(w_h h_t−1 + w_x x_t), with w_h = 0.8 and all hidden states near 0 (so tanh' ≈ 1).

The gradient of h₃ with respect to h₀ is:

∂h₃/∂h₀ = (∂h₃/∂h₂) · (∂h₂/∂h₁) · (∂h₁/∂h₀) = w_h · w_h · w_h = 0.8³ = 0.512

After 3 steps, the gradient has already lost half its magnitude. Extend to 10 steps:

∂h₁₀/∂h₀ = 0.8¹⁰ = 0.107

After 10 steps: 89% of the gradient signal is lost. After 50 steps: 0.8⁵⁰ = 0.000014. The learning signal for long-range dependencies is 70,000 times weaker than for short-range dependencies. The network will learn local patterns (3–5 words apart) quickly, and never learn distant patterns at all.

Now consider w_h = 1.1 (slightly greater than 1):

1.1¹⁰ = 2.59, 1.1⁵⁰ = 117, 1.1¹⁰⁰ = 13,781

The gradient explodes. Weight updates become enormous, weights fly to infinity, and training diverges. You'll see NaN in your loss and the model is destroyed. Gradient clipping prevents this, but the vanishing problem remains unsolved by clipping.

The slider above shows one eigenvalue at a time. Now build the actual gradient-flow product and plot the decay across 60 timesteps — for the vanilla RNN against the LSTM's additive path you'll meet in Chapter 6.

This is why an RNN can't connect "cat" to a verb 15 words later. The gradient passes through 15 matrix multiplications. By then, it's essentially zero. The network can learn local patterns (within 5–8 words) but is blind to long-range dependencies. This was the central unsolved problem in NLP from 1990 to 1997.

If the largest eigenvalue of W_h is 0.8, what happens to the gradient after 50 timesteps?

Exponential decay: 0.8⁵⁰ ≈ 0.000014 — effectively zero Linear decay to 0.8 × 50 = 40 It stays constant at 0.8

Chapter 6: LSTM & GRU

What if the network could choose what to remember? Instead of blindly multiplying by W_h at every step, what if it had gates — learned valves that open and close to control information flow? That's the Long Short-Term Memory (LSTM) cell, invented by Hochreiter and Schmidhuber in 1997.

The key innovation is the cell state c_t — a separate memory track that runs alongside the hidden state h_t. Information can be added to or removed from the cell state via gates, but the default operation is addition, not multiplication. This is the crucial difference.

The Three Gates

An LSTM has three gates, each a sigmoid layer that outputs values between 0 (fully closed) and 1 (fully open):

Forget gate f_t: decides what to erase from the cell state. "Should I forget the old subject now that a new clause has started?"

f_t = σ(W_f [h_t−1, x_t] + b_f)

Input gate i_t: decides what new information to write. "Is this word important enough to store?"

i_t = σ(W_i [h_t−1, x_t] + b_i)

Output gate o_t: decides what to read out from the cell state. "What part of my memory is relevant right now?"

o_t = σ(W_o [h_t−1, x_t] + b_o)

The Cell State Update

First, compute a candidate new value to potentially write:

c̃_t = tanh(W_c [h_t−1, x_t] + b_c)

Then update the cell state by forgetting some old information and adding some new:

c_t = f_t ⊙ c_t−1 + i_t ⊙ c̃_t

And compute the hidden state (the output of this cell):

h_t = o_t ⊙ tanh(c_t)

The symbol ⊙ means element-wise multiplication. Each gate independently controls each dimension of the cell state. This is important: gate values are not scalars. Each dimension of the cell state has its own forget/input/output gate value. The network can choose to forget information in dimension 3 while retaining it in dimension 47.

A Worked LSTM Step

Let's trace one step with a 2D cell state (tiny, but reveals the mechanics). Suppose:

h_t−1 = [0.3, −0.5] c_t−1 = [0.8, 0.2] x_t = embedding of "sat"

After applying the gate formulas (with learned weights), suppose we get:

f_t = [0.9, 0.3] i_t = [0.1, 0.8] o_t = [0.7, 0.6] c̃_t = [0.5, −0.4]

Read this: "Keep 90% of cell dimension 0 (where we stored the subject 'cat'). Forget 70% of dimension 1 (less important info). Write new info mostly to dimension 1." Now compute:

c_t = f_t ⊙ c_t−1 + i_t ⊙ c̃_t
= [0.9 × 0.8, 0.3 × 0.2] + [0.1 × 0.5, 0.8 × (−0.4)]
= [0.72, 0.06] + [0.05, −0.32] = [0.77, −0.26]

Dimension 0 went from 0.8 to 0.77 — almost unchanged. The forget gate (0.9) preserved the stored subject information. Dimension 1 went from 0.2 to −0.26 — mostly overwritten with new information about "sat." This selective memory is what makes LSTMs powerful.

c_t = f_t ⊙ c_t−1 + i_t ⊙ c̃_t. This is ADDITION, not multiplication by W_h. If the forget gate is 1.0 and the input gate is 0.0, the cell state passes through unchanged: c_t = c_t−1. Gradients flow along this additive path without repeated matrix multiplication. That's how LSTMs solve the vanishing gradient problem.

Interactive LSTM Cell

Below, step through words one at a time and watch the three gates activate. The gate "fill levels" show their activations (0 = closed/empty, 1 = open/full). The cell state bar tracks long-term memory. Compare to a vanilla RNN: the LSTM selectively remembers and forgets.

LSTM Cell: Gates and Memory

Click "Next Word" to advance through a sentence. Watch the forget, input, and output gates open and close. The cell state bar shows long-term memory strength — compare with a vanilla RNN that forgets rapidly.

Step 0 — cell state and hidden state initialized.

GRU: A Simpler Alternative

The Gated Recurrent Unit (GRU), proposed by Cho et al. in 2014, simplifies the LSTM to two gates and no separate cell state. It's cheaper to compute and often performs comparably.

Feature	LSTM	GRU
Gates	3 (forget, input, output)	2 (reset, update)
Cell state	Separate c_t track	No separate track — uses h_t only
Parameters	4 × (d_h² + d_h × d_x)	3 × (d_h² + d_h × d_x)
Typical use	Default for sequences	When compute is limited
Performance	Slight edge on long sequences	Comparable on most tasks

GRU equations for reference:

r_t = σ(W_r [h_t−1, x_t]) (reset gate)
z_t = σ(W_z [h_t−1, x_t]) (update gate)
h̃_t = tanh(W [r_t ⊙ h_t−1, x_t])
h_t = (1 − z_t) ⊙ h_t−1 + z_t ⊙ h̃_t

The update gate z_t plays a dual role: it controls both how much old state to keep (1 − z_t) and how much new state to write (z_t). When z_t = 0, the hidden state passes through unchanged. When z_t = 1, the old state is completely replaced. The reset gate r_t controls how much of the old hidden state to use when computing the candidate.

Why Gates Work: The Gradient Highway

Let's trace the gradient through the LSTM cell state update. Recall c_t = f_t ⊙ c_t−1 + i_t ⊙ c̃_t. The gradient of c_T with respect to c_t (many steps earlier) is:

∂c_T/∂c_t = ∏_k=t+1^T f_k

Compare this to the vanilla RNN gradient ∏ W_h · diag(tanh'). The LSTM gradient is just a product of scalar forget gate values, each between 0 and 1. If the forget gates are close to 1 (meaning "keep everything"), the gradient flows almost without decay. No matrix multiplication, no eigenvalue issues.

This is why the forget gate bias is often initialized to a positive value (1.0 or 2.0) — to ensure the gates start open and gradients flow. Gers et al. (2000) showed this initialization trick dramatically improved LSTM training. The forget gate bias is one of those small details that makes a huge practical difference.

Peephole Connections

In the standard LSTM, gates only see h_t−1 and x_t. They don't see the cell state c_t−1 directly. Peephole connections (Gers & Schmidhuber, 2000) add c_t−1 as an input to the gates:

f_t = σ(W_f [h_t−1, x_t] + w_cf ⊙ c_t−1 + b_f)

This lets the gates "peek" at the cell state when deciding what to forget or remember. Peephole connections help on tasks requiring precise timing (e.g., counting timesteps), but they add parameters and are not used in most modern implementations. PyTorch's nn.LSTM does not include peephole connections by default.

LSTM vs Vanilla RNN: Parameter Count

An LSTM has 4 sets of weight matrices (one for each of the 3 gates + the candidate), while a vanilla RNN has just 1. For hidden_dim = 512 and embed_dim = 256:

Model	Weight Matrices	Total Parameters (approx)
Vanilla RNN	W_h [512×512] + W_x [512×256]	393K
LSTM	4 × (W [512×768]) = 4 sets	1.57M (4x more)
GRU	3 × (W [512×768]) = 3 sets	1.18M (3x more)

LSTMs are ~4x more expensive per step than vanilla RNNs. But they work on sequences where vanilla RNNs fail. In practice, a 2-layer LSTM with 650 hidden units was the state-of-the-art language model architecture from 2016–2018 (Zaremba et al., Merity et al.).

Training Tricks for LSTM Language Models

The best LSTM language models use several regularization and architectural tricks that are worth knowing:

Weight tying (Press & Wolf, 2017): The embedding matrix E [|V| × d] and the output weight matrix W_y [d × |V|] serve symmetric roles — one maps word IDs to vectors, the other maps vectors back to word IDs. Sharing (tying) these weights halves the model's parameter count and acts as regularization. In code: model.output.weight = model.embed.weight.

Variational dropout (Gal & Ghahramani, 2016): Standard dropout applies a different random mask at every timestep. This breaks the temporal structure — the hidden state is corrupted unpredictably at each step. Variational dropout uses the same mask at every timestep within a sequence. This preserves temporal coherence while still regularizing.

Weight dropout (Merity et al., 2018): Instead of dropping activations, drop random weights in W_h before the forward pass. This is equivalent to a random perturbation of the recurrence matrix and provides strong regularization without disrupting hidden state dynamics.

ASGD (averaged SGD): After initial training with SGD, switch to averaged SGD, which maintains a running average of the weights over training steps. The averaged weights often generalize better than the final weights. This trick gave a 2–3 point perplexity improvement in the AWD-LSTM paper.

python
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=num_layers, dropout=0.3,
                            batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)
        self.drop = nn.Dropout(0.3)

    def forward(self, x, hidden=None):
        # x: [batch, seq_len]
        e = self.drop(self.embed(x))          # [batch, seq_len, embed_dim]
        out, hidden = self.lstm(e, hidden)     # out: [batch, seq_len, hidden_dim]
        logits = self.proj(self.drop(out))     # [batch, seq_len, vocab_size]
        return logits, hidden

What lets LSTMs avoid the vanishing gradient problem that plagues vanilla RNNs?

Using larger weight matrices The additive cell state update (c_t = f ⊙ c_t-1 + i ⊙ c̃) instead of multiplicative hidden state Training with a higher learning rate

Chapter 7: RNN Language Model

Time to see it all come together. Below is a character-level RNN language model running live in your browser. It reads one character at a time, updates its hidden state, and outputs a probability distribution over the next character. You can watch the hidden state evolve, see how probabilities shift, and control the "creativity" of generation with a temperature slider.

This is the exact architecture Andrej Karpathy demonstrated in "The Unreasonable Effectiveness of Recurrent Neural Networks" — a simple character-level LSTM that, given enough training data, can generate surprisingly coherent text. Our version uses pre-computed weights trained on a small English corpus.

How Temperature Works

Before sampling, we divide the logits by a temperature parameter T:

P(w_i) = exp(z_i/T) / ∑_j exp(z_j/T)

T → 0: The distribution becomes a spike on the most likely character. Output is deterministic but repetitive. The model picks the safest option every time.

T = 1.0: The original distribution. The model samples according to its learned probabilities. This is the "true" model.

T → ∞: The distribution becomes uniform. Every character is equally likely. Output is random noise.

Temperature 0.2: safe but repetitive. Temperature 2.0: creative but nonsensical. The sweet spot around 0.8–1.0 is where language "feels" real. Every text generation system — including ChatGPT — has this temperature parameter.

The canvas below has three sections. The top shows the seed text and generated output. The middle displays the hidden state as a colored heatmap strip (each cell is one hidden unit). The bottom shows probability bars for the top-10 next characters. Click "Generate" to start auto-generating. Use the temperature slider to control creativity. Pause and resume at any time.

Live Character-Level RNN

Type a seed, then click "Generate" to watch the RNN extend it character by character. The hidden state heatmap and probability bars update in real time. Adjust temperature to control creativity.

Temperature0.8

Speed (ms)150

Try different temperatures. At 0.2, the model repeats common patterns ("the the the..."). At 1.0, it produces mostly-English-like text with occasional errors. At 2.0, it becomes creative but makes frequent mistakes. This temperature-quality tradeoff is universal across all language models.

What the Hidden State Encodes

Watch the heatmap as the model generates. Some hidden units "light up" consistently after spaces (word boundary detectors). Others activate inside quotes. Others track whether we're in the middle of a word or at the start. The network discovered these features automatically through training — nobody programmed them.

Karpathy found individual hidden units that tracked parenthesis depth, quote state, line position, and even whether the current text was a comment in code. The RNN learned useful representations of text structure from raw characters alone.

Why Character-Level?

Word-level language models have a fixed vocabulary. If the model has never seen "TensorFlow" during training, it can't generate or process it. Character-level models work with just ~100 characters (letters, digits, punctuation). They can generate any word — including novel ones — character by character.

The tradeoff: character-level models need much longer sequences to capture the same context. "The cat sat on the mat" is 6 tokens at the word level but 26 characters. The RNN must carry information through 4x more timesteps. This makes the vanishing gradient problem even worse — which is why character-level models really need LSTMs, not vanilla RNNs.

Aspect	Word-Level	Character-Level
Vocabulary	50,000+ words	~100 characters
OOV problem	Yes — unseen words fail	No — can generate any string
Sequence length	Short (20-50 steps)	Long (100-500 steps)
Output layer	Huge softmax (50K classes)	Tiny softmax (~100 classes)
Training	Faster per epoch	Slower (longer sequences)
Modern replacement	Subword tokenization (BPE) — the best of both worlds

The Training Objective

The model is trained with cross-entropy loss at each timestep: given the true next character, maximize its log probability. In code:

python
# Training loop for char-level RNN
for batch in data_loader:
    inputs, targets = batch       # [batch, seq_len] character indices
    logits, hidden = model(inputs, hidden)

    # Reshape for cross-entropy: flatten time dimension
    loss = F.cross_entropy(
        logits.view(-1, vocab_size),  # [batch*seq_len, vocab_size]
        targets.view(-1)               # [batch*seq_len]
    )

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
    optimizer.step()
    hidden = hidden.detach()  # truncated BPTT: stop gradient at chunk boundary

The hidden.detach() call is crucial. It carries the hidden state forward (so the model sees long-range context) but cuts the gradient (so backprop only flows through the current chunk). This is truncated BPTT in action.

What Makes a Good Character-Level Model

The quality of generated text depends heavily on three factors:

1. Training data quality and quantity. Karpathy trained on 1MB of Shakespeare (~500K characters). The model learned iambic pentameter, character names, stage directions. With Linux source code, it learned to open and close brackets, write comments, and use proper indentation. The model mirrors its training data's style with surprising fidelity.

2. Hidden state size. A hidden state of 64 units can learn basic word patterns but not long-range structure. 256 units can track sentence-level structure. 512–1024 units (the typical range for serious char-level models) can learn paragraph-level coherence and even some semantic patterns.

3. Architecture depth. A single-layer LSTM generates locally coherent text. A 2–3 layer stacked LSTM generates text with better long-range structure, proper paragraph transitions, and more consistent style. Beyond 3 layers, gains diminish and training becomes difficult.

The simplest useful language model is a 2-layer LSTM with 650 hidden units, trained on ~100MB of text. This was the architecture used by Zaremba et al. (2014) to set PTB benchmarks that stood for 3 years. With proper regularization (dropout, weight tying), this tiny model achieves perplexity ~82 on PTB — competitive with 5-gram models that store billions of counts.

Chapter 8: The Attention Preview

Translate: "The agreement on the European Economic Area was signed in August 1992." The LSTM encoder reads all 13 words, updating its hidden state at each step. After the last word, it passes the final hidden state to the decoder. All 13 words of meaning, syntax, entities, and structure — crammed into one vector of 512 numbers.

This is the encoder-decoder architecture (also called sequence-to-sequence or seq2seq) for tasks where input and output are both sequences but of different lengths. Sutskever et al. (2014) showed this works for machine translation, dialogue, and summarization. The encoder reads the source sentence and produces a fixed-size vector. The decoder generates the target sentence one word at a time, conditioned on that vector.

The encoder is an LSTM that reads the input left-to-right. Its final hidden state h_T is called the context vector or thought vector — a compressed representation of the entire input. The decoder is a separate LSTM initialized with this context vector. At each step, it predicts the next output word given the context and its own previous outputs.

python
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_embed, d_hidden):
        super().__init__()
        self.encoder = nn.LSTM(d_embed, d_hidden, batch_first=True)
        self.decoder = nn.LSTM(d_embed, d_hidden, batch_first=True)
        self.src_embed = nn.Embedding(src_vocab, d_embed)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_embed)
        self.output = nn.Linear(d_hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encode: [batch, src_len] → final hidden state
        _, (h, c) = self.encoder(self.src_embed(src))
        # Decode: use encoder's final state as initial state
        out, _ = self.decoder(self.tgt_embed(tgt), (h, c))
        return self.output(out)  # [batch, tgt_len, tgt_vocab]

The bottleneck is obvious: no matter how long the input is, the decoder only gets one vector (h, c). A 5-word sentence? One vector. A 50-word sentence? Same vector. Information gets crushed. In practice, encoder-decoder models with LSTMs work well for short sentences (~20 words) but degrade rapidly on longer inputs. Cho et al. (2014) showed BLEU scores dropping dramatically as source sentence length increased beyond 20 words — the model just couldn't compress that much information.

Sutskever et al. discovered a useful trick: reversing the input sentence. Instead of feeding "the cat sat" as [the, cat, sat], they fed it as [sat, cat, the]. This way, the first words of the input (which are most important for generating the first words of the output) are closest in time to the decoder. This reduced the average distance between corresponding input and output words and improved BLEU scores by several points. But it's a hack — it doesn't solve the fundamental bottleneck.

The Information Bottleneck in Numbers

How much information can a 512-dimensional vector hold? In floating-point representation, each number carries about 23 bits of precision. So a 512-dim vector holds roughly 512 × 23 = 11,776 bits. A typical English sentence has about 5 bits per character (Shannon's estimate), or about 50 bits per word. That means a 512-dim vector can theoretically encode about 235 words of information — enough for short sentences.

But the LSTM doesn't achieve optimal compression. The hidden state must encode not just the meaning of the words but their syntactic structure, word order, entity relationships, and discourse context. In practice, Cho et al. (2014) found performance degrading severely beyond 20 words. The effective capacity is far less than the theoretical maximum.

Attention solves this by providing a "shortcut" — the decoder doesn't need to extract everything from one vector because it can access the encoder states directly. The fixed-size bottleneck becomes an optional summary rather than the sole channel of information.

The Bottleneck Visualized

Below, drag the sentence length slider to see how the "compression pressure" changes. At 5 words, the bottleneck is comfortable. At 30 words, it's glowing red — too much information for one vector. Then click "Toggle Attention" to see what happens when the decoder can look directly at every encoder position instead of relying on a single summary vector.

Encoder-Decoder Bottleneck vs. Attention

Slide to change sentence length. Watch the bottleneck pressure indicator. Click "Toggle Attention" to add direct connections from decoder to all encoder positions.

Sentence length10

What Is Attention?

The idea (Bahdanau et al., 2015): instead of compressing the entire input into one vector, let the decoder look back at every encoder hidden state. At each decoding step, compute a weighted sum of all encoder hidden states, where the weights represent "how much should I focus on each input word right now?"

α_t,s = softmax(score(h_t^dec, h_s^enc))

context_t = ∑_s α_t,s · h_s^enc

At timestep t, the decoder computes a score between its current state h_t^dec and every encoder state h_s^enc. These scores are softmaxed into attention weights α. The context vector is a weighted average of encoder states. The decoder uses this context vector (plus its own state) to predict the next word.

This changes everything. When translating "Area" into French, the decoder can attend directly to "Area" and "Economic" in the source, ignoring "August" and "1992." Each output word attends to exactly the input words that are relevant to it. No more bottleneck.

Instead of compressing everything into one vector, let the decoder LOOK BACK at every input position. This is what Vaswani et al. (2017) built an entire architecture around — removing the RNN completely and using only attention. That architecture is the Transformer, and it powers every large language model today.

From RNN+Attention to Transformer

The progression is clean:

Architecture	Year	Context Mechanism	Limitation
N-gram	1990s	Count previous n−1 words	Sparsity, no generalization
Neural LM	2003	Fixed window of embeddings	Window size is fixed
RNN	2010	Hidden state carries memory	Vanishing gradients, sequential
LSTM/GRU	1997/2014	Gated cell state + hidden state	Still sequential, bottleneck at long range
RNN + Attention	2015	Look back at all encoder states	Still sequential computation
Transformer	2017	Self-attention — attention only, no RNN	O(n²) in sequence length

The Transformer took the attention mechanism and said: "Why do we need the RNN at all? What if attention IS the architecture?" By removing the sequential dependency, Transformers can process all positions in parallel on GPUs. Training time dropped by an order of magnitude. That's what unlocked GPT, BERT, and every large language model since.

Attention in Code

The core attention computation is surprisingly simple:

python
def attention(decoder_hidden, encoder_outputs):
    # decoder_hidden: [batch, d_h]        — current decoder state
    # encoder_outputs: [batch, src_len, d_h] — all encoder hidden states

    # Score: dot product between decoder state and each encoder state
    scores = torch.bmm(
        encoder_outputs,                            # [batch, src_len, d_h]
        decoder_hidden.unsqueeze(2)                  # [batch, d_h, 1]
    ).squeeze(2)                                    # [batch, src_len]

    # Normalize to attention weights
    attn_weights = F.softmax(scores, dim=1)          # [batch, src_len]

    # Weighted sum of encoder outputs
    context = torch.bmm(
        attn_weights.unsqueeze(1),                    # [batch, 1, src_len]
        encoder_outputs                              # [batch, src_len, d_h]
    ).squeeze(1)                                     # [batch, d_h]

    return context, attn_weights

Three lines of math: score, softmax, weighted sum. That's it. This simple mechanism solved the bottleneck problem and opened the door to Transformers. In the next lecture (L05), we'll see how self-attention — where the query, key, and value all come from the same sequence — replaces recurrence entirely.

Three lines — so build them. Below you implement the score and the softmax, and the attention matrix over a real source/target sentence pair lights up to show which input word each output word looks at.

Types of Attention Scoring

The dot product score above is just one option. The three main scoring functions are:

Name	Formula	Notes
Dot product	h_dec^T h_enc	Simplest, fastest. Requires same dimensionality.
Multiplicative	h_dec^T W_a h_enc	Learnable W_a adds flexibility. Luong (2015).
Additive	v_a^T tanh(W₁h_dec + W₂h_enc)	Most flexible. Bahdanau (2015). Two matrices + a vector.

The Transformer uses scaled dot product attention: divide the dot product by √d_k to prevent the softmax from becoming too peaked in high dimensions. This subtle scaling is critical for training stability.

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Where Q (queries), K (keys), and V (values) are all derived from the input. In self-attention, Q, K, and V all come from the same sequence, transformed by different learned weight matrices. This allows each position to attend to every other position — and all positions are processed in parallel. No sequential bottleneck, no vanishing gradients across distance, just direct connections. That's the revolution that the next lecture covers.

What limitation of encoder-decoder RNNs does attention solve?

The fixed-size bottleneck — the decoder no longer has to rely on a single compressed vector The vanishing gradient problem — attention makes gradients stable The vocabulary size limit — attention allows unlimited vocabulary

Chapter 9: Connections

We've traced the evolution from the simplest possible language model (counting word pairs) through neural language models, recurrent neural networks, and gated architectures. Each solved a problem the previous one couldn't — and each introduced a new limitation that drove the next innovation.

The Evolution

N-grams

Count-based. Sparsity and no generalization.

↓

Neural LMs (Bengio 2003)

Embeddings solve sparsity. Fixed window is the bottleneck.

↓

RNNs (Elman 1990)

Hidden state carries memory. Vanishing gradients limit range.

↓

LSTM (Hochreiter 1997) / GRU (Cho 2014)

Gated memory solves vanishing gradients. Still sequential.

↓

Attention (Bahdanau 2015)

Direct access to all encoder states. Still needs RNN backbone.

↓

Transformer (Vaswani 2017)

Attention IS the architecture. Fully parallel. Current paradigm.

Seminal Papers

These papers provide the theoretical and experimental foundation for this lesson:

Learning Long-Term Dependencies with Gradient Descent is Difficult (Bengio et al., 1994) — The paper that formally proved why vanilla RNNs can't learn long-range dependencies. Showed that the gradient signal decays (or explodes) exponentially with the number of timesteps, making it theoretically impossible for gradient-based methods to connect distant events in a sequence.

On the Difficulty of Training Recurrent Neural Networks (Pascanu et al., 2013) — A more detailed analysis of the gradient flow problem, with practical solutions including gradient clipping and the echo state property. Showed that the problem is fundamentally about the spectral radius of the recurrence weight matrix.

Attention Is All You Need (Vaswani et al., 2017) — The paper that replaced RNNs entirely with self-attention. Showed that attention alone (no recurrence, no convolution) achieves state-of-the-art on machine translation while training orders of magnitude faster. The birth of the Transformer.

Where RNNs Still Appear

RNNs have been largely replaced by Transformers for language tasks, but they're not dead:

Domain	Why RNNs/LSTMs persist
Edge devices	LSTM models are tiny (1-10M params) and can run on microcontrollers
Time series	Real-time sensor data, financial forecasting — streaming, sequential nature fits RNNs
Reinforcement learning	LSTM policies are common in robotics and game agents (smaller state spaces)
State space models	Mamba (Gu & Dao, 2023) recasts recurrence as selective SSMs — linear-time attention alternative

What We Didn't Cover

Topic	Why It Matters	Where to Learn More
Bidirectional RNNs	Read left-to-right AND right-to-left, concat both hidden states. Better representations but can't be used for language generation.	CS224N L05 (ELMo, BERT)
Subword tokenization	BPE and WordPiece solve the OOV problem without going to character-level. Used by all modern models.	Tokenization deep-dive
Dropout for RNNs	Naive dropout breaks temporal coherence. Variational dropout (same mask at every timestep) is required.	Gal & Ghahramani, 2016
State space models	Mamba, S4, and linear attention bring back the recurrent paradigm with O(n) complexity and no vanishing gradients.	SSM & Mamba lesson

Next Steps

← L03: Backpropagation and Neural Networks — The training algorithm that makes all of this possible.

L05: Transformers and Self-Attention → — Removing recurrence entirely. Attention is all you need. (Coming soon)

Performance Timeline

Penn Treebank (PTB) perplexity scores show the progression clearly:

Year	Model	Test Perplexity
2003	KN 5-gram (count-based)	141
2012	RNN-LM (Mikolov)	124
2014	LSTM (Zaremba, large)	82
2017	AWD-LSTM (Merity)	58
2018	Transformer-XL	24
2019	GPT-2 (1.5B)	~18

Each generation — n-grams to RNNs to LSTMs to Transformers — brought a dramatic improvement. The jump from LSTM to Transformer was particularly striking because it came not from a better learning algorithm, but from a better architecture: one that could be parallelized on GPUs and scale to billions of parameters.

Notice the acceleration: count-based models dominated for ~15 years (1990–2005). Neural LMs for ~7 years (2003–2010). Vanilla RNNs for ~4 years (2010–2014). LSTMs for ~3 years (2014–2017). Transformers have dominated since 2017 — but even within Transformers, the pace of change is extraordinary: BERT (2018), GPT-2 (2019), GPT-3 (2020), PaLM/Chinchilla (2022), GPT-4 (2023), Llama 3 (2024). Each brought qualitative leaps in capability.

The Key Insight From Each Era

Era	Key Insight
N-grams	Language has statistical regularities that can be captured by counting.
Neural LMs	Words have continuous representations; similar words should share predictions.
RNNs	Networks can have memory; the same computation applied at every step can process variable-length sequences.
LSTMs	Additive memory updates (instead of multiplicative) allow information to persist over long distances.
Attention	Direct connections between distant positions are better than passing information through a chain.
Transformers	Attention alone is sufficient. Remove recurrence and everything can be parallelized.

Each of these insights seems obvious in retrospect. That's the mark of a great idea — it's hard to see before someone shows you, and impossible to unsee after.

Why RNNs Still Matter Pedagogically

Even though Transformers have replaced RNNs for most language tasks, understanding RNNs is essential for three reasons:

First, RNNs are the simplest model that handles variable-length sequences. They strip away the complexity of multi-head attention, positional encodings, and layer normalization to reveal the core challenge: how do you carry information through a sequence?

Second, the vanishing gradient problem is not specific to RNNs. It appears in any deep computation graph. Understanding it in the RNN context — where it's most dramatic and easiest to visualize — builds intuition that transfers to understanding residual connections, gradient highways, and normalization in Transformers.

Third, the gate mechanism invented for LSTMs reappears everywhere in modern deep learning. Mixture-of-experts routing uses gating. Highway networks use gates. The GRU-like update in state space models (Mamba) is a direct descendant of the LSTM update gate. Understanding gates here means understanding them everywhere.

The history of NLP is not a sequence of dead ends. It's a sequence of insights, each building on the last. N-grams taught us that statistics matter. Neural LMs taught us that representations matter. RNNs taught us that memory matters. LSTMs taught us that selective memory matters. Attention taught us that direct access beats sequential propagation. And Transformers taught us that scale matters. Every lesson in that chain is load-bearing.

A Quote to Remember

"The Unreasonable Effectiveness of Recurrent Neural Networks" — Andrej Karpathy's famous 2015 blog post showed that a character-level LSTM, given enough text, could generate plausible Shakespeare, Wikipedia articles, LaTeX source code, and C code. The hidden state — just 512 numbers updated one character at a time — somehow learned the structure of these complex formal languages. RNNs may be superseded, but they remain one of the most beautiful demonstrations that simple recursive computation can produce emergent complexity.