The first neural architecture that could remember — and why remembering turns out to be harder than it sounds.
Suppose you're reading this sentence word by word. Each word changes your understanding of what comes next. "The dog sat on the ___" — your brain screams "mat" or "couch" because it remembers everything that came before.
A normal neural network sees one input and produces one output. Show it the word "the" and it has no idea what came before. It has no memory. But language, music, stock prices, video — they're all sequences where the past shapes the future.
A feedforward network sees each letter in isolation. It can't predict the next character because it doesn't know what came before. Click to step through a word.
The fix is beautifully simple: give the network a hidden state — a vector it updates at every time step and passes to itself. The network reads one element, updates its memory, reads the next, updates again. This is a recurrent neural network.
Think of the hidden state as a notepad the network carries with it. At each time step, it reads the current input, glances at its notepad, decides what to output, then updates the notepad for the next step.
More precisely, the hidden state $h_t$ is a vector of numbers that summarizes what the network has seen so far. At time step $t$, the network combines the current input $x_t$ with the previous hidden state $h_{t-1}$ to produce a new hidden state.
Watch the hidden state evolve as the network reads each character. The bar chart shows which "memory slots" activate for each letter.
The hidden state size is a design choice. A 128-dimensional hidden state has 128 "memory slots." More slots means more capacity to remember, but also more parameters to train.
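To make the capacity-versus-cost trade-off concrete, here is a rough parameter count. It is a sketch that assumes the vanilla layout with input-to-hidden, hidden-to-hidden, and hidden-to-output matrices plus two bias vectors; the 27-symbol vocabulary is a hypothetical choice for illustration.

```python
def rnn_param_count(input_dim, hidden_dim, output_dim):
    """Count trainable parameters of a vanilla RNN with two bias vectors."""
    w_xh = hidden_dim * input_dim     # input-to-hidden weights
    w_hh = hidden_dim * hidden_dim    # hidden-to-hidden weights
    w_hy = output_dim * hidden_dim    # hidden-to-output weights
    biases = hidden_dim + output_dim  # one bias per hidden unit and per output
    return w_xh + w_hh + w_hy + biases

# Hypothetical vocabulary of 27 symbols (26 letters plus space)
for hidden_dim in (64, 128, 256):
    print(hidden_dim, rnn_param_count(27, hidden_dim, 27))
```

The hidden-to-hidden term grows with the square of the hidden size, so doubling the number of memory slots roughly quadruples that part of the model.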
The simplest recurrent network uses just three weight matrices and a nonlinearity. That's it. Here's the entire forward pass:

$$h_t = \tanh(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h)$$

$$y_t = W_{hy}\, h_t + b_y$$
Let's break each piece down:
| Symbol | Shape | Meaning |
|---|---|---|
| $x_t$ | [input_dim] | Current input (e.g., one-hot character) |
| $h_{t-1}$ | [hidden_dim] | Previous hidden state (memory) |
| $W_{xh}$ | [hidden_dim, input_dim] | "How should I read the input?" |
| $W_{hh}$ | [hidden_dim, hidden_dim] | "How should I use my memory?" |
| $W_{hy}$ | [output_dim, hidden_dim] | "What should I output?" |
| tanh | — | Squashes values to [-1, 1] |
Watch data flow through the three matrices. The tanh squashes the combined signal into [-1, 1].
```python
import numpy as np

class VanillaRNN:
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Small random weights, zero biases
        self.Wxh = np.random.randn(hidden_dim, input_dim) * 0.01
        self.Whh = np.random.randn(hidden_dim, hidden_dim) * 0.01
        self.Why = np.random.randn(output_dim, hidden_dim) * 0.01
        self.bh = np.zeros(hidden_dim)
        self.by = np.zeros(output_dim)

    def step(self, x, h_prev):
        # Combine the current input with the previous memory, then squash
        h = np.tanh(self.Wxh @ x + self.Whh @ h_prev + self.bh)
        y = self.Why @ h + self.by
        return h, y  # new hidden state, output logits
```
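To see the hidden state actually carry information, the step above can be applied in a loop over a short string. The sketch below re-declares the update as a standalone function so it runs on its own; the three-letter vocabulary, the hidden size of 4, and the helper name `encode` are hypothetical choices for illustration.

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, bh):
    """One recurrent update: mix the current input with the previous hidden state."""
    return np.tanh(Wxh @ x + Whh @ h_prev + bh)

def encode(text, vocab, Wxh, Whh, bh):
    """Run the recurrence over a string and return the final hidden state."""
    h = np.zeros(Whh.shape[0])        # empty memory before the first character
    for ch in text:
        x = np.zeros(len(vocab))
        x[vocab.index(ch)] = 1.0      # one-hot encoding of the character
        h = rnn_step(x, h, Wxh, Whh, bh)
    return h

rng = np.random.default_rng(0)
vocab = list("dgo")
hidden_dim = 4                        # tiny, for readability
Wxh = rng.normal(size=(hidden_dim, len(vocab)))
Whh = rng.normal(size=(hidden_dim, hidden_dim))
bh = np.zeros(hidden_dim)

h_dog = encode("dog", vocab, Wxh, Whh, bh)
h_god = encode("god", vocab, Wxh, Whh, bh)
print(h_dog)
print(h_god)  # same letters, different order, different memory
```

A model that looked at each letter in isolation could not tell these two strings apart; the recurrence can, because the order of the updates changes the final state.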
Training a normal network means computing gradients with backpropagation. Training an RNN is the same idea, but the computation graph is unrolled across time steps. This is called backpropagation through time (BPTT).
Imagine unfolding the RNN: instead of one recurrent cell, picture T copies of the same cell laid out left to right, each feeding its hidden state to the next. Now it looks like a very deep feedforward network — and you can backpropagate through the whole thing.
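The unrolled picture translates fairly directly into code. The sketch below is one possible implementation, not the only one: it assumes a softmax cross-entropy loss on integer-coded characters (the text above does not fix a loss), and the function name `bptt` and the tiny sizes are illustrative.

```python
import numpy as np

def bptt(inputs, targets, h0, Wxh, Whh, Why, bh, by):
    """Forward pass over a sequence, then backpropagation through time.

    inputs and targets are lists of integer class ids. Returns the summed
    cross-entropy loss and the gradients (dWxh, dWhh, dWhy, dbh, dby).
    """
    xs, hs, ps = {}, {-1: h0}, {}
    loss = 0.0
    # Forward: one copy of the cell per time step, sharing the same weights
    for t, (i, tgt) in enumerate(zip(inputs, targets)):
        xs[t] = np.zeros(Wxh.shape[1])
        xs[t][i] = 1.0
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        y = Why @ hs[t] + by
        e = np.exp(y - y.max())
        ps[t] = e / e.sum()                # softmax probabilities
        loss -= np.log(ps[t][tgt])
    # Backward: walk the unrolled graph from the last step to the first
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy()
        dy[targets[t]] -= 1.0              # gradient of the loss w.r.t. the logits
        dWhy += np.outer(dy, hs[t])
        dby += dy
        dh = Why.T @ dy + dh_next          # from this step's output and from the future
        draw = (1.0 - hs[t] ** 2) * dh     # back through tanh
        dWxh += np.outer(draw, xs[t])
        dWhh += np.outer(draw, hs[t - 1])
        dbh += draw
        dh_next = Whh.T @ draw             # hand the gradient to the previous step
    return loss, (dWxh, dWhh, dWhy, dbh, dby)

rng = np.random.default_rng(0)
V, H = 5, 6                                # hypothetical vocabulary and hidden sizes
Wxh = rng.normal(scale=0.3, size=(H, V))
Whh = rng.normal(scale=0.3, size=(H, H))
Why = rng.normal(scale=0.3, size=(V, H))
bh, by = np.zeros(H), np.zeros(V)
inputs, targets = [0, 1, 2, 3], [1, 2, 3, 4]
loss, grads = bptt(inputs, targets, np.zeros(H), Wxh, Whh, Why, bh, by)
print(loss)
```

Because every unrolled copy shares the same weights, each weight gradient is a sum of contributions from all time steps, which is why the backward loop accumulates with `+=`.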
Watch the gradient flow backward through time. The same Whh is applied at every step.
In practice, we often use truncated BPTT: instead of backpropagating through the entire sequence, we chunk it into segments (say 25 steps) and backpropagate within each chunk. This trades off long-range gradient accuracy for memory and speed.
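The chunking itself is simple to sketch. The generator below is a hypothetical helper (`tbptt_chunks` is not from any library) that splits a sequence into segments and pairs each input with the element that follows it.

```python
def tbptt_chunks(sequence, chunk_len=25):
    """Yield (inputs, targets) pairs of at most chunk_len steps each.

    Targets are the inputs shifted one position to the right.
    """
    for start in range(0, len(sequence) - 1, chunk_len):
        end = min(start + chunk_len, len(sequence) - 1)
        yield sequence[start:end], sequence[start + 1:end + 1]

text = "the quick brown fox jumps over the lazy dog " * 3
for inputs, targets in tbptt_chunks(text, chunk_len=25):
    # In a real training loop (not shown), each chunk would be run forward
    # from the hidden state left by the previous chunk, backpropagated
    # within the chunk only, and followed by a weight update.
    print(repr(inputs), "->", repr(targets))
```

The hidden state is usually carried from one chunk to the next as a plain value, so the forward pass still sees the full history even though gradients stop at the chunk boundary.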
Here's the trouble with multiplying the same matrix over and over. Think about repeatedly multiplying a number by 0.9:
0.9 → 0.81 → 0.73 → 0.66 → ... → after 50 steps: 0.005. The signal practically disappears.
The same thing happens to gradients in an RNN. Each step backward through time multiplies the gradient by the derivative of tanh (which is at most 1) and by $W_{hh}$. Roughly speaking, if the largest eigenvalue magnitude of $W_{hh}$ is below 1, gradients shrink exponentially with the number of steps. If it's above 1, they can grow exponentially instead.
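A small numerical experiment makes the effect visible. This is a simplified sketch: it treats the tanh derivative as exactly 1 (its upper bound) and uses a scaled orthogonal matrix so that every eigenvalue has the same magnitude, which real trained weights will not have. The function name and sizes are illustrative.

```python
import numpy as np

def backprop_norms(scale, steps=50, hidden_dim=16, seed=0):
    """Track the gradient norm while stepping backward through time."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix times `scale`: every eigenvalue has magnitude `scale`
    q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))
    Whh = scale * q
    grad = np.ones(hidden_dim) / np.sqrt(hidden_dim)  # unit-norm gradient at the last step
    norms = []
    for _ in range(steps):
        grad = Whh.T @ grad  # one step back in time (tanh derivative taken as 1)
        norms.append(np.linalg.norm(grad))
    return norms

for scale in (0.9, 1.0, 1.1):
    print(scale, backprop_norms(scale)[-1])
```

In this idealized setting the norm after 50 steps is exactly the scale raised to the 50th power, so 0.9 collapses toward zero while 1.1 grows by more than two orders of magnitude.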
Adjust the multiplier to see how gradients vanish (<1) or explode (>1) over 50 time steps.
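```python
import numpy as np

def clip_gradients(grads, max_norm):
    """Gradient clipping: the standard fix for explosions.

    grads is a list of float NumPy arrays, modified in place.
    """
    # Global L2 norm across all gradient arrays
    grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if grad_norm > max_norm:
        for g in grads:
            g *= max_norm / grad_norm  # scale down uniformly
    return grad_norm
```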
The Long Short-Term Memory (LSTM) cell, invented by Hochreiter and Schmidhuber in 1997, tackles the vanishing gradient problem with a brilliant idea: give the network an explicit memory cell $c_t$ that information can flow through largely unchanged.
Think of it like a highway. In a vanilla RNN, every piece of information must pass through a tanh bottleneck at every step — it gets squeezed and distorted. The LSTM adds a bypass highway where information can flow across many time steps with only minor, controlled modifications.
| Gate | Formula | Role |
|---|---|---|
| Forget $f_t$ | $\sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ | What old memory to erase (0=forget, 1=keep) |
| Input $i_t$ | $\sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ | What new info to write (0=ignore, 1=write) |
| Output $o_t$ | $\sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ | What memory to expose (0=hide, 1=reveal) |
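The cell state update is:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

$$h_t = o_t \odot \tanh(c_t)$$

Here $\odot$ means element-wise multiplication, and $\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$ is the candidate memory proposed at this step. The forget gate decides how much old memory to keep. The input gate decides how much new information to add. The output gate decides what to expose as the hidden state.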
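The gates and the cell update fit in a few lines of NumPy. The sketch below follows the standard LSTM formulation; stacking the four weight blocks into one matrix `W` (in the order forget, input, output, candidate) is a layout assumption made here for brevity, and the sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step.

    W has shape [4 * hidden, hidden + input] and b has shape [4 * hidden],
    with rows stacked in the order: forget, input, output, candidate.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])                  # forget gate
    i = sigmoid(z[H:2 * H])              # input gate
    o = sigmoid(z[2 * H:3 * H])          # output gate
    c_tilde = np.tanh(z[3 * H:4 * H])    # candidate memory
    c = f * c_prev + i * c_tilde         # cell state update
    h = o * np.tanh(c)                   # exposed hidden state
    return h, c

rng = np.random.default_rng(0)
H, D = 3, 2                              # hypothetical hidden and input sizes
W = rng.normal(scale=0.01, size=(4 * H, H + D))
b = np.zeros(4 * H)
b[0:H] = 10.0                            # forget gate pinned near 1 (keep)
b[H:2 * H] = -10.0                       # input gate pinned near 0 (ignore)

c0 = np.array([1.0, -1.0, 0.5])
h, c = np.zeros(H), c0.copy()
for _ in range(100):
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)
print(c)  # still close to c0 after 100 steps
```

With the forget gate held open and the input gate held shut, the cell state survives 100 steps of unrelated input almost untouched, which is the highway behavior described above.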
Drag the gate sliders to see how they control information flow through the cell. The cell state (orange bar) persists across time.
Here's where it all comes together. A character-level language model reads text one character at a time and learns to predict the next character. Train it on Shakespeare and it writes Shakespeare-ish text. Train it on code and it writes code-ish text.
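The setup is simple: feed the text in one character at a time, and train the network to assign high probability to the character that actually comes next.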
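The data side of this setup can be sketched in a few lines. The helper name `make_char_dataset` is hypothetical; the idea is just to map characters to integer ids and pair each character with its successor.

```python
def make_char_dataset(text):
    """Map characters to integer ids and pair each one with its successor."""
    vocab = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    ids = [char_to_id[ch] for ch in text]
    inputs, targets = ids[:-1], ids[1:]  # the target is always the next character
    return vocab, inputs, targets

vocab, inputs, targets = make_char_dataset("hello")
print(vocab)    # ['e', 'h', 'l', 'o']
print(inputs)   # [1, 0, 2, 2]
print(targets)  # [0, 2, 2, 3]
```

No labels are needed beyond the text itself, since every position in the corpus supplies its own training target.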
At generation time, we sample from the output distribution instead of picking the most likely character. A temperature parameter controls how adventurous the sampling is: low temperature means conservative (picks likely characters), high temperature means creative (more random).
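Temperature is usually implemented by dividing the logits by the temperature before the softmax. The sketch below assumes that formulation; the four-character vocabulary and the logit values are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities, sharpened or flattened by temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def sample_char(logits, vocab, temperature=1.0, rng=None):
    """Sample one character from the temperature-adjusted distribution."""
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax_with_temperature(logits, temperature)
    return vocab[rng.choice(len(vocab), p=probs)]

vocab = ["e", "a", "i", "z"]
logits = [2.0, 1.0, 0.5, -1.0]      # hypothetical scores for the next character
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
print(sample_char(logits, vocab, temperature=0.5, rng=np.random.default_rng(0)))
```

As the temperature approaches zero the distribution collapses onto the single most likely character, and as it grows large the distribution approaches uniform.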
Given raw logits for the next character after "th", see how temperature changes the probability distribution.
Time to put it all together. Below is a character-level RNN running entirely in your browser. It's been trained on English text patterns — type a seed phrase and watch it generate character by character, with the hidden state evolving in real time.
Type seed text, then click Generate. Watch the hidden state bars shift as each character is produced.
RNNs were the dominant sequence model from 2013 to 2017. Then the Transformer arrived and changed everything. But RNN ideas live on — and understanding them makes Transformers click.
| Architecture | Year | Key Idea | Limitation |
|---|---|---|---|
| Vanilla RNN | 1986 | Hidden state = memory | Vanishing gradients |
| LSTM | 1997 | Gated cell state highway | Sequential (slow on GPU) |
| GRU | 2014 | Simplified LSTM (2 gates) | Still sequential |
| Attention | 2015 | Direct connections across time | Used alongside RNNs, not as a replacement |
| Transformer | 2017 | Attention is all you need | Quadratic in sequence length |
| SSM / Mamba | 2023 | RNN-like recurrence with parallelizable training | Active research area |
RNNs taught us that a network with memory can model the structure of language, music, and code — one character at a time. That insight changed everything.