Sequence Modeling

Recurrent Neural Networks
From Absolute Zero

The first neural architecture that could remember — and why remembering turns out to be harder than it sounds.

Prerequisites: Basic Python + High school algebra. That's it.
9 chapters · 10+ simulations · 0 assumed ML knowledge

Chapter 0: Why Sequences Are Special

Suppose you're reading this sentence word by word. Each word changes your understanding of what comes next. "The dog sat on the ___" — your brain screams "mat" or "couch" because it remembers everything that came before.

A standard feedforward network maps one input to one output. Show it the word "the" and it has no idea what came before, because it has no memory. But language, music, stock prices, and video are all sequences where the past shapes the future.

The core problem: How do you give a neural network memory? How do you let it look at a sequence one element at a time and remember what it saw?

[Interactive simulation: The Trouble With No Memory. A feedforward network sees each letter in isolation, so it cannot predict the next character because it does not know what came before.]

The fix is beautifully simple: give the network a hidden state — a vector it updates at every time step and passes to itself. The network reads one element, updates its memory, reads the next, updates again. This is a recurrent neural network.
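To make the read, update, repeat loop concrete, here is a minimal sketch that uses a single number as the hidden state. The update rule and the constants 0.5 and 1.0 are arbitrary illustrative choices, not learned weights, and the `memoryless` function is a hypothetical stand-in for a feedforward model that only sees the current element.

```python
import math

def read_sequence(inputs, h=0.0):
    """Read a sequence one element at a time, updating a single memory value."""
    for x in inputs:
        # The new memory depends on both the current input and the old memory.
        h = math.tanh(0.5 * h + 1.0 * x)
    return h

def memoryless(inputs):
    """A model with no memory: its output depends only on the last element."""
    return math.tanh(1.0 * inputs[-1])

seq_a = [1.0, -1.0, 1.0]
seq_b = [-1.0, -1.0, 1.0]   # same last element, different history

print(memoryless(seq_a) == memoryless(seq_b))        # identical outputs
print(read_sequence(seq_a) == read_sequence(seq_b))  # the histories leave different traces
```

The two sequences end in the same element, so the memoryless model cannot tell them apart, while the recurrent update ends in a different state for each.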

Why can't a standard feedforward network handle sequences naturally?

Chapter 1: The Hidden State — A Network That Remembers

Think of the hidden state as a notepad the network carries with it. At each time step, it reads the current input, glances at its notepad, decides what to output, then updates the notepad for the next step.

More precisely, the hidden state h_t is a vector of numbers that summarizes what the network has seen so far. At time step t, the network combines the current input x_t with the previous hidden state h_{t-1} to produce a new hidden state.

Key insight: The hidden state is a lossy compression of the entire history. It can't remember everything — it has a fixed number of slots. What it remembers depends on what the weights learned to keep.

[Interactive simulation: Hidden State as a Notepad. The hidden state evolves as the network reads each character, and a bar chart shows which "memory slots" activate for each letter.]

The hidden state size is a design choice. A 128-dimensional hidden state has 128 "memory slots." More slots means more capacity to remember, but also more parameters to train.
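The capacity versus parameter trade-off can be made concrete with a quick count. This sketch assumes a single-layer vanilla RNN with an input-to-hidden matrix, a hidden-to-hidden matrix, a hidden-to-output matrix, and two bias vectors; the 27-symbol vocabulary is a hypothetical choice for illustration.

```python
def rnn_param_count(input_dim, hidden_dim, output_dim):
    """Count trainable parameters in a single-layer vanilla RNN."""
    input_to_hidden = hidden_dim * input_dim
    hidden_to_hidden = hidden_dim * hidden_dim   # grows quadratically with hidden size
    hidden_to_output = output_dim * hidden_dim
    biases = hidden_dim + output_dim
    return input_to_hidden + hidden_to_hidden + hidden_to_output + biases

for hidden_dim in (16, 128, 512):
    print(hidden_dim, rnn_param_count(input_dim=27, hidden_dim=hidden_dim, output_dim=27))
```

The hidden-to-hidden term dominates as the hidden state grows, which is why doubling the number of slots roughly quadruples the parameter count for large hidden sizes.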

What does the hidden state represent?

Chapter 2: The Vanilla RNN — Three Matrices

The simplest recurrent network uses just three weight matrices and a nonlinearity. That's it. Here's the entire forward pass:

h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)
y_t = W_hy · h_t + b_y

Let's break each piece down:

| Symbol | Shape | Meaning |
|---|---|---|
| x_t | [input_dim] | Current input (e.g., one-hot character) |
| h_{t-1} | [hidden_dim] | Previous hidden state (memory) |
| W_xh | [hidden, input] | "How should I read the input?" |
| W_hh | [hidden, hidden] | "How should I use my memory?" |
| W_hy | [output, hidden] | "What should I output?" |
| tanh | elementwise | Squashes values to between -1 and 1 |

Think of it this way: W_xh is "what's happening now," W_hh is "what I remember," and tanh mixes them together into a new memory. W_hy reads that memory to produce output.

[Interactive simulation: Vanilla RNN Forward Pass. Data flows through the three matrices, and tanh squashes the combined signal to between -1 and 1.]

python
import numpy as np

class VanillaRNN:
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.Wxh = np.random.randn(hidden_dim, input_dim) * 0.01
        self.Whh = np.random.randn(hidden_dim, hidden_dim) * 0.01
        self.Why = np.random.randn(output_dim, hidden_dim) * 0.01
        self.bh = np.zeros(hidden_dim)
        self.by = np.zeros(output_dim)

    def step(self, x, h_prev):
        h = np.tanh(self.Wxh @ x + self.Whh @ h_prev + self.bh)
        y = self.Why @ h + self.by
        return h, y  # new hidden state, output logits
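
As a usage sketch, the same forward step can be run over a short word. The block below restates the update inline so it runs on its own; the 26-letter vocabulary, the hidden size of 8, and the seed are assumptions chosen for illustration, and the weights are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = "abcdefghijklmnopqrstuvwxyz"   # hypothetical lowercase-only vocabulary
input_dim = output_dim = len(vocab)
hidden_dim = 8

# Same three matrices and two biases as the class above, with small random values.
Wxh = rng.normal(size=(hidden_dim, input_dim)) * 0.01
Whh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.01
Why = rng.normal(size=(output_dim, hidden_dim)) * 0.01
bh = np.zeros(hidden_dim)
by = np.zeros(output_dim)

def one_hot(ch):
    x = np.zeros(input_dim)
    x[vocab.index(ch)] = 1.0
    return x

h = np.zeros(hidden_dim)               # empty memory before the first character
for ch in "hello":
    h = np.tanh(Wxh @ one_hot(ch) + Whh @ h + bh)
    y = Why @ h + by                   # logits for the next character
    print(ch, h.shape, y.shape)
```

The hidden state is the only thing carried from one character to the next; the logits are recomputed from it at every step.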

In the vanilla RNN equation, what does W_hh · h_{t-1} represent?

Chapter 3: Backpropagation Through Time

Training a normal network means computing gradients with backpropagation. Training an RNN is the same idea, but the computation graph is unrolled across time steps. This is called backpropagation through time (BPTT).

Imagine unfolding the RNN: instead of one recurrent cell, picture T copies of the same cell laid out left to right, each feeding its hidden state to the next. Now it looks like a very deep feedforward network — and you can backpropagate through the whole thing.

1. Forward: Run the RNN over all T time steps, saving every h_t.
2. Loss: Compare each output y_t to its target. Sum the losses.
3. Backward: Backpropagate from step T to step 1 through every saved h_t.
4. Update: Sum the gradients across all time steps. Update W_xh, W_hh, W_hy.

The key realization: The same weights W_hh are applied at every time step. When you backpropagate through 100 steps, the gradient flows through W_hh 100 times. This is about to cause trouble.
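
The forward, loss, backward steps above can be sketched for a tiny RNN and checked against a finite-difference estimate. To keep the block short, this sketch makes simplifying assumptions: there are no biases, the squared-error loss is placed directly on the hidden state (no output matrix), and only the gradient for the recurrent matrix is computed. All sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4
Wxh = rng.normal(size=(hidden_dim, input_dim)) * 0.5
Whh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.5
xs = rng.normal(size=(T, input_dim))
targets = rng.normal(size=(T, hidden_dim))

def forward(Whh):
    """Run all T steps, saving every hidden state; return total loss and states."""
    hs = [np.zeros(hidden_dim)]
    loss = 0.0
    for t in range(T):
        hs.append(np.tanh(Wxh @ xs[t] + Whh @ hs[-1]))
        loss += 0.5 * np.sum((hs[-1] - targets[t]) ** 2)
    return loss, hs

def bptt_grad_Whh(Whh):
    """Backpropagate from the last step to the first, summing the gradient for Whh."""
    _, hs = forward(Whh)
    dWhh = np.zeros_like(Whh)
    dh_next = np.zeros(hidden_dim)           # gradient arriving from later time steps
    for t in reversed(range(T)):
        dh = (hs[t + 1] - targets[t]) + dh_next
        dz = dh * (1.0 - hs[t + 1] ** 2)     # back through tanh
        dWhh += np.outer(dz, hs[t])          # shared weights: contributions add up
        dh_next = Whh.T @ dz                 # pass the gradient one step back in time
    return dWhh

# Finite-difference check on a single entry of Whh.
eps = 1e-5
Wp, Wm = Whh.copy(), Whh.copy()
Wp[0, 1] += eps
Wm[0, 1] -= eps
numeric = (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps)
analytic = bptt_grad_Whh(Whh)[0, 1]
print(numeric, analytic)
```

The line `dh_next = Whh.T @ dz` is where the repeated multiplication by the recurrent matrix happens: it runs once per time step on the way back.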

[Interactive simulation: Unrolled RNN Computation Graph. The gradient flows backward through time, and the same W_hh is applied at every step.]

In practice, we often use truncated BPTT: instead of backpropagating through the entire sequence, we chunk it into segments (say 25 steps) and backpropagate within each chunk. This trades off long-range gradient accuracy for memory and speed.
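As a rough sketch of the chunking step, the helper below splits a sequence into consecutive segments of input and target pairs for next-element prediction. The function name `truncated_chunks` and the sample text are hypothetical; the training step itself is left as a comment.

```python
def truncated_chunks(sequence, chunk_len=25):
    """Yield consecutive (inputs, targets) segments for next-element prediction."""
    # Targets are the inputs shifted by one position.
    for start in range(0, len(sequence) - 1, chunk_len):
        inputs = sequence[start:start + chunk_len]
        targets = sequence[start + 1:start + chunk_len + 1]
        # The final chunk may be shorter; trim inputs to match the targets.
        yield inputs[:len(targets)], targets

text = "the quick brown fox jumps over the lazy dog " * 3
for inputs, targets in truncated_chunks(text, chunk_len=25):
    # In a real training loop: run forward from the carried-over hidden state,
    # backpropagate inside this chunk only, then carry the hidden state forward.
    print(len(inputs), repr(inputs[:10]), "->", repr(targets[:10]))
```

The hidden state still travels across chunk boundaries in the forward direction; only the backward pass is cut off, which is where the loss of long-range gradient accuracy comes from.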

Why is it called "backpropagation through time"?

Chapter 4: The Vanishing Gradient Problem

Here's the trouble with multiplying the same matrix over and over. Think about repeatedly multiplying a number by 0.9:

0.9 → 0.81 → 0.73 → 0.66 → ... → after 50 steps: 0.005. The signal practically disappears.

The same thing happens to gradients in an RNN. Each time the gradient flows back one time step, it is multiplied by the derivative of tanh (which is at most 1) and by W_hh. Roughly speaking, if the largest eigenvalue of W_hh has magnitude less than 1, gradients shrink exponentially and vanish. If it is greater than 1, they can grow exponentially and explode.
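A small numerical experiment illustrates the effect of multiplying by the recurrent matrix at every step. This sketch makes two simplifying assumptions: the recurrent matrix is a scaled orthogonal matrix, so every singular value equals `scale`, and the tanh derivative is taken as 1, which is the most favorable case for the gradient. The function name and sizes are illustrative.

```python
import numpy as np

def backprop_norms(scale, steps=50, hidden_dim=16, seed=0):
    """Track the gradient norm as it is pushed back through `steps` time steps."""
    rng = np.random.default_rng(seed)
    # Assumption: Whh = scale * (orthogonal matrix), so every singular value is `scale`.
    q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))
    Whh = scale * q
    grad = rng.normal(size=hidden_dim)
    grad /= np.linalg.norm(grad)             # start with unit norm
    norms = []
    for _ in range(steps):
        grad = Whh.T @ grad                  # one step back (tanh derivative taken as 1)
        norms.append(np.linalg.norm(grad))
    return norms

for scale in (0.9, 1.0, 1.1):
    print(scale, backprop_norms(scale)[-1])
```

Under these assumptions the norm after k steps is exactly `scale ** k`, so a scale of 0.9 reproduces the scalar example above and a scale of 1.1 grows by a factor of more than a hundred over 50 steps.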

Vanishing gradients: The network can learn that "a" predicts "b" one step later, but it can't learn that a word 50 steps ago matters for the current prediction. Information from the distant past gets lost.

[Interactive simulation: Gradient Flow Over Time. A multiplier slider (default 0.90) shows how gradients vanish (<1) or explode (>1) over 50 time steps.]

Exploding gradients have a simple fix: gradient clipping. If the gradient norm exceeds a threshold, scale it down. Vanishing gradients are harder: you need a new architecture.

python
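import numpy as np

def clip_gradients(grads, max_norm):
    """Gradient clipping: the standard fix for explosions."""
    # Global L2 norm across every gradient array in the list
    grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if grad_norm > max_norm:
        for g in grads:
            g *= max_norm / grad_norm  # scale down uniformly, in place
    return grads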

If W_hh has a largest eigenvalue of 0.8, what happens to a gradient signal after 100 time steps?

Chapter 5: LSTM — Learning to Remember and Forget

The Long Short-Term Memory (LSTM) cell, introduced by Hochreiter and Schmidhuber in 1997, tackles the vanishing gradient problem with a brilliant idea: give the network an explicit memory cell c_t that information can flow through largely unchanged.

Think of it like a highway. In a vanilla RNN, every piece of information must pass through a tanh bottleneck at every step — it gets squeezed and distorted. The LSTM adds a bypass highway where information can flow across many time steps with only minor, controlled modifications.

The LSTM has three gates — each a sigmoid (0 to 1) that controls information flow. Think of them as valves on a pipe.

| Gate | Formula | Role |
|---|---|---|
| Forget f_t | σ(W_f · [h_{t-1}, x_t] + b_f) | What old memory to erase (0 = forget, 1 = keep) |
| Input i_t | σ(W_i · [h_{t-1}, x_t] + b_i) | What new info to write (0 = ignore, 1 = write) |
| Output o_t | σ(W_o · [h_{t-1}, x_t] + b_o) | What memory to expose (0 = hide, 1 = reveal) |

The cell state update is:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · [h_{t-1}, x_t] + b_c)
h_t = o_t ⊙ tanh(c_t)

Where ⊙ means element-wise multiplication. The forget gate decides how much old memory to keep. The input gate decides how much new information to add. The output gate decides what to expose as the hidden state.
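The gate formulas and the cell state update above translate almost line for line into NumPy. This is a minimal sketch, not a reference implementation: the parameter dictionary layout, the `init_params` helper, and the bias offsets of 20 are assumptions made so the "keep everything" behavior is easy to see.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step following the gate and cell-state equations above."""
    hx = np.concatenate([h_prev, x])             # [h_{t-1}, x_t]
    f = sigmoid(p["Wf"] @ hx + p["bf"])          # forget gate
    i = sigmoid(p["Wi"] @ hx + p["bi"])          # input gate
    o = sigmoid(p["Wo"] @ hx + p["bo"])          # output gate
    c_candidate = np.tanh(p["Wc"] @ hx + p["bc"])
    c = f * c_prev + i * c_candidate             # element-wise products
    h = o * np.tanh(c)
    return h, c

def init_params(input_dim, hidden_dim, seed=0):
    rng = np.random.default_rng(seed)
    p = {}
    for name in ("f", "i", "o", "c"):
        p["W" + name] = rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1
        p["b" + name] = np.zeros(hidden_dim)
    return p

p = init_params(input_dim=3, hidden_dim=4)
# Push the forget gate toward 1 and the input gate toward 0 to mimic "keep everything".
p["bf"] += 20.0
p["bi"] -= 20.0
c = np.array([0.5, -1.0, 2.0, 0.0])
h = np.zeros(4)
for x in np.random.default_rng(1).normal(size=(10, 3)):
    h, c = lstm_step(x, h, c, p)
print(c)  # stays very close to the initial cell state
```

With the forget gate saturated near 1 and the input gate near 0, ten steps of random input leave the cell state essentially untouched, which is the highway behavior the chapter describes.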

[Interactive simulation: LSTM Gate Visualizer. Sliders for the forget (default 0.90), input (default 0.50), and output (default 0.70) gates show how they control information flow through the cell. The cell state (orange bar) persists across time.]

Why this addresses vanishing gradients: When the forget gate is 1 and the input gate is 0, c_t = c_{t-1} exactly. Gradients flow straight back along the cell state highway without shrinking or growing.
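
The same point can be written as a short derivation from the cell state update. This is a sketch under a simplifying assumption: the gate values are treated as constants, ignoring their indirect dependence on earlier cell states through the hidden state.

```latex
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial c_T}{\partial c_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
```

If every forget gate entry is 1, the product is the identity and the gradient arrives intact. If an entry sits at 0.9 for 50 steps, that component is scaled by 0.9^50 ≈ 0.005, the same decay as in the vanilla RNN, except that here the network learns when to let it happen.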

An LSTM reads a long paragraph and encounters a closing quote mark. Which gate most likely activates?

Chapter 6: Character-Level Language Models

Here's where it all comes together. A character-level language model reads text one character at a time and learns to predict the next character. Train it on Shakespeare and it writes Shakespeare-ish text. Train it on code and it writes code-ish text.

The setup is simple:

1. Input: One-hot encode the current character (e.g., "h" → [0,0,0,0,0,0,0,1,0,...,0]).
2. RNN step: Feed it into the RNN with the previous hidden state to get the new h_t.
3. Output: W_hy · h_t → logits over all characters → softmax → probabilities.
4. Loss: Cross-entropy between the predicted distribution and the actual next character.

Repeat for every character in the training text.
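
The input encoding and the loss in the steps above can be sketched without any trained model. The 27-symbol vocabulary (lowercase letters plus space) and the helper names are assumptions for illustration; the logits in the last line are all zeros to stand in for an untrained network.

```python
import numpy as np

vocab = "abcdefghijklmnopqrstuvwxyz "      # hypothetical 27-symbol vocabulary
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    x = np.zeros(len(vocab))
    x[char_to_idx[ch]] = 1.0
    return x

def softmax(logits):
    shifted = logits - np.max(logits)      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(logits, target_char):
    """Negative log-probability assigned to the actual next character."""
    probs = softmax(logits)
    return -np.log(probs[char_to_idx[target_char]])

text = "hello"
pairs = list(zip(text[:-1], text[1:]))     # (current char, next char) training pairs
print(pairs)
print(int(np.argmax(one_hot("h"))))        # position of the 1 in the one-hot vector

# A model that outputs all-zero logits is uniformly unsure about the next character.
print(cross_entropy(np.zeros(len(vocab)), "e"), np.log(len(vocab)))
```

A uniform guess costs log(vocabulary size) per character, which gives a useful baseline: a training loss below that value means the model has learned something about the text.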

At generation time, we sample from the output distribution instead of picking the most likely character. A temperature parameter controls how adventurous the sampling is: low temperature means conservative (picks likely characters), high temperature means creative (more random).

P(c) = exp(logit_c / T) / Σ_j exp(logit_j / T)
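
The temperature formula is short enough to sketch directly. The logits below are made-up values for four candidate characters, and the function names are illustrative.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by the temperature first."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                 # numerical stability; the result is unchanged
    exp = np.exp(scaled)
    return exp / exp.sum()

def sample_index(logits, temperature=1.0, rng=None):
    """Draw one index at random according to the tempered distribution."""
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax_with_temperature(logits, temperature)
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]             # made-up logits for four candidate characters
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
```

Lower temperatures concentrate probability on the highest logit, and higher temperatures flatten the distribution toward uniform.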

[Interactive simulation: Temperature Effect on Sampling. Given raw logits for the next character after "th", a temperature slider (default 1.0) shows how temperature changes the probability distribution.]

Karpathy's result: A 3-layer LSTM with 512 hidden units per layer, trained on Shakespeare for a few hours, produces text that looks remarkably like Shakespeare (with correct indentation, stage directions, and dialogue structure) despite having zero knowledge of English.

What does lowering the temperature do during character generation?

Chapter 7: Showcase — Live Character-Level RNN

Time to put it all together. Below is a character-level RNN running entirely in your browser. It's been trained on English text patterns — type a seed phrase and watch it generate character by character, with the hidden state evolving in real time.

[Interactive simulation: Interactive Character-Level RNN. Type seed text, then click Generate, and watch the hidden state bars shift as each character is produced. Controls: temperature (default 0.8) and speed (default 120 ms). Displays: hidden state activation (16 dimensions) and character probabilities (top 10).]

What to try: Start with "the " at temperature 0.8 for coherent-looking text. Crank temperature to 2.0+ and watch it go wild. Try "q" and notice it almost always produces "u" next, because it learned English spelling patterns.

How this works: The RNN in your browser has pre-trained weights that encode English character statistics: common bigrams (th, he, in), trigrams (the, ing, tion), and even some longer patterns. It's a tiny model (16 hidden units) so it won't write Shakespeare, but it captures the feel of English.
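
For readers who want to see the generation loop itself, here is a minimal sketch of seed, sample, and feed back. The weights below are random stand-ins rather than the demo's pre-trained weights, so the output is gibberish; the vocabulary, the scaling factor of 0.1, and the function names are assumptions, and the seed text must be non-empty.

```python
import numpy as np

vocab = "abcdefghijklmnopqrstuvwxyz "
hidden_dim = 16
rng = np.random.default_rng(0)
# Random stand-ins for pre-trained weights, so the generated text will be gibberish.
Wxh = rng.normal(size=(hidden_dim, len(vocab))) * 0.1
Whh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
Why = rng.normal(size=(len(vocab), hidden_dim)) * 0.1

def step(ch, h):
    """Read one character, update the hidden state, return next-character logits."""
    x = np.zeros(len(vocab))
    x[vocab.index(ch)] = 1.0
    h = np.tanh(Wxh @ x + Whh @ h)
    return h, Why @ h

def generate(seed_text, n_chars, temperature=0.8):
    h = np.zeros(hidden_dim)
    for ch in seed_text:                       # warm up the hidden state on the seed
        h, logits = step(ch, h)
    out = seed_text
    for _ in range(n_chars):
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        ch = vocab[rng.choice(len(vocab), p=probs)]
        out += ch
        h, logits = step(ch, h)                # feed the sampled character back in
    return out

print(generate("the ", n_chars=40))
```

Swapping in trained weights is the only change needed to turn this loop into a working character-level generator.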

Chapter 8: Beyond — Where RNNs Led

RNNs were the dominant sequence model from 2013 to 2017. Then the Transformer arrived and changed everything. But RNN ideas live on — and understanding them makes Transformers click.

| Architecture | Year | Key Idea | Limitation |
|---|---|---|---|
| Vanilla RNN | 1986 | Hidden state = memory | Vanishing gradients |
| LSTM | 1997 | Gated cell state highway | Sequential (slow on GPU) |
| GRU | 2014 | Simplified LSTM (2 gates) | Still sequential |
| Attention | 2015 | Direct connections across time | Added to RNNs, not replacing them |
| Transformer | 2017 | Attention is all you need | Quadratic in sequence length |
| SSM / Mamba | 2023 | RNN-like recurrence, done right | Active research area |

The RNN → Transformer lineage: RNNs proved that learned representations of sequences are powerful. Attention (first added to RNNs for machine translation) proved that direct connections beat sequential memory. The Transformer dropped the recurrence and kept the attention. Full circle: modern SSMs like Mamba bring back recurrence, but with the lessons learned.

RNNs Still Shine For

Real-time streaming: Process one token at a time with constant memory, ideal for on-device inference.

Very long sequences: No quadratic attention cost. Memory is fixed regardless of sequence length.
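
The constant-memory claim can be illustrated with a stream that is consumed one element at a time. This is a sketch with untrained random weights; `stream_tokens` is a hypothetical input source, and the sizes are arbitrary.

```python
import numpy as np

def stream_tokens(n, input_dim, seed=0):
    """Hypothetical input source: yields one input vector at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n):
        yield rng.normal(size=input_dim)

input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(1)
Wxh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
Whh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1

h = np.zeros(hidden_dim)
steps = 0
for x in stream_tokens(10_000, input_dim):
    h = np.tanh(Wxh @ x + Whh @ h)   # overwrite the state; no history is stored
    steps += 1

print(steps, h.shape, h.nbytes)      # state size does not depend on how many tokens were read
```

After ten thousand inputs the only thing held in memory is the same 8-element vector the loop started with.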


"The unreasonable effectiveness of recurrent neural networks."
— Andrej Karpathy, 2015

RNNs taught us that a network with memory can model the structure of language, music, and code — one character at a time. That insight changed everything.