Recurrent Neural Networks — From Absolute Zero to Mastery

Chapter 0: Why Sequences Are Special

Suppose you're reading this sentence word by word. Each word changes your understanding of what comes next. "The dog sat on the ___" — your brain screams "mat" or "couch" because it remembers everything that came before.

A standard feedforward network maps one input to one output. Show it the word "the" and it has no idea what came before, because it has no memory. But language, music, stock prices, and video are all sequences where the past shapes the future.

The core problem: How do you give a neural network memory? How do you let it look at a sequence one element at a time and remember what it saw?

The Trouble With No Memory

A feedforward network sees each letter in isolation. It can't predict the next character because it doesn't know what came before.

The fix is beautifully simple: give the network a hidden state — a vector it updates at every time step and passes to itself. The network reads one element, updates its memory, reads the next, updates again. This is a recurrent neural network.

Why can't a standard feedforward network handle sequences naturally?

It's too slow to process long inputs
It processes each input independently with no memory of previous inputs
It can only handle numerical data, not text

Chapter 1: The Hidden State — A Network That Remembers

Think of the hidden state as a notepad the network carries with it. At each time step, it reads the current input, glances at its notepad, decides what to output, then updates the notepad for the next step.

More precisely, the hidden state h_t is a vector of numbers that encodes everything the network has seen so far. At time step t, the network combines the current input x_t with the previous hidden state h_t-1 to produce a new hidden state.

Key insight: The hidden state is a lossy compression of the entire history. It can't remember everything — it has a fixed number of slots. What it remembers depends on what the weights learned to keep.

Hidden State as a Notepad

Watch the hidden state evolve as the network reads each character. The bar chart shows which "memory slots" activate for each letter.

The hidden state size is a design choice. A 128-dimensional hidden state has 128 "memory slots." More slots means more capacity to remember, but also more parameters to train.
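
To make the capacity versus parameter trade-off concrete, here is a minimal sketch that counts trainable parameters for different hidden sizes. It assumes the three-matrix vanilla RNN layout described in the next chapter (input-to-hidden, hidden-to-hidden, hidden-to-output, plus two biases); the function name and the vocabulary size of 65 are hypothetical choices for illustration.

```python
def vanilla_rnn_param_count(input_dim: int, hidden_dim: int, output_dim: int) -> int:
    """Count trainable parameters of a vanilla RNN with a linear output layer."""
    w_xh = hidden_dim * input_dim    # input-to-hidden weights
    w_hh = hidden_dim * hidden_dim   # hidden-to-hidden weights (grows quadratically)
    b_h = hidden_dim                 # hidden bias
    w_hy = output_dim * hidden_dim   # hidden-to-output weights
    b_y = output_dim                 # output bias
    return w_xh + w_hh + b_h + w_hy + b_y


# Hypothetical setup: 65 distinct characters as both input and output.
for hidden_dim in (64, 128, 256):
    print(hidden_dim, vanilla_rnn_param_count(65, hidden_dim, 65))
```

Doubling the hidden size roughly quadruples the hidden-to-hidden term, which is why large hidden states get expensive quickly.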

What does the hidden state represent?

A compressed summary of everything the network has seen so far
The raw input sequence stored verbatim
The final output prediction

Chapter 2: The Vanilla RNN — Three Matrices

The simplest recurrent network uses just three weight matrices and a nonlinearity. That's it. Here's the entire forward pass:

h_t = tanh( W_xh · x_t + W_hh · h_t-1 + b_h )

y_t = W_hy · h_t + b_y

Let's break each piece down:

Symbol	Shape	Meaning
x_t	[input_dim]	Current input (e.g., one-hot character)
h_t-1	[hidden_dim]	Previous hidden state (memory)
W_xh	[hidden, input]	"How should I read the input?"
W_hh	[hidden, hidden]	"How should I use my memory?"
W_hy	[output, hidden]	"What should I output?"
tanh	(elementwise)	Squashes each value into the range (-1, 1)

Think of it this way: W_xh is "what's happening now," W_hh is "what I remember," and tanh mixes them together into a new memory. W_hy reads that memory to produce output.

Vanilla RNN Forward Pass

Watch data flow through the three matrices. The tanh squashes the combined signal into the range (-1, 1).

python
import numpy as np

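class VanillaRNN:
    """Minimal vanilla RNN cell: three weight matrices plus two biases."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        # Small random weights keep tanh near its linear region at the start.
        self.Wxh = np.random.randn(hidden_dim, input_dim) * 0.01   # input -> hidden
        self.Whh = np.random.randn(hidden_dim, hidden_dim) * 0.01  # hidden -> hidden
        self.Why = np.random.randn(output_dim, hidden_dim) * 0.01  # hidden -> output
        self.bh = np.zeros(hidden_dim)
        self.by = np.zeros(output_dim)

    def step(self, x, h_prev):
        """Advance one time step; x has shape [input_dim], h_prev [hidden_dim]."""
        h = np.tanh(self.Wxh @ x + self.Whh @ h_prev + self.bh)
        y = self.Why @ h + self.by
        return h, y  # new hidden state, output logits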
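
A single step only becomes a recurrent network once it is applied in a loop. The sketch below runs the same two equations over a short sequence, starting from a zero hidden state. The function name `rnn_forward`, the toy sizes, and the seeded random weights are assumptions made for illustration only.

```python
import numpy as np


def rnn_forward(inputs, Wxh, Whh, Why, bh, by):
    """Run a vanilla RNN over a list of input vectors, starting from h = 0."""
    h = np.zeros(Whh.shape[0])
    hidden_states, outputs = [], []
    for x in inputs:
        h = np.tanh(Wxh @ x + Whh @ h + bh)  # update the memory
        y = Why @ h + by                     # read the memory to produce logits
        hidden_states.append(h)
        outputs.append(y)
    return hidden_states, outputs


# Hypothetical toy sizes: 4 input symbols, 8 hidden units, 4 output classes.
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 4
Wxh = rng.standard_normal((hidden_dim, input_dim)) * 0.01
Whh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.01
Why = rng.standard_normal((output_dim, hidden_dim)) * 0.01
bh, by = np.zeros(hidden_dim), np.zeros(output_dim)

# A sequence of five one-hot inputs.
inputs = [np.eye(input_dim)[i] for i in (0, 2, 1, 3, 0)]
hs, ys = rnn_forward(inputs, Wxh, Whh, Why, bh, by)
print(len(hs), hs[-1].shape, ys[-1].shape)
```

Note that the hidden state keeps the same shape no matter how long the sequence is; only the number of loop iterations changes.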

In the vanilla RNN equation, what does W_hh · h_t-1 represent?

The contribution from the current input
The contribution from the network's memory of the past
The final output prediction

Chapter 3: Backpropagation Through Time

Training a normal network means computing gradients with backpropagation. Training an RNN is the same idea, but the computation graph is unrolled across time steps. This is called backpropagation through time (BPTT).

Imagine unfolding the RNN: instead of one recurrent cell, picture T copies of the same cell laid out left to right, each feeding its hidden state to the next. Now it looks like a very deep feedforward network — and you can backpropagate through the whole thing.

Forward

Run the RNN over all T time steps, saving every h_t

↓

Loss

Compare each output y_t to the target. Sum the losses.

↓

Backward

Backpropagate from step T to step 1 through every saved h_t

↓

Update

Sum up gradients across all time steps. Update W_xh, W_hh, W_hy, and the biases.
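
The four stages above can be sketched in NumPy as follows. This is an illustrative implementation, not a reference one: it assumes a softmax with cross-entropy loss on the outputs (the text only says "compare each output to the target"), integer class targets, and a zero initial hidden state. The function name `bptt` is hypothetical.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def bptt(xs, targets, Wxh, Whh, Why, bh, by):
    """Return the summed cross-entropy loss and its gradients for one sequence."""
    T = len(xs)
    hs = {-1: np.zeros(Whh.shape[0])}  # hs[-1] is the initial hidden state
    ps = {}
    loss = 0.0
    # Forward: run all T steps, saving every hidden state.
    for t in range(T):
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        ps[t] = softmax(Why @ hs[t] + by)
        loss -= np.log(ps[t][targets[t]])
    # Backward: walk from step T-1 down to step 0, accumulating gradients.
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(hs[-1])
    for t in reversed(range(T)):
        dy = ps[t].copy()
        dy[targets[t]] -= 1.0          # gradient of softmax + cross-entropy
        dWhy += np.outer(dy, hs[t])
        dby += dy
        dh = Why.T @ dy + dh_next      # from this step's output and from the future
        dz = (1.0 - hs[t] ** 2) * dh   # back through tanh
        dWxh += np.outer(dz, xs[t])
        dWhh += np.outer(dz, hs[t - 1])
        dbh += dz
        dh_next = Whh.T @ dz           # pass the gradient one step further back
    return loss, (dWxh, dWhh, dWhy, dbh, dby)


# Hypothetical toy problem: 3 symbols, 5 hidden units, sequence of length 4.
rng = np.random.default_rng(0)
I, H, O = 3, 5, 3
Wxh = rng.standard_normal((H, I)) * 0.5
Whh = rng.standard_normal((H, H)) * 0.5
Why = rng.standard_normal((O, H)) * 0.5
bh, by = np.zeros(H), np.zeros(O)
xs = [np.eye(I)[i] for i in (0, 1, 2, 1)]
targets = [1, 2, 1, 0]
loss, grads = bptt(xs, targets, Wxh, Whh, Why, bh, by)
print("loss:", loss)
```

The line `dh_next = Whh.T @ dz` is where the same recurrent matrix is applied once per time step on the way back.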

The key realization: The same weights W_hh are multiplied at every time step. When you backpropagate through 100 steps, the gradient flows through W_hh 100 times. This is about to cause trouble.

Unrolled RNN Computation Graph

Watch the gradient flow backward through time. The same W_hh is applied at every step.

In practice, we often use truncated BPTT: instead of backpropagating through the entire sequence, we chunk it into segments (say 25 steps) and backpropagate within each chunk. This trades off long-range gradient accuracy for memory and speed.
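
The chunking itself is straightforward. The following sketch only shows how a long sequence might be split into input and target segments for next-item prediction; the helper name `iter_chunks` and the sample text are hypothetical, and the actual forward and backward passes are left as comments.

```python
def iter_chunks(sequence, chunk_len=25):
    """Yield (inputs, targets) segments of at most chunk_len steps.

    Targets are the inputs shifted by one position (next-item prediction).
    """
    for start in range(0, len(sequence) - 1, chunk_len):
        end = min(start + chunk_len, len(sequence) - 1)
        yield sequence[start:end], sequence[start + 1:end + 1]


text = "hello world, hello recurrent networks"
for inputs, targets in iter_chunks(text, chunk_len=10):
    # In a real training loop: run forward from the carried-over hidden state,
    # backpropagate only within this chunk, then keep the final hidden state
    # (treated as a constant) as the starting point for the next chunk.
    print(repr(inputs), "->", repr(targets))
```

Carrying the hidden state forward lets the network still use older context in the forward pass, even though gradients stop at each chunk boundary.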

Why is it called "backpropagation through time"?

The gradient flows backward through the unrolled time steps of the RNN
The RNN can predict the future by looking back in time
Training takes a long time to converge

Chapter 4: The Vanishing Gradient Problem

Here's the trouble with multiplying the same matrix over and over. Think about repeatedly multiplying a number by 0.9:

0.9 → 0.81 → 0.73 → 0.66 → ... → after 50 steps: 0.005. The signal practically disappears.

The same thing happens to gradients in an RNN. When the gradient flows back through each time step, it gets multiplied by the derivative of tanh (which is at most 1) and by W_hh. Roughly speaking, if the largest eigenvalue magnitude of W_hh is less than 1, gradients vanish exponentially. If it's greater than 1, they can explode. (Precise statements of these conditions are usually given in terms of the largest singular value of W_hh.)

Vanishing gradients: The network can learn that "a" predicts "b" one step later, but it can't learn that a word 50 steps ago matters for the current prediction. Information from the distant past gets lost.
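
A small numerical experiment makes the exponential behavior visible. This sketch is a simplification under stated assumptions: W_hh is built as a scaled orthogonal matrix so that every singular value equals the chosen scale, and the tanh derivative is ignored (since it is at most 1, including it could only shrink the gradient further). The function name `backprop_norms` is hypothetical.

```python
import numpy as np


def backprop_norms(scale, steps=50, hidden_dim=16, seed=0):
    """Norm of a gradient vector after each backward step through W_hh^T."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix from a QR decomposition, then scaled.
    q, _ = np.linalg.qr(rng.standard_normal((hidden_dim, hidden_dim)))
    W_hh = scale * q
    grad = np.ones(hidden_dim) / np.sqrt(hidden_dim)  # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        grad = W_hh.T @ grad
        norms.append(np.linalg.norm(grad))
    return norms


for scale in (0.9, 1.0, 1.1):
    print(scale, backprop_norms(scale)[-1])
```

With this construction the norm after k steps is exactly scale to the power k, so 0.9 shrinks the gradient to about 0.005 after 50 steps while 1.1 grows it by a factor of over 100.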

Gradient Flow Over Time

Adjust the multiplier (default 0.90) to see how gradients vanish (<1) or explode (>1) over 50 time steps.

Exploding gradients have a simple fix: gradient clipping. If the gradient norm exceeds a threshold, scale it down. Vanishing gradients are harder — you need a new architecture.

python
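# Gradient clipping: the standard fix for explosions
import numpy as np

def clip_gradients(grads, max_norm):
    """Scale gradient arrays in place so their global L2 norm is at most max_norm."""
    grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if grad_norm > max_norm:
        for g in grads:
            g *= max_norm / grad_norm  # scale down uniformly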

If W_hh has a largest eigenvalue of 0.8, what happens to a gradient signal after 100 time steps?

It shrinks to nearly zero (0.8¹⁰⁰ ≈ 2 × 10⁻¹⁰)
It stays roughly the same size
It grows exponentially

Chapter 5: LSTM — Learning to Remember and Forget

The Long Short-Term Memory (LSTM) cell, introduced by Hochreiter and Schmidhuber in 1997, tackles the vanishing gradient problem with a brilliant idea: give the network an explicit memory cell c_t that information can flow through largely unchanged.

Think of it like a highway. In a vanilla RNN, every piece of information must pass through a tanh bottleneck at every step — it gets squeezed and distorted. The LSTM adds a bypass highway where information can flow across many time steps with only minor, controlled modifications.

The LSTM has three gates — each a sigmoid (0 to 1) that controls information flow. Think of them as valves on a pipe.

Gate	Formula	Role
Forget f_t	σ(W_f · [h_t-1, x_t] + b_f)	What old memory to erase (0=forget, 1=keep)
Input i_t	σ(W_i · [h_t-1, x_t] + b_i)	What new info to write (0=ignore, 1=write)
Output o_t	σ(W_o · [h_t-1, x_t] + b_o)	What memory to expose (0=hide, 1=reveal)

The cell state update is:

c_t = f_t ⊙ c_t-1 + i_t ⊙ tanh(W_c · [h_t-1, x_t] + b_c)

h_t = o_t ⊙ tanh(c_t)

Where ⊙ means element-wise multiplication. The forget gate decides how much old memory to keep. The input gate decides how much new information to add. The output gate decides what to expose as the hidden state.
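
Put together, the gate formulas and the two update equations give a complete LSTM step. The sketch below is a direct transcription under a few assumptions: each weight matrix has shape [hidden, hidden + input] and acts on the concatenation [h_t-1, x_t], and the function names, toy sizes, and random weights are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One LSTM step; returns the new hidden state and the new cell state."""
    z = np.concatenate([h_prev, x])   # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)        # forget gate: what old memory to keep
    i = sigmoid(W_i @ z + b_i)        # input gate: what new info to write
    o = sigmoid(W_o @ z + b_o)        # output gate: what memory to expose
    g = np.tanh(W_c @ z + b_c)        # candidate values to write
    c = f * c_prev + i * g            # cell state update
    h = o * np.tanh(c)                # exposed hidden state
    return h, c


# Hypothetical toy sizes: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
weights = [rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1 for _ in range(4)]
biases = [np.zeros(hidden_dim) for _ in range(4)]

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.standard_normal((5, input_dim)):   # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, *weights, *biases)
print(h.shape, c.shape)
```

Many implementations stack the four weight matrices into one larger matrix for speed; keeping them separate here mirrors the table above.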

LSTM Gate Visualizer

Drag the gate sliders (defaults: forget 0.90, input 0.50, output 0.70) to see how they control information flow through the cell. The cell state (orange bar) persists across time.

Why this solves vanishing gradients: When the forget gate is 1 and the input gate is 0, the cell state c_t = c_t-1 exactly. Gradients flow straight back through the cell state highway — no vanishing, no explosion.
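
The argument can be written out as a short derivation. This is a simplified view that follows only the direct path through the cell state and ignores the indirect dependence of the gates on c_t-1 via h_t-1; the symbol \tilde{c}_t for the candidate values is introduced here for convenience.

```latex
\begin{aligned}
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,
  \qquad \tilde{c}_t = \tanh\!\left(W_c \, [h_{t-1}, x_t] + b_c\right) \\[4pt]
\frac{\partial c_t}{\partial c_{t-1}} &\approx \operatorname{diag}(f_t)
  \qquad \text{(direct path only)} \\[4pt]
\frac{\partial c_t}{\partial c_{t-k}} &\approx \prod_{j=0}^{k-1} \operatorname{diag}(f_{t-j}) \\[4pt]
\text{vanilla RNN:} \quad
\frac{\partial h_t}{\partial h_{t-1}} &= \operatorname{diag}\!\left(1 - h_t^{2}\right) W_{hh}
\end{aligned}
```

In the LSTM the per-step factor is just the forget gate, which the network can learn to hold near 1. In the vanilla RNN every step multiplies by both the tanh derivative and W_hh, neither of which the network can easily keep close to the identity.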

An LSTM reads a long paragraph and encounters a closing quote mark. Which gate most likely activates?

The forget gate: to erase the "inside a quote" memory
The input gate: to store new information
The output gate: to produce a prediction

Chapter 6: Character-Level Language Models

Here's where it all comes together. A character-level language model reads text one character at a time and learns to predict the next character. Train it on Shakespeare and it writes Shakespeare-ish text. Train it on code and it writes code-ish text.

The setup is simple:

Input

One-hot encode the current character (e.g., "h" → [0,0,0,0,0,0,0,1,0,...,0])

↓

RNN Step

Feed into RNN with previous hidden state → get new h_t

↓

Output

W_hy · h_t → logits over all characters → softmax → probabilities

↓

Loss

Cross-entropy between predicted distribution and actual next character

↻ repeat for every character in the training text
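
The input encoding and the loss from the pipeline above can be sketched in a few lines. The tiny training text, the helper names, and the all-zero logits used in the example are hypothetical; in a real model the logits would come from W_hy · h_t.

```python
import numpy as np

text = "hello"                                   # hypothetical training text
vocab = sorted(set(text))                        # ['e', 'h', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}


def one_hot(ch):
    """Vector of zeros with a single 1 at the character's index."""
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v


def softmax(logits):
    e = np.exp(logits - np.max(logits))          # subtract max for stability
    return e / e.sum()


def cross_entropy(logits, target_ch):
    """Negative log-probability assigned to the actual next character."""
    return -np.log(softmax(logits)[char_to_ix[target_ch]])


# Each character is paired with the character that follows it.
pairs = list(zip(text[:-1], text[1:]))
print(pairs)                                      # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
print(one_hot("h"))                               # [0. 1. 0. 0.]
print(cross_entropy(np.zeros(len(vocab)), "e"))   # uniform guess: ln 4, about 1.386
```

An untrained model that spreads probability evenly over the vocabulary starts with a loss of ln(vocabulary size) per character; training pushes that number down.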

At generation time, we sample from the output distribution instead of picking the most likely character. A temperature parameter controls how adventurous the sampling is: low temperature means conservative (picks likely characters), high temperature means creative (more random).

P(c) = exp(logit_c / T) / ∑_j exp(logit_j / T)
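
As a minimal sketch of this formula in code: divide the logits by T, apply a softmax, and sample from the result. The function names and the example logits are hypothetical.

```python
import numpy as np


def temperature_softmax(logits, temperature=1.0):
    """P(c) = exp(logit_c / T) / sum_j exp(logit_j / T)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()      # subtract max for numerical stability
    e = np.exp(scaled)
    return e / e.sum()


def sample_index(logits, temperature=1.0, rng=None):
    """Draw one index from the temperature-adjusted distribution."""
    if rng is None:
        rng = np.random.default_rng()
    p = temperature_softmax(logits, temperature)
    return int(rng.choice(len(p), p=p))


logits = [2.0, 1.0, 0.1]                # hypothetical logits for three characters
for T in (0.5, 1.0, 2.0):
    print(T, np.round(temperature_softmax(logits, T), 3))

print(sample_index(logits, temperature=0.8, rng=np.random.default_rng(0)))
```

As T approaches 0 the distribution concentrates on the highest logit, and as T grows it approaches uniform.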

Temperature Effect on Sampling

Given raw logits for the next character after "th", see how temperature (default 1.0) changes the probability distribution.

Karpathy's result: A 3-layer LSTM with 512 hidden units per layer, trained on Shakespeare for a few hours, produces text that looks remarkably like Shakespeare — with correct indentation, stage directions, and dialogue structure — despite having zero knowledge of English.

What does lowering the temperature do during character generation?

Makes the model pick higher-probability characters more often (less random)
Makes training converge faster
Increases the hidden state size

Chapter 7: Showcase — Live Character-Level RNN

Time to put it all together. Below is a character-level RNN running entirely in your browser. It's been trained on English text patterns — type a seed phrase and watch it generate character by character, with the hidden state evolving in real time.

Interactive Character-Level RNN

Type seed text, then click Generate. Watch the hidden state bars shift as each character is produced. The demo exposes temperature (default 0.8) and speed (default 120 ms per character) controls, and displays the hidden state activation (16 dimensions) and the top 10 character probabilities.

What to try: Start with "the " at temperature 0.8 for coherent-looking text. Crank temperature to 2.0+ and watch it go wild. Try "q" and notice it almost always produces "u" next, because it learned English spelling patterns.

How this works: The RNN in your browser has pre-trained weights that encode English character statistics: common bigrams (th, he, in), trigrams (the, ing, tion), and even some longer patterns. It's a tiny model (16 hidden units) so it won't write Shakespeare, but it captures the feel of English.

Chapter 8: Beyond — Where RNNs Led

RNNs were the dominant sequence model from 2013 to 2017. Then the Transformer arrived and changed everything. But RNN ideas live on — and understanding them makes Transformers click.

Architecture	Year	Key Idea	Limitation
Vanilla RNN	1986	Hidden state = memory	Vanishing gradients
LSTM	1997	Gated cell state highway	Sequential (slow on GPU)
GRU	2014	Simplified LSTM (2 gates)	Still sequential
Attention	2015	Direct connections across time	Added to RNNs rather than replacing them
Transformer	2017	Attention is all you need	Quadratic in sequence length
SSM / Mamba	2023	RNN-like recurrence, done right	Active research area

The RNN → Transformer lineage: RNNs proved that learned representations of sequences are powerful. Attention (first added to RNNs for machine translation) proved that direct connections beat sequential memory. The Transformer dropped the recurrence and kept the attention. Full circle: modern SSMs like Mamba bring back recurrence — but with the lessons learned.

RNNs Still Shine For

Real-time streaming
Process one token at a time with constant memory — ideal for on-device inference.

Very long sequences
No quadratic attention cost. Memory is fixed regardless of sequence length.

Explore More

Now that you understand recurrence and gating, these lessons will make much more sense:

GPT — The Transformer that replaced RNNs for language modeling
Transformer — Attention is all you need, explained from zero
SSM / Mamba — Modern recurrence that rivals Transformers

"The unreasonable effectiveness of recurrent neural networks."

— Andrej Karpathy, 2015

RNNs taught us that a network with memory can model the structure of language, music, and code — one character at a time. That insight changed everything.
