RL² — Veanors

Chapter 0: The Problem

Deep RL has learned to play Atari from raw pixels, walk, run, and manipulate objects. But there is a brutal cost: the state-of-the-art Atari agent needs tens of thousands of episodes per game. To master a single game, you would need to play it for 40 days straight without sleeping.

Meanwhile, a human player in the same study needed only 2 hours.

Why the gap? Animals come equipped with prior knowledge. They have learned how learning works — how to explore efficiently, when to switch from exploration to exploitation, what kind of structure to expect in the world. Deep RL agents start from scratch every time.

The core tension: Bayesian RL provides a principled framework for incorporating prior knowledge, but exact Bayesian updates are intractable for complex environments. Practical algorithms like PILCO or guided policy search incorporate domain-specific assumptions that don't generalize. We want a general-purpose fast learning algorithm — but designing one by hand is hard. What if we could learn the learning algorithm itself?

Sample Efficiency Gap

How many episodes different agents need to master a task. The gap between deep RL and biological learners spans orders of magnitude.

Why do deep RL agents require millions of trials while animals learn in just a few?

Animals have prior knowledge about how learning works (they know how to explore and what structure to expect), while deep RL agents start from scratch every time
Animals have faster neurons
Deep RL uses the wrong loss function

Chapter 1: The Key Insight

Here is a radical idea: instead of designing a fast RL algorithm by hand, learn one from data.

How? Represent the learning algorithm as a recurrent neural network. The RNN receives everything a normal RL algorithm would receive — observations, actions, rewards, termination flags — and its hidden state is carried across episodes. Then train this RNN with standard RL across many different tasks.

After training, the RNN's weights encode a general-purpose learning algorithm. When you drop it into a new, unseen task, the RNN's activations (hidden state) implement the fast adaptation — essentially running the learned algorithm in its forward pass.

The name RL²: "RL squared" because there are two nested levels of reinforcement learning. The outer loop (slow) uses standard RL (TRPO) to optimize the RNN weights over many tasks. The inner loop (fast) is the RNN itself, adapting in real-time to a new task via its hidden state dynamics. The slow RL learns the fast RL.

This is "learning to learn" — meta-learning. But unlike approaches that engineer specific meta-learning architectures, RL² uses a completely general RNN and lets the optimization pressure shape it into a learning algorithm. The key constraint is that the hidden state must persist across episodes within a task, giving the RNN the capacity to accumulate information and adapt its strategy.

In RL², what encodes the "fast" learning algorithm?

The RNN's weights encode the algorithm; its hidden state activations implement the fast adaptation on each new task
A separate fast-learning module bolted onto the network
The reward function

Chapter 2: The RL² Framework

Let's make the two-loop structure precise. We have a distribution over MDPs, ρ_M. A trial is a sequence of n episodes on a single MDP sampled from this distribution.

The outer loop (slow)

Sample a new MDP M ~ ρ_M. Run the RNN agent for n episodes on M. Compute the total discounted reward across all n episodes. Use TRPO to update the RNN weights θ to maximize this total reward, averaged over many sampled MDPs.

The inner loop (fast)

Given a fixed set of weights θ, drop the RNN into a new, unseen MDP. The RNN starts with a blank hidden state. As it interacts with the environment — receiving states, choosing actions, seeing rewards, observing terminations — its hidden state evolves. This evolution is the fast learning. No gradient updates happen here; adaptation is purely through the RNN's forward dynamics.

A crucial design choice: The hidden state is preserved across episodes within a trial but reset between trials (different MDPs). This means information from episode 1 (what the agent explored, what rewards it got) is available in episode 2. The RNN can use early episodes to explore and later episodes to exploit — if it has learned to do so.
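The reset rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `ToyRecurrentAgent` is a hypothetical, untrained tanh RNN standing in for the policy, the task is a small Bernoulli bandit chosen for brevity, and `run_trial` is an assumed helper name. The point is only where the hidden state is reset (once per trial) and where it is not (between episodes).

```python
import numpy as np

class ToyRecurrentAgent:
    """Hypothetical stand-in for the RL² policy: an untrained tanh RNN."""

    def __init__(self, n_actions, hidden_dim=8, seed=0):
        self.rng = np.random.default_rng(seed)
        input_dim = n_actions + 2  # one-hot previous action, previous reward, done flag
        self.W_in = self.rng.normal(scale=0.3, size=(hidden_dim, input_dim))
        self.W_h = self.rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))
        self.W_out = self.rng.normal(scale=0.3, size=(n_actions, hidden_dim))
        self.hidden_dim = hidden_dim

    def initial_state(self):
        return np.zeros(self.hidden_dim)

    def step(self, h, x):
        # Update the hidden state, then sample an action from a softmax.
        h = np.tanh(self.W_in @ x + self.W_h @ h)
        logits = self.W_out @ h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return h, int(self.rng.choice(len(probs), p=probs))


def run_trial(agent, arm_probs, n_episodes, episode_len, rng):
    """Run n episodes on ONE task, carrying the hidden state across episodes."""
    k = len(arm_probs)
    h = agent.initial_state()  # reset once per trial (new task)
    prev_a, prev_r, prev_d = np.zeros(k), 0.0, 0.0
    episode_rewards = []
    for _ in range(n_episodes):
        rewards = []  # note: h is deliberately NOT reset here
        for t in range(episode_len):
            x = np.concatenate([prev_a, [prev_r, prev_d]])
            h, a = agent.step(h, x)
            r = float(rng.random() < arm_probs[a])
            prev_a, prev_r = np.eye(k)[a], r
            prev_d = float(t == episode_len - 1)  # flag the episode boundary
            rewards.append(r)
        episode_rewards.append(rewards)
    return episode_rewards, h


rng = np.random.default_rng(1)
agent = ToyRecurrentAgent(n_actions=3)
episode_rewards, final_h = run_trial(agent, arm_probs=[0.2, 0.5, 0.8],
                                     n_episodes=4, episode_len=5, rng=rng)
```

Moving the `agent.initial_state()` call inside the episode loop would turn this into an ordinary recurrent policy with no memory of earlier episodes.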

Two-Loop Structure

The outer loop samples MDPs and updates weights via TRPO. The inner loop is the RNN's forward pass across episodes of a single MDP — hidden state carries over between episodes.

What happens to the RNN's hidden state between episodes of the same MDP?

It is preserved: information from earlier episodes carries into later ones, enabling the RNN to accumulate knowledge and shift from exploration to exploitation
It is reset to zero at the start of each episode
It is copied from a reference network

Chapter 3: The RNN as an RL Algorithm

At every timestep, the RNN receives a tuple of four quantities:

s_t: the current observation (state) from the environment
a_{t-1}: the previous action taken by the agent
r_{t-1}: the reward received from the previous action
d_{t-1}: termination flag (1 if the previous episode ended, 0 otherwise)

These inputs are embedded through a function φ(s, a, r, d) and fed into a GRU (Gated Recurrent Unit). The GRU output passes through a fully connected layer and softmax to produce a distribution over actions.
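As a rough sketch of this pipeline, the NumPy code below embeds the four inputs, applies one GRU update, and produces action probabilities. Several details are assumptions made for illustration: the embedding uses one-hot codes for a discrete state and action, the parameter names and initialization scale are hypothetical, and the update uses one common GRU convention (libraries differ on which term the update gate `z` multiplies).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def embed(s, prev_a, prev_r, prev_d, n_states, n_actions):
    """One simple choice of phi: one-hot state and action, plus raw r and d."""
    x = np.zeros(n_states + n_actions + 2)
    x[s] = 1.0
    if prev_a is not None:  # there is no previous action on the very first step
        x[n_states + prev_a] = 1.0
    x[-2], x[-1] = prev_r, prev_d
    return x

def init_params(input_dim, hidden_dim, n_actions, rng):
    """Hypothetical parameter layout: three GRU gates plus an output layer."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    p = {}
    for gate in ("z", "r", "h"):
        p["W" + gate] = mat(hidden_dim, input_dim)
        p["U" + gate] = mat(hidden_dim, hidden_dim)
        p["b" + gate] = np.zeros(hidden_dim)
    p["Wo"], p["bo"] = mat(n_actions, hidden_dim), np.zeros(n_actions)
    return p

def gru_policy_step(p, h, x):
    """One GRU update followed by a fully connected layer and softmax."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])          # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])          # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    h_new = (1.0 - z) * h + z * h_cand
    logits = p["Wo"] @ h_new + p["bo"]
    probs = np.exp(logits - logits.max())
    return h_new, probs / probs.sum()


rng = np.random.default_rng(0)
n_states, n_actions, hidden_dim = 10, 5, 16
params = init_params(n_states + n_actions + 2, hidden_dim, n_actions, rng)
h0 = np.zeros(hidden_dim)
x0 = embed(s=3, prev_a=None, prev_r=0.0, prev_d=0.0,
           n_states=n_states, n_actions=n_actions)
h1, action_probs = gru_policy_step(params, h0, x0)
```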

Why these four inputs matter: This is exactly the information a hand-designed RL algorithm would use. An algorithm like UCB1 maintains visit counts and mean rewards — those are functions of past (s, a, r, d). Thompson sampling maintains a posterior distribution — also a function of past data. By feeding the RNN the same raw information, we give it the capacity to implement any such algorithm in its hidden state. The question is whether optimization actually discovers a good one.

Why GRUs? Vanilla RNNs suffer from vanishing and exploding gradients, making it hard to carry information across many timesteps. GRUs (and LSTMs) have gating mechanisms that selectively preserve or overwrite information. Since the inner loop requires remembering what happened across many timesteps and multiple episodes, this long-range memory is essential.

The authors also experimented with architectures that explicitly reset part of the hidden state between episodes, but found no improvement. The simple approach — preserve the full hidden state — works because the termination flag d already signals episode boundaries, and the GRU can learn to use this signal however it wants.

Why does the RNN receive the termination flag d as input?

So it knows when an episode has ended and a new one begins, enabling it to distinguish exploration data from different episodes and adjust its strategy accordingly
To reset the hidden state automatically
To compute the discount factor

Chapter 4: Training

The outer loop trains the RNN using Trust Region Policy Optimization (TRPO). But the "environment" in the outer loop is not a single MDP; it is a meta-learning environment in which each iteration runs the following steps:

Sample an MDP M ~ ρ_M
Run the RNN for n episodes on M, preserving hidden state across episodes
The total reward across all n episodes is the return for this "trial"
Use TRPO to update θ to maximize expected return across many trials

Why total reward across ALL episodes? This is the key pressure that forces the RNN to learn to learn. If we only maximized reward in the last episode, the agent could ignore early episodes entirely. By maximizing total reward across the trial, the agent must balance: explore enough early to learn the task, but don't waste too many episodes exploring — every episode counts.
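A small numeric illustration of this pressure, using made-up reward sequences. The helper names are hypothetical, and discounting over the concatenated trial (rather than restarting the discount in each episode) is an assumption of this sketch.

```python
import numpy as np

def discounted_return(rewards, gamma):
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(rewards * gamma ** np.arange(len(rewards))))

def trial_return(episode_rewards, gamma):
    """Return over the whole trial: all episodes concatenated into one sequence."""
    flat = np.concatenate([np.asarray(ep, dtype=float) for ep in episode_rewards])
    return discounted_return(flat, gamma)

def last_episode_return(episode_rewards, gamma):
    """Alternative objective that only looks at the final episode."""
    return discounted_return(episode_rewards[-1], gamma)


# Two hypothetical agents on a 3-episode trial, 2 steps per episode.
fast_learner = [[0, 0], [1, 1], [1, 1]]   # explores for one episode
slow_learner = [[0, 0], [0, 0], [1, 1]]   # explores for two episodes

fast_trial = trial_return(fast_learner, gamma=1.0)
slow_trial = trial_return(slow_learner, gamma=1.0)
fast_last = last_episode_return(fast_learner, gamma=1.0)
slow_last = last_episode_return(slow_learner, gamma=1.0)
```

With `gamma=1.0` the last-episode objective scores both agents at 2.0, so it cannot tell them apart, while the trial-level objective scores the fast learner at 4.0 and the slow learner at 2.0.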

Variance reduction

The baseline (critic) is also an RNN with GRUs, receiving the same inputs as the policy RNN. This is important: the baseline must also be able to condition on the trial history to estimate expected future return accurately. A simple state-dependent baseline would be useless in the meta-learning setting because the same state can have very different values depending on what the agent has learned so far.

The authors optionally apply Generalized Advantage Estimation (GAE) for further variance reduction, controlled by hyperparameter λ.
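For reference, a minimal sketch of the standard GAE recursion. Here `values` is assumed to come from the recurrent baseline evaluated along the trial, with one extra bootstrap entry at the end; the function name and the example numbers are illustrative only.

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation.

    rewards: length T
    values:  length T + 1 (baseline predictions, last entry is the bootstrap value)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.5, 0.5, 0.0]
adv_td = gae_advantages(rewards, values, gamma=1.0, lam=0.0)   # pure one-step TD errors
adv_mc = gae_advantages(rewards, values, gamma=1.0, lam=1.0)   # Monte Carlo return minus baseline
```

Setting λ = 0 gives low-variance but more biased one-step estimates; λ = 1 recovers the full return minus the baseline; values in between trade the two off.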

Training: Outer Loop TRPO

Figure: one outer-loop training step. Sample an MDP, run the RNN for multiple episodes (hidden state preserved), compute total reward, then update weights.

Why does RL² maximize total reward across ALL episodes in a trial, not just the last one?

This forces the agent to balance exploration and exploitation: it must learn quickly because every episode's reward counts, preventing wasteful exploration
Because later episodes have lower reward
To make TRPO converge faster

Chapter 5: Multi-Armed Bandits

The first test: can RL² learn a good exploration strategy for multi-armed bandit problems? A multi-armed bandit is a stateless MDP with k actions (arms). Pull arm i, receive reward drawn from Bernoulli(p_i). The challenge: you don't know p_i values, so you must balance exploring arms to estimate them vs. exploiting the best-known arm.

Setup

Each bandit problem is generated by sampling p_i ~ Uniform(0,1) for each arm. The RNN is trained across many such randomly generated bandits. At test time, it faces new, unseen bandit instances.
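The task distribution described here is simple enough to write down directly. In this sketch the class name `BernoulliBandit` and the choice of 100 training tasks are assumptions for illustration.

```python
import numpy as np

class BernoulliBandit:
    """A k-armed bandit whose arm probabilities are drawn from Uniform(0, 1)."""

    def __init__(self, k, rng):
        self.rng = rng
        self.p = rng.uniform(0.0, 1.0, size=k)  # hidden from the agent

    def pull(self, arm):
        # Reward is 1 with probability p[arm], else 0.
        return float(self.rng.random() < self.p[arm])


rng = np.random.default_rng(0)
train_tasks = [BernoulliBandit(k=5, rng=rng) for _ in range(100)]  # training distribution
test_task = BernoulliBandit(k=5, rng=rng)                           # a fresh, unseen instance
sample_rewards = [test_task.pull(0) for _ in range(10)]
```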

Baselines

The paper compares RL² against algorithms with decades of theoretical backing:

Gittins Index — the Bayes-optimal solution for discounted infinite-horizon bandits
Thompson Sampling — sample from the posterior, play the best arm in the sample
UCB1 — play the arm with the highest upper confidence bound
ε-Greedy and Greedy baselines
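To make the hand-designed baselines concrete, here is a minimal sketch of three of them on a Bernoulli bandit. The Gittins index is omitted because computing it takes considerably more machinery. Function names, the uniform Beta(1, 1) prior for Thompson sampling, and ε = 0.1 are assumptions of this sketch, not settings taken from the paper.

```python
import math
import numpy as np

def thompson_arm(wins, losses, rng):
    """Sample each arm's success rate from its Beta posterior, play the best sample."""
    samples = rng.beta(wins + 1.0, losses + 1.0)
    return int(np.argmax(samples))

def ucb1_arm(wins, losses, rng=None):
    """Play each arm once, then maximize empirical mean plus confidence bonus."""
    pulls = wins + losses
    if np.any(pulls == 0):
        return int(np.argmin(pulls))
    bonus = np.sqrt(2.0 * math.log(pulls.sum()) / pulls)
    return int(np.argmax(wins / pulls + bonus))

def epsilon_greedy_arm(wins, losses, rng, eps=0.1):
    """With probability eps pick a random arm, otherwise the best empirical mean."""
    if rng.random() < eps:
        return int(rng.integers(len(wins)))
    means = wins / np.maximum(wins + losses, 1.0)
    return int(np.argmax(means))

def play(select_arm, p, n_pulls, rng):
    """Run one strategy for n_pulls on a bandit with success probabilities p."""
    wins, losses = np.zeros(len(p)), np.zeros(len(p))
    for _ in range(n_pulls):
        arm = select_arm(wins, losses, rng)
        if rng.random() < p[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins.sum()


rng = np.random.default_rng(0)
p = rng.uniform(size=5)
totals = {
    "thompson": play(thompson_arm, p, 100, rng),
    "ucb1": play(ucb1_arm, p, 100, rng),
    "eps_greedy": play(epsilon_greedy_arm, p, 100, rng),
}
```

Each strategy is a fixed rule over the counts `wins` and `losses`. RL² is given the same raw information and has to find its own rule.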

The result: RL² achieves performance comparable to the Gittins index — a theoretically optimal algorithm — across most settings. In settings with k=5 or k=10 arms, RL² is statistically indistinguishable from optimal. The RNN has learned, from scratch, an exploration strategy that rivals algorithms designed by humans over 40+ years of research.

Bandit Exploration Strategy

Figure: simulated RL² agent vs. Greedy on a 5-arm bandit, both playing 50 rounds on freshly sampled arm probabilities. RL² explores systematically before committing.

There is a gap in the hardest setting (k=50 arms, n=500 episodes). To check whether this is an architecture limitation or an optimization limitation, the authors trained the same RNN architecture via supervised learning on trajectories from the Gittins index. The supervised policy matched Gittins performance, which indicates that the architecture has enough capacity and that the bottleneck is the RL optimization in the outer loop.

In multi-armed bandits, what does RL² learn that makes it competitive with the Gittins index?

An implicit exploration-exploitation strategy encoded in the RNN's hidden state, effectively rediscovering near-optimal Bayesian exploration from scratch
A lookup table of arm values
The exact Gittins index formula

Chapter 6: Tabular MDPs

Bandits are stateless — they test exploration vs. exploitation but not sequential decision making. The next test: randomly generated tabular MDPs with |S|=10 states, |A|=5 actions, horizon T=10 per episode.

The distribution over MDPs

Transition probabilities are sampled from a flat Dirichlet distribution. Rewards are Gaussian with unit variance, means sampled from Normal(1,1). This matches the standard prior used in Bayesian RL methods, giving those methods their best-case advantage.
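A minimal sketch of sampling from this distribution, assuming NumPy. The function names `sample_tabular_mdp` and `env_step` are hypothetical, and the start state used in the example is an assumption, since the text does not specify one.

```python
import numpy as np

def sample_tabular_mdp(n_states=10, n_actions=5, rng=None):
    """Draw one random MDP: flat Dirichlet transitions, Normal(1, 1) reward means."""
    rng = np.random.default_rng() if rng is None else rng
    # P[s, a] is a probability vector over next states; shape (S, A, S).
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R_mean = rng.normal(1.0, 1.0, size=(n_states, n_actions))
    return P, R_mean

def env_step(P, R_mean, s, a, rng):
    """Sample a unit-variance Gaussian reward and the next state."""
    r = rng.normal(R_mean[s, a], 1.0)
    s_next = int(rng.choice(P.shape[-1], p=P[s, a]))
    return s_next, float(r)


rng = np.random.default_rng(0)
P, R_mean = sample_tabular_mdp(rng=rng)
s_next, r = env_step(P, R_mean, s=0, a=2, rng=rng)
```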

Baselines

PSRL (Posterior Sampling RL) — Thompson sampling generalized to MDPs: sample an MDP from the posterior, solve for the optimal policy, play it for one episode, update the posterior
OPSRL — optimistic variant of PSRL
UCRL2 — compute optimal policy for the most optimistic MDP consistent with observations
BEB — add exploration bonus to unvisited state-action pairs
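To show what "estimate the full MDP and solve it" involves, here is a compact sketch of PSRL under the prior described above (flat Dirichlet transitions, unit-variance Gaussian rewards with Normal(1, 1) means). It is an illustration under stated assumptions, not the baseline implementation used in the paper: the function names, the fixed start state 0, and the finite-horizon planner are choices made for this sketch.

```python
import numpy as np

def solve_finite_horizon(P, R, horizon):
    """Backward induction. Returns policy[t, s], the greedy action at step t."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in reversed(range(horizon)):
        Q = R + P @ V                     # shape (S, A)
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

def psrl(P_true, R_true, n_episodes, horizon, rng):
    S, A, _ = P_true.shape
    trans_counts = np.ones((S, A, S))     # flat Dirichlet prior
    r_sum, r_n = np.zeros((S, A)), np.zeros((S, A))
    returns = []
    for _ in range(n_episodes):
        # 1. Sample one plausible MDP from the current posterior.
        P_s = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                        for s in range(S)])
        post_prec = 1.0 + r_n             # prior precision 1, unit-variance rewards
        post_mean = (1.0 + r_sum) / post_prec
        R_s = rng.normal(post_mean, 1.0 / np.sqrt(post_prec))
        # 2. Solve the sampled MDP and follow its policy for one episode.
        policy = solve_finite_horizon(P_s, R_s, horizon)
        s, total = 0, 0.0                 # assumed fixed start state
        for t in range(horizon):
            a = policy[t, s]
            r = rng.normal(R_true[s, a], 1.0)
            s_next = int(rng.choice(S, p=P_true[s, a]))
            # 3. Update the posterior statistics with the observed transition.
            trans_counts[s, a, s_next] += 1
            r_sum[s, a] += r
            r_n[s, a] += 1
            total += r
            s = s_next
        returns.append(total)
    return np.array(returns)


rng = np.random.default_rng(0)
P_true = rng.dirichlet(np.ones(10), size=(10, 5))
R_true = rng.normal(1.0, 1.0, size=(10, 5))
episode_returns = psrl(P_true, R_true, n_episodes=10, horizon=10, rng=rng)
```

Every episode requires sampling and solving a full model with hundreds of parameters, which is the cost RL² avoids by adapting purely through its hidden state.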

Surprising result for small n: When the number of episodes is small (n=10), RL² outperforms all baselines by a large margin. Why? With only 10 episodes of horizon 10, you get at most 100 transitions, far too few to estimate the roughly 500 free parameters of the MDP (10×5 = 50 reward means plus 10×5×9 = 450 independent transition probabilities, since each row of 10 probabilities must sum to 1). Bayesian methods try to estimate the full MDP and solve it, which is wasteful when data is so scarce. RL² learns to exploit sooner: it discovers that aggressive exploitation is better than thorough exploration when the budget is tiny.

As n grows (50, 75, 100 episodes), the Bayesian methods catch up and eventually match or exceed RL². With more data, thorough exploration pays off, and the Bayesian algorithms' theoretical guarantees kick in. The RL optimization in the outer loop also becomes harder with longer trials.

Why does RL² outperform Bayesian methods like PSRL when the number of episodes is very small (n=10)?

With too few episodes to estimate the full MDP, RL² learns to exploit aggressively rather than wasting episodes on thorough exploration, a strategy that Bayesian methods don't adopt
RL² uses a better discount factor
The Bayesian methods have bugs in their implementation

Chapter 7: Visual Navigation

The previous experiments used tiny state spaces. Can RL² scale to high-dimensional observations? The paper tests on a vision-based maze navigation task using the ViZDoom engine.

The task

The agent sees the maze from a first-person view (raw pixels). A target block (red) is placed somewhere in a randomly generated 5×5 maze. Rewards: +1 for reaching the target, −0.001 for hitting walls, −0.04 per timestep. The agent gets 2 episodes per trial (same maze, same target), each up to 250 steps.
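A quick worked calculation shows how strongly this reward structure favors short paths. The helper below is hypothetical and assumes the three reward terms simply add up, with the time penalty charged on every step including the last one.

```python
def maze_episode_return(steps, wall_hits=0, reached_target=True):
    """Undiscounted episode return under the reward scheme described above."""
    goal_reward = 1.0 if reached_target else 0.0
    return goal_reward - 0.001 * wall_hits - 0.04 * steps


exploring = maze_episode_return(steps=50)                        # about -1.0
direct = maze_episode_return(steps=40)                           # about -0.6
timeout = maze_episode_return(steps=250, reached_target=False)   # about -10.0
```

Under these assumptions, reaching the target after 25 steps breaks even, so the time penalty dominates the +1 bonus on anything but a short path, and shaving ten steps off a path is worth 0.4 reward.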

The optimal strategy: In episode 1, explore the maze to find the target. In episode 2, navigate directly to it using what you learned in episode 1. This requires the RNN to remember the maze layout and target location across episodes — using only the hidden state, with no explicit map or memory module.

Results

The agent achieves 99.3% success rate in episode 1 (finding the target via exploration) and 99.6% in episode 2. Crucially, average trajectory length drops from 52.4 steps in episode 1 to 39.1 in episode 2 — the agent is taking more direct paths because it remembers where the target is.

Even more impressive: the agent generalizes to 9×9 mazes it was never trained on, achieving 97.1% success. And it maintains performance across 5 episodes (not just the 2 it was trained with).

Maze Navigation: Episode 1 vs 2

Simulated maze navigation. Episode 1: the agent explores. Episode 2: it takes a more direct path. Click "New Maze" to generate a random layout.

Click to simulate

Failure modes

The behavior is not always perfect. Sometimes the agent "forgets" the target location and continues exploring in episode 2. The authors observe this happens occasionally and attribute it to the difficulty of the outer-loop RL optimization. Better outer-loop algorithms should reduce these failures.

How does the RL² agent demonstrate that it has learned to reuse information across episodes in the maze task?

Average trajectory length drops from 52 steps (episode 1) to 39 steps (episode 2): the agent takes more direct paths because it remembers the target location
It achieves 100% success rate
It builds an explicit map in memory

Chapter 8: What the RNN Learns

The most fascinating question: what algorithm does the RNN actually implement? We can't inspect the hidden state directly (it's a high-dimensional, distributed representation), but we can infer the algorithm from the agent's behavior.

Implicit Thompson sampling

In the bandit setting, the RNN's behavior closely matches Thompson sampling — the Bayesian strategy of sampling from the posterior distribution over arm parameters and playing the best arm in the sample. The RNN appears to maintain an implicit posterior in its hidden state, updating it as rewards are observed.

Posterior updating

When the agent receives a reward from an arm, its future behavior shifts as if it has updated a belief distribution. High rewards on an arm increase the probability of pulling that arm again, but not monotonically — the agent shows diminishing returns from additional samples of the same arm, consistent with Bayesian updating where the posterior narrows.
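For comparison, this is what explicit Bayesian updating looks like for a single Bernoulli arm with a Beta prior. It is the textbook analogue of the behavior described above, not a claim about what the RNN computes internally; the function name and the uniform Beta(1, 1) prior are assumptions.

```python
def beta_posterior(wins, losses, a0=1.0, b0=1.0):
    """Mean and variance of the Beta posterior over one arm's success rate."""
    a, b = a0 + wins, b0 + losses
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


prior = beta_posterior(0, 0)        # no observations yet
few = beta_posterior(3, 1)          # 4 pulls
many = beta_posterior(30, 10)       # 40 pulls, same win rate
```

The posterior variance falls from about 0.083 with no data to about 0.032 after 4 pulls and about 0.0045 after 40 pulls, so each additional sample of the same arm changes the belief less than the one before it.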

Uncertainty-driven exploration

The agent preferentially explores actions it is uncertain about. Arms that haven't been pulled receive more exploration, especially in early episodes. This mirrors UCB-style optimism in the face of uncertainty — but the RNN wasn't told to do this. It discovered this strategy through optimization pressure.

The deep insight: The RNN has independently rediscovered core principles of Bayesian decision theory — posterior updating, Thompson sampling, optimism under uncertainty — without any explicit Bayesian machinery. These principles emerge naturally from optimizing total reward across many tasks. The "prior" is encoded in the weights; the "posterior update" is the hidden state dynamics; the "sampling" is the stochastic policy output.

Implicit Bayesian Updating

Schematic of how the RNN's hidden state implements implicit posterior updating. After each reward, the effective "belief" over arm quality narrows. Blue = prior/wide uncertainty, teal = posterior after observations.


What exploration strategy does the RL² agent's behavior most closely resemble in the bandit setting?

Implicit Thompson sampling: the RNN maintains an approximate posterior in its hidden state and samples actions proportionally, a strategy it discovered without any explicit Bayesian machinery
Pure random exploration
Exhaustive enumeration of all possible arm orderings

Chapter 9: Connections

What came before

Bayesian RL (Strens, 2000; Ghavamzadeh et al., 2015): The theoretical framework for incorporating prior knowledge into RL. RL² can be seen as an approximate, scalable implementation of Bayesian RL — the RNN implicitly performs Bayesian inference in its hidden state.

Learning to learn (Thrun & Pratt, 1998): The broader meta-learning program. RL² instantiates this idea in RL by encoding the learning algorithm in a neural network.

Memory-Augmented Neural Networks (Santoro et al., 2016): Used external memory for meta-learning in supervised settings. RL² achieves similar goals using only the RNN's internal hidden state.

What came after

MAML (Finn et al., 2017): Model-Agnostic Meta-Learning takes a different approach — learn an initialization that can be quickly fine-tuned with a few gradient steps. MAML does explicit gradient updates in the inner loop; RL² does implicit adaptation via forward dynamics. Both are meta-learning, but with fundamentally different inner-loop mechanisms.

In-Context Learning in LLMs (Brown et al., 2020): GPT-3's ability to learn from examples in its context window is strikingly similar to RL². A transformer is trained on many tasks (outer loop); at inference, it adapts to new tasks via its forward pass (inner loop) without weight updates. RL² arguably foreshadowed this phenomenon — both show that a sequence model trained on many tasks can implement a learning algorithm in its activations.

Decision Transformers (Chen et al., 2021): Formulate RL as sequence modeling, feeding (s, a, r) tuples to a transformer. This extends the RL² philosophy: instead of an RNN, use a transformer as the "learning algorithm."

RL²'s legacy: This paper was among the first to show that a neural network can learn a reinforcement learning algorithm purely from experience, without being told what exploration or exploitation mean. It opened the door to meta-RL as a field, and the core insight — that a sequence model's forward pass can implement adaptation — is now understood as a unifying principle connecting meta-learning, in-context learning, and foundation models.

Cheat sheet

Core idea: learn the RL algorithm itself; encode it in RNN weights, run it via forward pass
Outer loop: TRPO across many sampled MDPs, updating RNN weights (slow learning)
Inner loop: RNN forward pass on a new MDP, with the hidden state adapting (fast learning)
Key inputs: (s, a, r, d), i.e. observation, previous action, reward, termination flag
Impact: pioneered meta-RL; foreshadowed in-context learning in LLMs

How does RL² connect to in-context learning in large language models like GPT-3?

Both use a sequence model trained on many tasks (outer loop) that adapts to new tasks via its forward pass (inner loop) without weight updates; the model's activations implement the learning algorithm
Both use TRPO for training
GPT-3 was trained using RL² directly

RL²: Fast RL via Slow RL
