Instead of hand-designing a fast reinforcement learning algorithm, learn one: encode the entire learning algorithm in the weights of an RNN, trained slowly across many tasks.
Deep RL has learned to play Atari from raw pixels, walk, run, and manipulate objects. But there is a brutal cost: the state-of-the-art Atari agent (Mnih et al., 2015) needs tens of thousands of episodes per game. At that rate, mastering a single game would take about 40 days of non-stop play.
Meanwhile, a human player in the same study needed only 2 hours.
Why the gap? Humans and animals come equipped with prior knowledge. They have learned how learning works: how to explore efficiently, when to switch from exploration to exploitation, and what kind of structure to expect in the world. Deep RL agents start from scratch every time.
How many episodes different agents need to master a task. The gap between deep RL and biological learners spans orders of magnitude.
Here is a radical idea: instead of designing a fast RL algorithm by hand, learn one from data.
How? Represent the learning algorithm as a recurrent neural network. The RNN receives everything a normal RL algorithm would receive — observations, actions, rewards, termination flags — and its hidden state is carried across episodes. Then train this RNN with standard RL across many different tasks.
After training, the RNN's weights encode a general-purpose learning algorithm. When you drop it into a new, unseen task, the RNN's activations (hidden state) implement the fast adaptation — essentially running the learned algorithm in its forward pass.
This is "learning to learn" — meta-learning. But unlike approaches that engineer specific meta-learning architectures, RL² uses a completely general RNN and lets the optimization pressure shape it into a learning algorithm. The key constraint is that the hidden state must persist across episodes within a task, giving the RNN the capacity to accumulate information and adapt its strategy.
Let's make the two-loop structure precise. We have a distribution over MDPs, ρ_M. A trial is a sequence of n episodes on a single MDP sampled from this distribution.
The outer loop (slow learning) repeats four steps: sample a new MDP M ~ ρ_M; run the RNN agent for n episodes on M; compute the total discounted reward across all n episodes; and use TRPO to update the RNN weights θ to maximize this total reward, averaged over many sampled MDPs.
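As a minimal sketch of the quantity the outer loop maximizes, the snippet below treats a trial as one long reward sequence and discounts across episode boundaries. The function name `trial_return`, the default discount, and the choice to keep the discount exponent running across episodes are assumptions made for illustration.

```python
def trial_return(episode_rewards, gamma=0.99):
    """Discounted return over a whole trial.

    episode_rewards: list of episodes, each a list of per-step rewards.
    Assumption of this sketch: the discount exponent keeps counting
    across episode boundaries, so the trial is one long sequence.
    """
    total, t = 0.0, 0
    for rewards in episode_rewards:
        for r in rewards:
            total += (gamma ** t) * r
            t += 1
    return total

# Two episodes on the same MDP: a poor exploratory one, then a better one.
trial = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
print(trial_return(trial, gamma=1.0))  # 4.0, the undiscounted sum
```

Because the objective spans the whole trial, a low-reward exploratory first episode is acceptable whenever it raises the reward collected in later episodes.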
In the inner loop (fast learning), the weights θ are fixed. Drop the RNN into a new, unseen MDP. The RNN starts with a blank hidden state. As it interacts with the environment (receiving states, choosing actions, seeing rewards, observing terminations), its hidden state evolves. This evolution is the fast learning. No gradient updates happen here; adaptation happens purely through the RNN's forward dynamics.
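The inner loop can be sketched as follows. The environment resets at every episode, but the hidden state is initialized only once per trial. Everything here is a hypothetical stand-in: `OneStepBandit` is a toy MDP, and `policy_step` is a hand-written rule that plays the role of the trained RNN, with a dictionary as its "hidden state".

```python
def run_trial(env, policy_step, initial_hidden, n_episodes):
    """Run n_episodes on one MDP; the hidden state is set only once per trial."""
    h = initial_hidden            # blank hidden state at the start of the trial
    a, r, d = 0, 0.0, 0           # dummy previous action, reward, done flag
    returns = []
    for _ in range(n_episodes):
        s = env.reset()           # the environment resets every episode...
        done, ep_ret = False, 0.0
        while not done:
            a, h = policy_step(h, (s, a, r, d))   # ...but h is carried over
            s, r, done = env.step(a)
            d = int(done)
            ep_ret += r
        returns.append(ep_ret)
    return returns, h

class OneStepBandit:
    """Toy MDP: each episode is a single pull of one of two arms."""
    def __init__(self, payouts):
        self.payouts = payouts
    def reset(self):
        return 0                  # single dummy state
    def step(self, action):
        return 0, self.payouts[action], True

def policy_step(h, x):
    """Hand-written stand-in for the RNN: h maps each tried arm to its last reward."""
    _, prev_a, prev_r, prev_d = x
    h = dict(h)
    if prev_d:                    # the previous episode ended, so its reward is real
        h[prev_a] = prev_r
    untried = [arm for arm in (0, 1) if arm not in h]
    action = untried[0] if untried else max(h, key=h.get)
    return action, h

returns, _ = run_trial(OneStepBandit([0.2, 0.9]), policy_step, {}, n_episodes=4)
print(returns)  # [0.2, 0.9, 0.9, 0.9]
```

The stand-in policy tries each arm once and then commits to the better one, which is only possible because information survives the episode boundary.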
The outer loop samples MDPs and updates weights via TRPO. The inner loop is the RNN's forward pass across episodes of a single MDP — hidden state carries over between episodes.
At every timestep, the RNN receives a tuple of four quantities: the current state s, the previous action a, the previous reward r, and the termination flag d.
These inputs are embedded through a function φ(s, a, r, d) and fed into a GRU (Gated Recurrent Unit). The GRU output passes through a fully connected layer and softmax to produce a distribution over actions.
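One plausible form of φ for a small discrete problem is sketched below: one-hot encodings of the state and action, concatenated with the raw reward and the termination flag. The specific encoding is an assumption for illustration, and the GRU between the embedding and the softmax is omitted here.

```python
import math

def embed(state, prev_action, prev_reward, done, n_states, n_actions):
    """phi(s, a, r, d): one-hot state, one-hot action, raw reward, done flag."""
    s_onehot = [1.0 if i == state else 0.0 for i in range(n_states)]
    a_onehot = [1.0 if i == prev_action else 0.0 for i in range(n_actions)]
    return s_onehot + a_onehot + [float(prev_reward), float(done)]

def softmax(logits):
    """Turn output-layer logits into a probability distribution over actions."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = embed(state=2, prev_action=1, prev_reward=0.5, done=0, n_states=3, n_actions=2)
print(x)                    # [0.0, 0.0, 1.0, 0.0, 1.0, 0.5, 0.0]
print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```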
Why GRUs? Vanilla RNNs suffer from vanishing and exploding gradients, making it hard to carry information across many timesteps. GRUs (and LSTMs) have gating mechanisms that selectively preserve or overwrite information. Since the inner loop requires remembering what happened over entire episodes, this long-range memory is essential.
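To make the gating argument concrete, here is a scalar GRU step next to a scalar vanilla RNN step. The parameters are hand-picked rather than learned, and the convention used (an update gate z near 0 keeps the old state) is one of several in the literature. The comparison shows the forward-pass side of the problem: with the gate nearly closed, the GRU holds a value for 100 steps, while the plain recurrence forgets it.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, p):
    """One scalar GRU step; in this convention z near 0 preserves h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])                # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate
    return (1.0 - z) * h + z * h_tilde

def vanilla_rnn_step(h, x, w_h=0.5, w_x=1.0):
    """Plain recurrence: the old state is squashed and rescaled every step."""
    return math.tanh(w_h * h + w_x * x)

# Hand-picked parameters: a strongly negative bias keeps the update gate shut.
params = dict(wz=0.0, uz=0.0, bz=-10.0,
              wr=0.0, ur=0.0, br=0.0,
              wh=1.0, uh=0.0, bh=0.0)

h_gru = h_rnn = 0.8
for _ in range(100):                 # 100 steps with no informative input
    h_gru = gru_step(h_gru, 0.0, params)
    h_rnn = vanilla_rnn_step(h_rnn, 0.0)

print(h_gru)  # still close to 0.8
print(h_rnn)  # essentially 0
```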
The authors also experimented with architectures that explicitly reset part of the hidden state between episodes, but found no improvement. The simple approach — preserve the full hidden state — works because the termination flag d already signals episode boundaries, and the GRU can learn to use this signal however it wants.
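The outer loop trains the RNN using Trust Region Policy Optimization (TRPO). But the "environment" in the outer loop is not a single MDP. It is the meta-learning environment, in which one rollout is a whole trial: n episodes on a freshly sampled MDP, scored by the total discounted reward across those episodes.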
The baseline (critic) is also an RNN with GRUs, receiving the same inputs as the policy RNN. This is important: the baseline must also be able to condition on the trial history to estimate expected future return accurately. A simple state-dependent baseline would be useless in the meta-learning setting because the same state can have very different values depending on what the agent has learned so far.
The authors optionally apply Generalized Advantage Estimation (GAE) for further variance reduction, controlled by hyperparameter λ.
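For reference, the standard GAE recursion can be written in a few lines. This is a generic sketch rather than the authors' code; in the RL² setting the `values` would come from the recurrent baseline and the trajectory would span a whole trial. The default `gamma` and `lam` are illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    values has len(rewards) + 1 entries; the last one is the bootstrap
    value (0.0 if the trajectory ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))  # [2.0, 1.0]
print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=0.0))  # [1.0, 1.0]
```

With λ = 1 the estimate reduces to the full return minus the baseline (low bias, high variance); with λ = 0 it reduces to the one-step TD error (higher bias, lower variance).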
One simulated outer-loop training step: sample an MDP, run the RNN for multiple episodes with the hidden state preserved, compute the total reward, then update the weights.
The first test: can RL² learn a good exploration strategy for multi-armed bandit problems? A multi-armed bandit is a stateless MDP with k actions (arms). Pull arm i, receive a reward drawn from Bernoulli(p_i). The challenge: you don't know the p_i values, so you must balance exploring arms to estimate them against exploiting the best-known arm.
Each bandit problem is generated by sampling p_i ~ Uniform(0,1) for each arm. The RNN is trained across many such randomly generated bandits. At test time, it faces new, unseen bandit instances.
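The task distribution for this experiment is simple enough to write down directly. The class below is an illustrative sketch (the name `BernoulliBandit` and its interface are assumptions): each new instance draws fresh arm probabilities, and each pull returns 0 or 1.

```python
import random

class BernoulliBandit:
    """A k-armed bandit with success probabilities p_i ~ Uniform(0, 1)."""
    def __init__(self, k, rng):
        self.rng = rng
        self.probs = [rng.random() for _ in range(k)]

    def pull(self, arm):
        """Return 1.0 with probability p_arm, else 0.0."""
        return 1.0 if self.rng.random() < self.probs[arm] else 0.0

rng = random.Random(0)
bandit = BernoulliBandit(k=5, rng=rng)      # one sampled task
rewards = [bandit.pull(0) for _ in range(1000)]
print(bandit.probs[0], sum(rewards) / len(rewards))  # the two should be close
```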
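The paper compares RL² against algorithms with decades of theoretical backing, including the Gittins index, Thompson sampling, UCB1, and ε-greedy, alongside a simple greedy baseline.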
Simulated RL² agent vs. Greedy on a 5-arm bandit: both strategies play 50 rounds on freshly sampled arm probabilities. Notice how RL² explores systematically before committing.
RL² trails the Gittins index in the hardest setting (k=50 arms, n=500 episodes). To check whether this is an architecture limitation or an optimization limitation, the authors trained the same RNN architecture via supervised learning on trajectories from the Gittins index. The supervised policy matched Gittins performance, which indicates that the architecture has enough capacity and that the bottleneck is the RL optimization in the outer loop.
Bandits are stateless — they test exploration vs. exploitation but not sequential decision making. The next test: randomly generated tabular MDPs with |S|=10 states, |A|=5 actions, horizon T=10 per episode.
Transition probabilities are sampled from a flat Dirichlet distribution. Rewards are Gaussian with unit variance, means sampled from Normal(1,1). This matches the standard prior used in Bayesian RL methods, giving those methods their best-case advantage.
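A random tabular MDP from this distribution can be generated with the standard library alone. In the sketch below, a flat Dirichlet sample is obtained by normalizing independent Exp(1) draws. The function names are illustrative, and having one reward mean per (state, action) pair is an assumption of this sketch.

```python
import random

def flat_dirichlet(n, rng):
    """Sample from Dirichlet(1, ..., 1) by normalizing Exp(1) draws."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_tabular_mdp(n_states=10, n_actions=5, rng=None):
    rng = rng or random.Random()
    # P[s][a] is a probability distribution over next states.
    P = [[flat_dirichlet(n_states, rng) for _ in range(n_actions)]
         for _ in range(n_states)]
    # Mean reward for each (s, a), drawn from Normal(1, 1).
    R_mean = [[rng.gauss(1.0, 1.0) for _ in range(n_actions)]
              for _ in range(n_states)]
    return P, R_mean

def step(P, R_mean, s, a, rng):
    """Sample the next state and a unit-variance Gaussian reward."""
    next_s = rng.choices(range(len(P)), weights=P[s][a])[0]
    reward = rng.gauss(R_mean[s][a], 1.0)
    return next_s, reward

P, R_mean = sample_tabular_mdp(rng=random.Random(0))
print(step(P, R_mean, s=0, a=0, rng=random.Random(1)))
```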
As n grows (50, 75, 100 episodes), the Bayesian methods catch up and eventually match or exceed RL². With more data, thorough exploration pays off, and the Bayesian algorithms' theoretical guarantees kick in. The RL optimization in the outer loop also becomes harder with longer trials.
The previous experiments used tiny state spaces. Can RL² scale to high-dimensional observations? The paper tests on a vision-based maze navigation task using the ViZDoom engine.
The agent sees the maze from a first-person view (raw pixels). A target block (red) is placed somewhere in a randomly generated 5×5 maze. Rewards: +1 for reaching the target, −0.001 for hitting walls, −0.04 per timestep. The agent gets 2 episodes per trial (same maze, same target), each up to 250 steps.
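The reward structure can be written as a small function. Treating the three terms as additive within a single step is an assumption of this sketch, as are the function and argument names.

```python
def maze_reward(reached_target, hit_wall):
    """Per-step reward; summing the three terms is an assumption of this sketch."""
    reward = -0.04                # time penalty on every step
    if hit_wall:
        reward += -0.001          # small penalty for bumping into a wall
    if reached_target:
        reward += 1.0             # bonus for reaching the target
    return reward

# A 40-step episode with no wall hits that reaches the target on the last step.
episode_return = sum(maze_reward(False, False) for _ in range(39)) + maze_reward(True, False)
print(round(episode_return, 3))   # -0.6
```

Under these numbers the time penalty dominates, so shorter paths to the target translate directly into higher return.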
The agent achieves 99.3% success rate in episode 1 (finding the target via exploration) and 99.6% in episode 2. Crucially, average trajectory length drops from 52.4 steps in episode 1 to 39.1 in episode 2 — the agent is taking more direct paths because it remembers where the target is.
Even more impressive: the agent generalizes to 9×9 mazes it was never trained on, achieving 97.1% success. And it maintains performance across 5 episodes (not just the 2 it was trained with).
Simulated maze navigation on a randomly generated layout. Episode 1: the agent explores. Episode 2: it takes a more direct path.
The behavior is not always perfect. Sometimes the agent "forgets" the target location and continues exploring in episode 2. The authors observe this happens occasionally and attribute it to the difficulty of the outer-loop RL optimization. Better outer-loop algorithms should reduce these failures.
The most fascinating question: what algorithm does the RNN actually implement? We can't inspect the hidden state directly (it's a high-dimensional, distributed representation), but we can infer the algorithm from the agent's behavior.
In the bandit setting, the RNN's behavior closely matches Thompson sampling — the Bayesian strategy of sampling from the posterior distribution over arm parameters and playing the best arm in the sample. The RNN appears to maintain an implicit posterior in its hidden state, updating it as rewards are observed.
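For comparison, explicit Thompson sampling on a Bernoulli bandit looks like this. It is the textbook algorithm with a uniform Beta(1, 1) prior, not an extraction of what the RNN computes; the function names are illustrative.

```python
import random

def thompson_step(successes, failures, rng):
    """Sample each arm's Beta posterior and play the arm with the largest sample."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run_thompson(probs, n_pulls, rng):
    """Play n_pulls rounds on a bandit with the given true success probabilities."""
    k = len(probs)
    successes, failures = [0] * k, [0] * k
    for _ in range(n_pulls):
        arm = thompson_step(successes, failures, rng)
        if rng.random() < probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

s, f = run_thompson([0.1, 0.9], n_pulls=500, rng=random.Random(0))
pulls = [s[i] + f[i] for i in range(2)]
print(pulls)  # the 0.9 arm should receive the large majority of pulls
```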
When the agent receives a reward from an arm, its future behavior shifts as if it has updated a belief distribution. High rewards on an arm increase the probability of pulling that arm again, but with diminishing effect: each additional sample of the same arm shifts behavior less, consistent with Bayesian updating in which the posterior narrows.
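The narrowing can be made concrete with an explicit Beta posterior. The sketch below computes the posterior standard deviation of an arm's success probability under a uniform prior; it illustrates exact Bayesian updating and makes no claim about what the hidden state stores.

```python
def beta_posterior_std(successes, failures):
    """Std of Beta(successes + 1, failures + 1), the posterior under a uniform prior."""
    a, b = successes + 1, failures + 1
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return variance ** 0.5

# Same 80% empirical success rate, increasing amounts of data.
for s, f in [(0, 0), (8, 2), (80, 20)]:
    print(s + f, round(beta_posterior_std(s, f), 3))
# 0 0.289
# 10 0.12
# 100 0.04
```

Going from 0 to 10 observations shrinks the uncertainty far more than going from 10 to 100 does per sample, which is the diminishing effect of additional samples.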
The agent preferentially explores actions it is uncertain about. Arms that haven't been pulled receive more exploration, especially in early episodes. This mirrors UCB-style optimism in the face of uncertainty — but the RNN wasn't told to do this. It discovered this strategy through optimization pressure.
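The hand-designed counterpart of this behavior is the UCB1 index: the empirical mean plus a bonus that shrinks as an arm is pulled more often. The sketch below is the textbook formula, shown only to make "optimism in the face of uncertainty" concrete.

```python
import math

def ucb1_index(mean_reward, pulls, total_pulls):
    """UCB1 score: empirical mean plus an exploration bonus."""
    if pulls == 0:
        return float("inf")       # untried arms are always played first
    return mean_reward + math.sqrt(2.0 * math.log(total_pulls) / pulls)

# Same empirical mean, different amounts of evidence.
print(ucb1_index(0.5, pulls=2, total_pulls=10))   # larger: the arm is uncertain
print(ucb1_index(0.5, pulls=8, total_pulls=10))   # smaller: the arm is well known
```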
Schematic of how the RNN's hidden state implements implicit posterior updating. After each reward, the effective "belief" over arm quality narrows. Blue = prior/wide uncertainty, teal = posterior after observations.
Bayesian RL (Strens, 2000; Ghavamzadeh et al., 2015): The theoretical framework for incorporating prior knowledge into RL. RL² can be seen as an approximate, scalable implementation of Bayesian RL — the RNN implicitly performs Bayesian inference in its hidden state.
Learning to learn (Thrun & Pratt, 1998): The broader meta-learning program. RL² instantiates this idea in RL by encoding the learning algorithm in a neural network.
Memory-Augmented Neural Networks (Santoro et al., 2016): Used external memory for meta-learning in supervised settings. RL² achieves similar goals using only the RNN's internal hidden state.
MAML (Finn et al., 2017): Model-Agnostic Meta-Learning takes a different approach — learn an initialization that can be quickly fine-tuned with a few gradient steps. MAML does explicit gradient updates in the inner loop; RL² does implicit adaptation via forward dynamics. Both are meta-learning, but with fundamentally different inner-loop mechanisms.
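The contrast between the two inner loops can be written schematically. The step size α, the task loss L_T, and the recurrent update function f_θ are notation introduced here for illustration.

```latex
% MAML inner loop: an explicit gradient step changes the weights
\theta' = \theta - \alpha \, \nabla_{\theta} \mathcal{L}_{\mathcal{T}}(\theta)

% RL^2 inner loop: the weights stay fixed, only the hidden state changes
h_{t+1} = f_{\theta}\big(h_t, \; \phi(s, a, r, d)\big)
```

In the first case adaptation lives in θ'; in the second it lives entirely in the hidden state h.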
In-Context Learning in LLMs (Brown et al., 2020): GPT-3's ability to learn from examples in its context window is strikingly similar to RL². A transformer is trained on many tasks (outer loop); at inference, it adapts to new tasks via its forward pass (inner loop) without weight updates. RL² arguably foreshadowed this phenomenon — both show that a sequence model trained on many tasks can implement a learning algorithm in its activations.
Decision Transformers (Chen et al., 2021): Formulate RL as sequence modeling, feeding (return-to-go, state, action) tokens to a transformer. This extends the RL² philosophy: instead of an RNN, use a transformer as the "learning algorithm."