General Intelligence Requires Rethinking Exploration

Chapter 0: The Data Problem

Imagine you're building a virtual assistant that can shop for you, answer questions by searching the web, hail a cab, and recommend movies. You train it on a massive conversational corpus — billions of tokens of dialogue from real people. You use the best transformer architecture, the latest optimizer, the most careful hyperparameter tuning.

And it works. For a while.

Then new products launch, cultural references shift, websites redesign their interfaces, and the world moves on. Your assistant, trained on a static snapshot of the world, grows stale. It doesn't know about the new streaming service everyone's talking about. It tries to navigate a website that no longer exists. It gives advice based on last year's prices.

The standard response is: "collect more data, retrain." But which data? From where? How do you know what's missing? How do you avoid just reinforcing the same biases your model already has?

The thesis of this paper: The central bottleneck in AI is no longer how to train models — it's what data to train them on. Both supervised learning and reinforcement learning are fundamentally limited by static data. Achieving general intelligence requires a new kind of exploration that goes beyond any single dataset or simulator.

This claim might feel surprising. After all, the past decade's progress has been dominated by model innovation — transformers, diffusion models, RLHF. But the authors argue this focus on model design represents a systematic underinvestment in exploration — the process of collecting or creating the data these models learn from.

To make this concrete, let's see what happens when you train on a fixed dataset and the world changes.

The Staleness Problem

Interactive demo: the assistant is trained in January, and its knowledge decays month by month as the world evolves.

The deeper problem isn't just that the world changes — it's that any finite dataset is necessarily incomplete. The set of all facts about the world is infinite. The set of all tasks a general agent might face is infinite. No matter how large your training corpus, there will always be gaps — and those gaps are exactly where your model fails.

The paper introduces a key framing: Increasingly General Intelligence (IGI). Rather than trying to define "AGI" as some fixed endpoint, they define it as a relative property: Model A is more general than Model B if A succeeds on more tasks while matching B's performance everywhere else. An IGI is a system that keeps getting more general over time. The question becomes: what kind of learning process produces an IGI?

Increasingly General Intelligence: We don't need to define what "general" means in absolute terms. We just need a system that keeps getting better at more things. If your model solves 100 tasks today and 110 tomorrow (without losing any), it's an IGI. The question is: what mechanism drives this continual expansion of capability?
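One way to read this relative definition is as a comparison of per-task results. The sketch below is a simplification, assuming success on each task is a plain yes/no flag; the function name `is_more_general` and the task names are illustrative and do not come from the paper.

```python
def is_more_general(results_a, results_b):
    """Return True if A succeeds on every task B succeeds on, plus at least one more.

    `results_a` and `results_b` map task name -> bool (did the model succeed?).
    Tasks missing from a dict are treated as failures.
    """
    tasks = set(results_a) | set(results_b)
    solved_a = {t for t in tasks if results_a.get(t, False)}
    solved_b = {t for t in tasks if results_b.get(t, False)}
    # Strict superset: no regressions and at least one newly solved task.
    return solved_b < solved_a


# Hypothetical snapshots of the same assistant on three tasks.
yesterday = {"shop": True, "search": True, "hail_cab": False}
today = {"shop": True, "search": True, "hail_cab": True}
regressed = {"shop": False, "search": True, "hail_cab": True}

print(is_more_general(today, yesterday))
print(is_more_general(regressed, yesterday))
```

Under this reading, a model that gains a task but loses another is not counted as more general, which matches the "without losing any" condition in the text.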

The authors' answer: exploration. Not exploration within a single game or environment, but exploration across the entire space of possible data. This is "generalized exploration" — and current AI systems don't do it.

What is the paper's central claim about the bottleneck preventing more general AI?

We need better model architectures
The bottleneck has shifted from how to train models to what data to train them on; both SL and RL are limited by static data
We need more compute

Chapter 1: The Exploration Gap

To understand why the data problem is so fundamental, we need to see exactly how both supervised learning and reinforcement learning hit the same wall — just from different directions.

The Limits of Supervised Learning

SL learns a function f mapping inputs to outputs from a training set D_train. The model minimizes loss on this fixed dataset. The defining assumption: the training data is collected once and never changes.

This gives SL two unavoidable weaknesses:

Incompleteness. No finite dataset captures all relevant information. There are always missing facts, unrepresented scenarios, edge cases that never appeared in the training corpus. A language model trained on English-language web data has gaping holes in low-resource languages, specialized domains, and anything that wasn't commonly discussed online.

Stationarity. A static dataset is, by definition, frozen in time. But the real world is non-stationary: culture shifts, technology changes, new concepts emerge. A model trained on 2022 data doesn't know about 2023's events. Worse, it doesn't know what it doesn't know.

Think about it this way: Scaling laws tell us that test loss decreases with more data. But scaling laws only hold within the distribution of the training data. They say nothing about performance on data from a different distribution — which is exactly the data your deployed model will encounter tomorrow.
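A toy regression can illustrate the point. In this sketch (an assumed setup, not an experiment from the paper), a straight line is fit to noisy samples of y = x² drawn from [0, 1], then evaluated both on fresh points from the same range and on points from [2, 3]. The helper names `fit_line`, `mse`, and `sample` are made up for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sample(rng, lo, hi, n):
    """Draw n points with x uniform in [lo, hi] and y = x^2 plus small noise."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [x * x + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys


rng = random.Random(0)
test_in = sample(rng, 0.0, 1.0, 500)   # same range as the training data
test_out = sample(rng, 2.0, 3.0, 500)  # shifted range, never seen in training

for n in (5, 50, 5000):
    model = fit_line(*sample(rng, 0.0, 1.0, n))
    print(n, round(mse(model, *test_in), 4), round(mse(model, *test_out), 2))
```

In a setup like this, the in-range error stays small once the fit has enough points, while the shifted-range error stays large no matter how many in-range points are added.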

The Limits of Reinforcement Learning

RL looks like it should solve this problem. After all, RL agents explore — they take actions specifically to discover new information. The exploration-exploitation tradeoff is a central concern of the field.

But here's the catch: RL exploration happens inside a simulator, and the simulator is finite.

Consider training a web navigation agent. You build a simulator of websites. The agent explores different click sequences, discovers pages, learns navigation strategies. But the simulator only contains the websites you programmed into it. It can't represent next year's website redesigns, new interaction paradigms, or services that don't exist yet.

The simulator in RL plays the same role as the training dataset in SL: both define a fixed distribution D over training data. RL exploration just samples from this distribution more cleverly — it doesn't expand it.

SL vs. RL: The Same Wall

Interactive demo: both SL and RL are bounded by their data source, and switching between the two views shows each hitting the same fundamental limit.

This is the key insight that makes the paper's argument click: exploration in RL, as currently practiced, doesn't actually explore the full data space. It only explores within the confines of a pre-built simulator. When we deploy the agent in the real world, it faces a "sim2real gap" — the mismatch between simulation and reality. This is the same problem as training an SL model on a static dataset and deploying it on out-of-distribution data.

The parallel: A static SL dataset is to supervised learning what a static simulator is to RL. Both are finite data sources that cannot represent the full space of possible experiences. Both produce models that are fundamentally brittle outside their training distribution.

The paper even notes that the situation may be worse for RL: we have scaling laws showing that SL test loss improves with more data, but no equivalent result for RL transfer to new tasks. Static simulators may impose an even more severe data limitation than large curated datasets.

Why does standard RL exploration NOT solve the data limitation problem?

RL agents don't explore enough
RL is too sample-inefficient
RL exploration only samples more cleverly from a fixed simulator; it doesn't expand the set of available training data

Chapter 2: Two Loops

Now we have a problem: both SL and RL are trapped inside fixed data distributions. What would a solution look like?

The paper proposes separating learning into two nested processes — an inner loop and an outer loop — that together can break free from static data.

Inner Loop: Prioritized Training

The inner loop is what we already know how to do: given a fixed set of training data, learn from it as efficiently as possible. This includes:

In SL: Active learning (choose which data points to label next), curriculum learning (present examples in an order that maximizes learning speed), hard example mining
In RL: Exploration within the simulator (ε-greedy, curiosity-driven exploration, count-based methods), prioritized experience replay (sample transitions that are most informative)

Both are forms of prioritized sampling: choosing which data points from the existing pool to train on, in what order, and with what frequency. They make training more efficient, but they can't introduce new information that isn't already in the data source.
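A minimal sketch of prioritized sampling, loosely in the style of prioritized experience replay: each item in the existing pool gets a priority (assumed here to be a recent loss value), and items are drawn with probability proportional to that priority raised to an exponent `alpha`. The pool contents, the loss values, and the default `alpha` are assumptions for illustration, and the importance-sampling correction that replay methods usually pair with this is left out.

```python
import random


def sampling_probs(priorities, alpha=0.6):
    """Turn per-item priorities into sampling probabilities.

    alpha = 0 gives uniform sampling; alpha = 1 samples in proportion to priority.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_batch(pool, priorities, batch_size, rng, alpha=0.6):
    """Draw a training batch from the existing pool, favoring high-priority items."""
    probs = sampling_probs(priorities, alpha)
    return rng.choices(pool, weights=probs, k=batch_size)


pool = ["easy_example", "medium_example", "hard_example"]
losses = [0.01, 0.5, 2.0]  # hypothetical per-item losses used as priorities
batch = sample_batch(pool, losses, batch_size=8, rng=random.Random(0))
print(batch)
```

Whatever the priorities are, every sampled item comes from `pool`: the sampler changes how often items are seen, not which items exist.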

Outer Loop: Active Collection

The outer loop is the missing piece: a process that expands the training data itself. It searches beyond the current dataset or simulator to find new data that would be most informative for improving the model.

Active collection goes beyond just "collect more data." It must be deliberate — seeking data that addresses specific weaknesses in the current model, fills gaps in coverage, and stays relevant to real tasks of interest.

The crucial distinction: Prioritized training (inner loop) asks "which of my existing data should I train on next?" Active collection (outer loop) asks "what new data should I go find?" The first optimizes within a fixed distribution. The second expands the distribution itself.

The Two-Loop Framework

Interactive demo: the inner loop performs prioritized training on existing data, while the outer loop performs active collection and expands the data pool.

In the paper's Figure 1, this framework is applied to both SL and RL:

SL outer loop: Online or offline collection of new labeled data — crawling the web for new examples, collecting user interaction data, generating synthetic data with generative models.

RL outer loop: Finding new simulator settings (or even new simulators entirely) that present challenges the agent can't yet solve. This is where environment design enters the picture.

The paper calls this combined process generalized exploration: the outer loop actively collects new data while the inner loop efficiently trains on it. When the data space is unbounded, this becomes open-ended exploration, and a model trained on such a stream performs open-ended learning.

Why Both Loops Matter

Without the inner loop, you waste data — training on uninformative examples while ignoring the most useful ones. Without the outer loop, you stagnate — no matter how clever your sampling strategy, you can't learn what isn't in your data.

A system with both loops can, in principle, continually improve: the outer loop finds new challenges, the inner loop masters them, and the cycle repeats. This is the mechanism for producing an IGI.
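The interplay of the two loops can be sketched with a deliberately crude toy, in which "tasks" are integers, the "model" is just the set of tasks it has mastered, and the "world" is a fixed range of 100 tasks. All of these are stand-ins chosen for illustration; the function names `inner_loop`, `outer_loop`, and `run` are hypothetical.

```python
import random


def inner_loop(known, pool, budget):
    """Prioritized training (crude form): spend the budget only on pool items
    the model has not mastered yet."""
    unmastered = [x for x in sorted(pool) if x not in known]
    known.update(unmastered[:budget])


def outer_loop(known, pool, world, budget, rng):
    """Active collection: add items from outside the pool that the model
    cannot handle yet."""
    candidates = [x for x in sorted(world) if x not in known and x not in pool]
    pool.update(rng.sample(candidates, min(budget, len(candidates))))


def run(rounds, use_outer, seed=0):
    """Return how many tasks the model has mastered after the given rounds."""
    rng = random.Random(seed)
    world = set(range(100))  # every task that exists (toy stand-in)
    pool = set(range(10))    # the static starting dataset
    known = set()
    for _ in range(rounds):
        inner_loop(known, pool, budget=5)
        if use_outer:
            outer_loop(known, pool, world, budget=5, rng=rng)
    return len(known)


print(run(10, use_outer=False), run(10, use_outer=True))
```

With only the inner loop, the count of mastered tasks can never exceed the size of the starting pool; with both loops it keeps growing for as long as the outer loop finds new items.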

What is the key difference between the inner loop (prioritized training) and the outer loop (active collection)?

The inner loop optimizes which existing data to train on; the outer loop expands the data pool itself by finding or creating new data
The inner loop is for SL and the outer loop is for RL
The inner loop runs faster than the outer loop

Chapter 3: The Bootstrap Trap

Couldn't you just skip the complicated active collection business and let the model learn from its own deployment interactions? After all, a deployed model generates lots of data. Just retrain on it.

This is exactly what many production systems do. And it leads to a trap.

The Feedback Loop

In online learning, the model's predictions influence the data it sees next. This creates a causal loop:

1. The model makes predictions based on its current parameters θ
2. These predictions change the environment (users click on recommendations, navigate suggested routes, interact with suggested content)
3. The model collects data from these changed interactions
4. The model retrains on this data, updating θ
5. Go to step 1

When does this loop stop changing? When the model's parameters produce predictions that generate data that, when trained on, reproduce the same parameters. This is a fixed point:

θ* = argmin_θ E_{Z ~ D(θ*)} [L(Z, θ)]

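Here D(θ) is the distribution of data that results from deploying the model with parameters θ, Z is a sample from it, and L is the training loss. At this fixed point, the model fully determines its own training distribution, and that distribution no longer challenges the model. Learning stops.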
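A small numerical sketch of such a fixed point, under strong simplifying assumptions that are not from the paper: the model is a single number θ, deploying θ shifts the mean of the data to `base_mean + influence * theta`, and the loss is squared error, so each retraining step simply sets θ to the mean of the data it induced. The names `retrain` and `find_fixed_point` are hypothetical.

```python
def retrain(theta, base_mean=1.0, influence=0.5):
    """One round of deploy-then-retrain in the toy setting.

    Deploying `theta` shifts the data mean to base_mean + influence * theta.
    With squared-error loss (z - theta)^2, the expected-loss minimizer is the
    mean of the data distribution, so that mean is the retrained parameter.
    """
    return base_mean + influence * theta


def find_fixed_point(theta0=0.0, tol=1e-10, max_rounds=1000):
    """Repeat retraining until the parameter stops changing."""
    theta = theta0
    for rounds in range(1, max_rounds + 1):
        new_theta = retrain(theta)
        if abs(new_theta - theta) < tol:
            return new_theta, rounds
        theta = new_theta
    return theta, max_rounds


theta_star, rounds = find_fixed_point()
print(theta_star, rounds)
```

In this toy the loop settles at base_mean / (1 - influence), which equals 2 for the defaults used here. Once there, retraining on the data the model induces returns the same parameter, so further rounds teach it nothing new.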

The bootstrap problem: When a model trains on data generated by its own predictions, it can lock into a local optimum where its biases become self-reinforcing. The model stops encountering challenging data because its own behavior ensures it only sees data it already handles well.

A Concrete Example

Your virtual assistant shops for you online. It learns that Amazon has good deals, so it visits Amazon more often. Its online training data becomes dominated by Amazon pages. It gets even better at navigating Amazon, so it visits Amazon even more. Eventually, it never visits any other website — even when much better deals exist elsewhere.

This isn't hypothetical. The paper cites real-world examples: content recommendation systems that amplify polarization, AI-powered lending systems that reinforce socioeconomic disparities. In each case, the model's predictions shape its own training data, locking it into a narrow equilibrium.
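The shopping example can be reproduced with a short deterministic simulation. The site values below are invented, and the "active collection" rule (periodically visit the least-visited site) is only a crude stand-in for the deliberate collection the text describes; `run_agent` is a hypothetical helper.

```python
def run_agent(true_value, steps, collect_every=None):
    """Simulate an agent that keeps a running value estimate per website.

    The greedy choice uses only the agent's own estimates. If `collect_every`
    is set, every collect_every-th step visits the least-visited site instead.
    """
    n = len(true_value)
    estimate = [0.0] * n
    visits = [0] * n
    for step in range(1, steps + 1):
        if collect_every and step % collect_every == 0:
            site = min(range(n), key=lambda i: visits[i])
        else:
            site = max(range(n), key=lambda i: estimate[i])
        visits[site] += 1
        # Only the visited site's estimate is updated, so the agent never
        # learns about sites its own behavior keeps it away from.
        estimate[site] += (true_value[site] - estimate[site]) / visits[site]
    return visits, estimate


deal_quality = [0.5, 0.6, 0.9, 0.4, 0.7]  # hypothetical; site 2 is actually best
trapped_visits, _ = run_agent(deal_quality, steps=100)
open_visits, _ = run_agent(deal_quality, steps=100, collect_every=5)
print(trapped_visits)
print(open_visits)
```

The purely greedy agent spends all 100 steps on the first site it tried, because that is the only site with a nonzero estimate. With periodic collection it samples every site and ends up spending most of its visits on the best one.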

The Bootstrap Trap in Action

Interactive demo: an agent explores 5 websites and gets trapped in a self-reinforcing loop; enabling active collection shows how deliberate exploration breaks the trap.

This problem has been studied under different names in different fields: one-sided learning in contextual bandits, strategic classification in game theory, and performative prediction in the ML theory literature. The paper brings these together under a single framing.

Why Standard Exploration Isn't Enough

Standard RL exploration techniques (ε-greedy, curiosity, count-based methods) can help avoid some local optima within a fixed environment. But they can't fix the bootstrap problem when the environment itself is shaped by the agent's behavior. And they certainly can't help when the relevant information isn't in the simulator at all.

Active collection — the outer loop — is the solution: it brings in data from outside the model's current influence, data that challenges the model's biases precisely because it wasn't generated by the model's own predictions.

What is the bootstrap problem in online learning?

The model's predictions shape its own training data, creating a feedback loop that locks it into a local optimum where learning stagnates
The model is too large to train efficiently
The model forgets old data when learning new data

Chapter 4: Exploration in a Static Environment

Before we can rethink exploration, we need to understand how it currently works. Let's look at exploration within a single, fixed MDP — the setting that RL has studied for decades.

Exploitation vs. Exploration Policies

In a fixed environment, the agent maintains two strategies:

The exploitation policy takes the best-known action in each state — the action that maximizes expected return based on what the agent has learned so far.

The exploration policy deliberately takes actions that generate novel data — states the agent hasn't seen before, transitions that look different from past experience. The goal is to collect information that could lead to discovering better strategies.

Measuring Novelty

The exploration policy maximizes some intrinsic reward I(s, a) — a signal that reflects how "interesting" a transition is. Common approaches:

Count-based: States visited fewer times are more novel. If you've visited state s 100 times but state s' only twice, s' gets a higher intrinsic reward. In practice, exact counting doesn't scale to large state spaces, so we use approximations — pseudo-counts, hash-based counts, or density models.

Prediction error: If a forward prediction model can't accurately predict the outcome of a transition, that transition is novel. The prediction error serves as a proxy for novelty. Random Network Distillation (RND) is a popular variant: train a predictor network to match the output of a fixed, randomly initialized target network on visited states, and use the prediction error as intrinsic reward.

Epistemic uncertainty: Train an ensemble of models and measure their disagreement. High disagreement means the models are uncertain about what happens in that region of state space — so it's worth exploring.
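As a concrete instance of the count-based idea, the sketch below uses a bonus of 1 / sqrt(N(s)), a common choice but an assumption here rather than something the paper prescribes. The class name `CountBonus` is hypothetical, and exact counting like this only works for small, discrete state spaces.

```python
import math
from collections import Counter


class CountBonus:
    """Count-based novelty: the bonus shrinks as a state is visited more often."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.counts = Counter()

    def visit(self, state):
        """Record a visit and return the intrinsic reward for that visit."""
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])


bonus = CountBonus()
for _ in range(100):
    r_s = bonus.visit("s")         # state s, visited 100 times
for _ in range(2):
    r_s_prime = bonus.visit("s'")  # state s', visited twice
print(r_s, r_s_prime)
```

After these visits, the rarely seen state s' earns a much larger bonus than the heavily visited state s, which is what steers the exploration policy toward it.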

The stochastic trap problem: Naive novelty-seeking can get stuck on states with inherent randomness. A coin flip always produces "surprising" outcomes, but watching more coin flips teaches nothing. Robust methods must separate epistemic uncertainty (what we don't know yet but could learn) from aleatoric uncertainty (irreducible randomness). The paper notes that evaluating novelty over a batch of trajectories can implicitly remove aleatoric noise.
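The coin-flip case can be simulated to show why the two kinds of uncertainty behave differently. In this sketch each ensemble member is just the average of its own batch of flips, which is an assumption made for brevity; `novelty_signals` is a hypothetical helper.

```python
import random


def coin_flips(rng, n):
    """n flips of a fair coin, encoded as 0 or 1."""
    return [rng.randint(0, 1) for _ in range(n)]


def fit_mean(data):
    """A 'model' of the coin: the average outcome in its training data."""
    return sum(data) / len(data)


def novelty_signals(rng, n_per_member, members=10, n_eval=2000):
    """Return (prediction_error, ensemble_disagreement) for a fair coin."""
    preds = [fit_mean(coin_flips(rng, n_per_member)) for _ in range(members)]
    mean_pred = sum(preds) / members
    # Prediction error on fresh flips: driven by the coin's own randomness.
    fresh = coin_flips(rng, n_eval)
    pred_error = sum((x - mean_pred) ** 2 for x in fresh) / n_eval
    # Disagreement between members: driven by how little each member has seen.
    disagreement = sum((p - mean_pred) ** 2 for p in preds) / members
    return pred_error, disagreement


rng = random.Random(0)
err_few, dis_few = novelty_signals(rng, n_per_member=3)
err_many, dis_many = novelty_signals(rng, n_per_member=5000)
print(err_few, dis_few)
print(err_many, dis_many)
```

The prediction error stays close to 0.25 however many flips are observed, so an agent rewarded for it would keep watching the coin. The ensemble disagreement shrinks toward zero as the members see more data, which is why it is the better signal of what is still learnable.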

Combining Exploration and Exploitation

There are two main approaches to balancing exploration and exploitation:

Separate policies: A "behavior policy" collects data through exploration, while a "target policy" is trained off-policy on this data, for example with importance-sampling corrections. This cleanly separates concerns, at the cost of requiring off-policy correction.

Single policy: One policy maximizes a weighted sum of extrinsic return (task reward) and intrinsic return (novelty bonus). As exploration reduces uncertainty, the intrinsic reward naturally decays toward zero, leaving an approximately pure exploitation policy. Methods like RND and ICM use this approach.

Exploration Strategies in a Grid World

Interactive demo: an agent explores a 10×10 grid, comparing how coverage evolves over time under random, count-based, and prediction-error exploration.

The Fundamental Limitation

All of these methods share a critical property: they only explore within the state space defined by the simulator. Count-based exploration can eventually visit every state in the grid, but it can't discover states that aren't in the grid. Prediction-error exploration finds surprising transitions, but only transitions that the environment can produce.

For our virtual assistant, these methods might help it discover all possible navigation paths within a simulated website. But they can't help it handle websites that aren't in the simulation. The exploration space is bounded by the environment definition.

Why can't standard RL exploration methods (ε-greedy, curiosity, count-based) address the data limitation problem?

They only explore within the states and transitions defined by the current simulator; they can't discover or create new environments
They are too slow to converge
They require too much memory

Chapter 5: Exploration in Environment Space

If exploring within one environment isn't enough, what if we explore across many environments? This is where the paper gets to its most powerful idea.

Parameterized Environments

Recent RL research has moved from training in single environments to training across distributions of environments. Instead of one maze, you have a parameterized maze generator. Instead of one website, you have a space of possible website configurations.

The simplest approach is domain randomization: randomly vary the environment parameters during training. This can produce surprisingly robust policies. But random sampling wastes training time on environments that are too easy (the agent already solves them) or too hard (the agent can't learn anything from them).

Curriculum Games

The paper highlights Unsupervised Environment Design (UED), which frames environment generation as a game between a teacher (which proposes environment configurations) and a student (which tries to solve them). The teacher's goal is to generate environments that maximize the student's learning potential.

What makes an environment have high learning potential? Three criteria:

Criteria for High Learning Potential

Criterion	Meaning	If Violated
Improvability	The agent doesn't fully succeed — there is room to improve	No learning signal; the agent already solved it
Learnability	The agent can efficiently learn to improve	Agent wastes time on impossibly hard environments
Consistency	Solutions are consistent across configurations	Learning one environment harms performance on another

This mirrors Vygotsky's Zone of Proximal Development from developmental psychology: children learn best when challenged with tasks just beyond their current ability — not too easy, not impossibly hard.

The curriculum game analogy: Imagine a tutor designing problems for a student. The best tutor doesn't give problems the student already aces (no learning) or problems far beyond their level (no progress). They find problems at the frontier of the student's abilities — where struggle leads to growth. This is exactly what the teacher does in a curriculum game.
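One simple way to turn "not too easy, not too hard" into a number is to score an environment by p × (1 − p), where p is the agent's current success rate. This proxy is an assumption made for illustration, not the paper's definition of learning potential, and it only touches improvability and learnability, not consistency. The environment names and success rates are invented.

```python
def learning_potential(success_rate):
    """Assumed proxy: p * (1 - p), highest when the agent succeeds about half the time."""
    return success_rate * (1.0 - success_rate)


def pick_frontier(success_by_env):
    """Teacher step: choose the environment with the highest proxy score."""
    return max(success_by_env, key=lambda env: learning_potential(success_by_env[env]))


# Hypothetical success rates of the current student in three mazes.
success_by_env = {
    "easy_maze": 0.95,
    "medium_maze": 0.40,
    "near_impossible_maze": 0.001,
}
chosen = pick_frontier(success_by_env)
print(chosen)
```

Under this proxy, both the maze the student already aces and the one it almost never solves score near zero, so the teacher picks the one in between.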

When the curriculum game is zero-sum, with the teacher receiving the agent's regret (the gap between the optimal performance and the agent's performance) as its reward and the student receiving the negative of it, then at Nash equilibrium the student plays the minimax regret policy. This is a policy that minimizes the worst-case performance gap across all environments the teacher can propose, which is the sense in which it is provably robust.
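For a finite set of candidate policies and environments, the minimax regret choice can be found by brute force. The sketch below makes two simplifying assumptions that the general theory does not: only pure (non-randomized) policies are considered, and the "optimal" return in each environment is taken to be the best return among the listed candidates. The return values are invented.

```python
def regret_table(returns):
    """Map returns[policy][env] to regret[policy][env].

    Regret is the best return achieved in that environment by any listed
    policy, minus this policy's return.
    """
    envs = next(iter(returns.values())).keys()
    best = {e: max(r[e] for r in returns.values()) for e in envs}
    return {p: {e: best[e] - r[e] for e in envs} for p, r in returns.items()}


def minimax_regret_policy(returns):
    """Pick the policy whose worst-case regret is smallest, assuming the
    teacher always proposes that policy's worst environment."""
    regret = regret_table(returns)
    return min(regret, key=lambda p: max(regret[p].values()))


# Hypothetical returns for three candidate policies in three mazes.
returns = {
    "specialist": {"maze_a": 1.0, "maze_b": 0.1, "maze_c": 0.2},
    "generalist": {"maze_a": 0.8, "maze_b": 0.7, "maze_c": 0.6},
    "cautious": {"maze_a": 0.5, "maze_b": 0.8, "maze_c": 0.7},
}
print(minimax_regret_policy(returns))
```

The specialist is best in one maze but has a large regret elsewhere, so an adversarial teacher would keep proposing those other mazes. The policy that wins is the one with no badly exposed environment.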

The Curriculum Game

Interactive demo: a teacher generates maze environments for a student agent, comparing how the agent's capability grows under domain randomization versus an adaptive curriculum (UED).

Still Not Enough

Curriculum games are a major step forward — the agent now explores across environments, not just within one. But even parameterized environments are still bounded: the parameters only modify a fixed simulator. The maze generator can create different mazes, but it can't create something that isn't a maze.

For true generality, we need to explore the full space of possible environments — not just configurations within one simulator. This is the final leap the paper proposes.

What three criteria must an environment satisfy to have high learning potential for the agent?

Improvability (room to improve), learnability (agent can learn from it), consistency (solutions don't conflict across environments)
Large state space, dense rewards, deterministic transitions
High difficulty, many opponents, random initial conditions

Chapter 6: The Open-Ended Criterion

Now we reach the paper's core contribution: a formal criterion for open-ended exploration that unifies everything we've discussed into a single framework.

Searching the Space of All MDPs

The paper defines a search process G(π) that takes the current policy and returns a distribution over MDPs — programs implementing decision processes. This search aims to find MDPs that maximize learning potential for the current agent:

m* = argmax_{m ∈ M} C(m, π)

where C(m, π) measures how much the agent can learn from environment m. Simple enough. But if we just maximize learning potential, three problems arise:

Problem 1: Malformed programs. Most random programs aren't valid MDPs. A naive generative search would waste astronomical compute generating broken environments.

Problem 2: Irrelevant environments. Even valid MDPs might have nothing to do with the tasks we care about. An agent that masters random cellular automata isn't any better at web navigation.

Problem 3: Narrow focus. The search might fixate on a small region of environment space, missing entirely different types of useful challenges.

The Full Criterion: Equation 4

To address all three problems, the paper augments the search criterion with diversity and grounding terms:

m* = argmax_{m ∈ M} [C(m, π) + α_D Σ_{m_i ∈ M̂} Δ_D(m, m_i) − Σ_{m_k ∈ M} α_k · Δ_G(m, m_k)]

Let's unpack each term:

The Three Components of Open-Ended Exploration

Term	Symbol	Role	Intuition
Learning Potential	C(m, π)	Find environments where the agent can improve	"Is this challenge useful for growth?"
Diversity	α_D Σ Δ_D(m, m_i)	Encourage spread across environment space	"Is this challenge different from what we've already found?"
Grounding	−Σ α_k Δ_G(m, m_k)	Keep search near real tasks of interest	"Is this challenge relevant to what we actually care about?"

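Here M̂ is a queue of the best solutions found so far, and {m_k} is a set of seed MDPs representing our target tasks (written M in the grounding sum, and not to be confused with the search space M under the argmax). Δ_D and Δ_G are distance functions, potentially different, since diversity and grounding may use different notions of "distance". The α weights control the relative importance of each term: α_D scales the diversity bonus, and each α_k weights the grounding penalty for seed m_k.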

The three-way balance: Learning potential alone produces environments that are informative but possibly irrelevant (left panel of Figure 3 in the paper). Adding diversity finds more environments, including both relevant and irrelevant ones (middle panel). Adding grounding focuses the search on environments resembling real tasks (right panel). You need all three.
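A minimal sketch of how the three terms combine, assuming environments can be embedded as points in a small vector space and that both distance functions are plain Euclidean distance. Neither assumption comes from the paper; the learning potential values are simply given rather than computed, and `criterion` and `best_candidate` are hypothetical names.

```python
import math


def criterion(c, m, archive, seeds, alpha_d, alpha_k, dist=math.dist):
    """Score of candidate m: learning potential + diversity bonus - grounding penalty.

    c        learning potential C(m, pi) of the candidate (given, not computed here)
    archive  previously found solutions (the queue M-hat)
    seeds    seed tasks the search should stay close to
    alpha_k  one grounding weight per seed
    """
    diversity = alpha_d * sum(dist(m, m_i) for m_i in archive)
    grounding = sum(a * dist(m, m_k) for a, m_k in zip(alpha_k, seeds))
    return c + diversity - grounding


def best_candidate(candidates, archive, seeds, alpha_d, alpha_k):
    """candidates maps a point (toy environment embedding) to its learning potential."""
    return max(
        candidates,
        key=lambda m: criterion(candidates[m], m, archive, seeds, alpha_d, alpha_k),
    )


# Toy 2D task space; all coordinates and scores are made up for illustration.
candidates = {(0.2, 0.1): 0.5, (0.9, 0.9): 0.6}
archive = [(0.0, 0.0)]
seeds = [(0.1, 0.2)]

print(best_candidate(candidates, archive, seeds, alpha_d=0.5, alpha_k=[0.0]))
print(best_candidate(candidates, archive, seeds, alpha_d=0.5, alpha_k=[2.0]))
```

With the grounding weight at zero, the search prefers the distant candidate because it is both slightly more informative and far from the archive. With a strong grounding weight, the candidate near the seed task wins instead.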

The Open-Ended Exploration Criterion

Interactive demo: a search over a 2D task space in which the weights α_D (diversity) and α_G (grounding) control which environments are discovered. Bright regions have high learning potential, and pink stars mark real tasks of interest.

Connection to Quality-Diversity

If you squint at Equation 4, it looks a lot like quality-diversity (QD) optimization from evolutionary computing. QD algorithms find a diverse set of high-quality solutions, where quality is measured within distinct subspaces of the solution space. MAP-Elites is a famous example.

But there's a crucial difference: in QD, the quality measure is fixed. In open-ended exploration, the learning potential C(m, π) changes as the agent improves. An environment that offered great learning potential yesterday might offer none today, because the agent mastered it. The authors coin the term learning-diversity to distinguish this non-stationary variant from standard QD.

Worked Example: What Numbers Look Like

Let's make this concrete. Suppose we have three candidate environments and an agent with current skill level 0.6 on a [0, 1] scale:

# Environment A: easy maze (agent solves it 95% of the time)
C(A, π) = 0.05   # low learning potential — already mastered

# Environment B: medium maze (agent solves it 40% of the time)
C(B, π) = 0.60   # high learning potential — in the ZPD

# Environment C: near-impossible maze (agent solves it 0.1% of the time)
C(C, π) = 0.01   # low learning potential — too hard to learn from

# With diversity and grounding (seed task: web navigation)
# B is a web navigation maze: Δ_G(B, seed) = 0.1 (close to target)
# B is far from previous solutions: Δ_D(B, archive) = 0.8

score(B) = 0.60 + 0.5 × 0.8 - 0.5 × 0.1 = 0.95  # high: useful, diverse, relevant

This is why the criterion works: it finds environments at the frontier of the agent's abilities that are also diverse and relevant. As the agent improves, the frontier shifts, and the search naturally finds harder challenges.

Why does the open-ended exploration criterion need a "grounding" term in addition to learning potential and diversity?

To make the optimization convex
Without grounding, the search can discover valid but irrelevant environments that have nothing to do with the tasks we care about
Grounding reduces the computational cost

Chapter 7: From RL to SL

We've built up the open-ended exploration criterion in the context of RL — finding new MDPs for the agent to train in. But the paper's title says "general intelligence," not "general RL." How does this apply to supervised learning?

Datasets as Single-Step MDPs

Here's the unifying insight: a supervised learning dataset is just a special case of an MDP. Specifically, a single-step MDP with no dynamics.

Take a dataset D = {(x_i, y_i)}. We can recast it as an MDP where:

SL Dataset as a Single-Step MDP

MDP Component	SL Interpretation
States S	The set of inputs {x_i}
Actions A	The output space Y
Transition T	Terminates immediately (single step)
Reward R	−L(f(x), y), the negated loss
Initial state p	Uniform distribution over inputs

The optimal policy for this MDP is exactly the model that minimizes empirical risk on D. So supervised learning is RL in a single-step MDP! This isn't just a cute equivalence — it lets us apply the entire open-ended exploration framework to SL.
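The mapping in the table can be written as a tiny environment class. This is a sketch of the correspondence, not code from the paper: the `reset`/`step` interface is modeled loosely on common RL environment APIs, and the class name `DatasetMDP`, the dataset, and the two example policies are all hypothetical.

```python
import random


class DatasetMDP:
    """Single-step MDP view of a supervised dataset of (x, y) pairs."""

    def __init__(self, data, loss, seed=0):
        self.data = list(data)
        self.loss = loss
        self.rng = random.Random(seed)
        self._current = None

    def reset(self):
        """Initial state: an input x drawn uniformly from the dataset."""
        self._current = self.rng.choice(self.data)
        return self._current[0]

    def step(self, action):
        """The action is a prediction; the reward is the negated loss; the episode ends."""
        _, y = self._current
        return -self.loss(action, y), True


def expected_return(policy, data, loss):
    """Average one-step return of a policy, i.e. the negated empirical risk."""
    return sum(-loss(policy(x), y) for x, y in data) / len(data)


def squared(pred, y):
    return (pred - y) ** 2


def good(x):
    return 2.0 * x  # fits the toy data exactly


def bad(x):
    return 0.0  # ignores the input


data = [(0, 0.0), (1, 2.0), (2, 4.0)]
env = DatasetMDP(data, squared)
x = env.reset()
reward, done = env.step(good(x))
print(reward, done, expected_return(bad, data, squared))
```

Because every episode ends after one step, maximizing expected return in this environment is the same problem as minimizing average loss over the dataset.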

The unification: Once you see SL datasets as MDPs, the same Equation 4 applies. Active collection in SL means searching for new data points that maximize learning potential (points where the model is wrong), diversity (points that are different from existing training data), and grounding (points that resemble real tasks). This is a generalization of active learning and curriculum learning.

What This Means in Practice

Prioritized training in SL is exploring within a single-step MDP: choosing which training examples to present, in what order. Active learning and curriculum learning are specific implementations. They correspond to specific choices of the learning potential term C(m, π).

Active collection in SL is exploring across single-step MDPs: finding or generating new data points that aren't in the current training set. Data augmentation, synthetic data generation, and web crawling for new examples are all forms of this.

Generative models as simulators: In RL, the simulator generates training trajectories. In SL, a generative model (diffusion model, language model) can play the same role — generating synthetic training data. This data can be grounded to real data (the seed MDPs M) while being explored for diversity and learning potential.

The SL-RL Correspondence

Interactive demo: the same concepts (prioritized training, active collection, generative data sources) apply in both the SL and RL views.

Concrete Example: Training Our Virtual Assistant

The paper walks through how open-ended learning might work for the virtual assistant:

Initial training: SL on a conversational corpus + RL in a web navigation simulator + RLHF for friendly, helpful responses. This is what we already know how to do.

Active online collection: Deploy the assistant. Curate interactions where it performed poorly (low user engagement, failed tasks). This is data from the model's weak spots — high learning potential.

Active offline collection: Identify specific weaknesses (e.g., websites where navigation fails). Target data collection for those domains via crowdsourcing, human-in-the-loop takeovers, or custom webcrawlers.

Synthetic data generation: Use generative models grounded to the real data to amplify rare scenarios — unusual website layouts, uncommon request types. This is active collection in the generative model's latent space.

Iterated retraining: Retrain on the newly collected data, which is adversarial in the sense that it targets the model's weak spots. The agent improves. Repeat the cycle. Over time, it becomes capable across an increasing number of domains.

How does the paper unify exploration across SL and RL?

By showing that an SL dataset is a single-step MDP, so the same exploration criterion (learning potential + diversity + grounding) applies to both settings
By converting all RL problems into SL problems
By using the same neural network architecture for both

Chapter 8: Software Squared

The paper's argument isn't just about RL or even ML in general. It's about a fundamental paradigm shift in how we build intelligent systems.

Three Paradigms

The authors frame the evolution of computing as a series of shifts in what gets designed manually vs. what gets learned:

Software 1.0: Humans directly write the solution in code. You design the algorithm, implement it, debug it. The solution is explicit and handcrafted.

Software 2.0 (Karpathy's term): Humans design an objective function (loss function + training data), and a neural network learns the solution. You don't write the algorithm — you write the specification, and optimization finds the program. The shift: from designing solutions to designing objectives.

Software Squared (this paper): Humans design the exploration criterion — the process that generates training data — and the model learns from whatever data this process produces. You don't even design the training data anymore. You design the meta-process that creates it. The shift: from designing objectives to designing data-generation processes.

The Three Paradigms of Software

Paradigm	What Humans Design	What Is Learned	Key Skill
Software 1.0	The solution (code)	Nothing	Programming
Software 2.0	The objective (loss + data)	The solution	Loss engineering + data curation
Software²	The exploration criterion	The data + the solution	Exploration design

The prediction: Just as Software 2.0 changed how we build systems (from writing algorithms to designing training setups), Software Squared will change it again: from curating datasets to designing self-adapting data-generation processes. The research frontier shifts from "how do we train on this data?" to "how do we find the right data?"

Evidence This Is Already Happening

The paper was written in 2022. Looking at what's happened since, the data-centric prediction has been remarkably prescient:

RLHF and DPO: The explosion of preference learning is precisely about finding better training data — human preferences that target model weaknesses
Synthetic data: Models training on data generated by other models (and by themselves) — active collection via generative models
Constitutional AI: Designing principles (the exploration criterion) rather than specific training examples
Self-play and debate: Models generating their own challenging training data — the inner and outer loops co-evolving
Test-time compute scaling: Even at inference time, search (a form of exploration) improves performance

Open Problems the Paper Identifies

The paper doesn't pretend to have solved open-ended exploration. It identifies eight key open problems:

What domain should we study open-ended learning in? Minecraft? The internet? The real world?
How do we build scalable data generators? A generative model that continually invents new tasks while compressing old ones
How should agents interface with open-ended task spaces? Adaptable input/output representations, tool invention
How do we measure open-ended learning? Novelty metrics ≠ capability metrics
How do we determine what data to collect next? Non-stationary search in latent space
How do we scalably augment training data? Generative models as parameterized data spaces
How much grounding should we use? Too much = narrow; too little = irrelevant
How do we safely collect data online? Active collection can distort distributions and create harmful biases

What does "Software Squared" mean in the paper's framing?

Writing software that writes other software
Using two neural networks
Designing the data-generation process rather than the data itself: the shift from curating datasets to designing self-adapting exploration criteria

Chapter 9: Connections

Alternative Paths to General Intelligence

The paper positions open-ended exploration against other proposed routes to AGI:

AIXI (Hutter, 2007): The theoretically optimal agent — maximizes performance across all computable MDPs, weighted by Kolmogorov complexity. Provably optimal but non-computable. The paper argues this top-down approach offers theoretical insight but no practical path forward. Open-ended exploration is a bottom-up alternative.

Gödel Machines (Schmidhuber): A self-improving program that searches for provably better versions of itself. Elegant but requires brute-force proof search — computationally intractable. Open-ended exploration is more practical because it adapts the search to the agent's current capabilities.

POWERPLAY (Schmidhuber): Finds the simplest modifications to a program that make it more general. The paper's framework generalizes this by adding diversity and grounding terms and using data-driven active collection rather than pure program search.

AI-Generating Algorithms (Clune, 2019): An evolutionary process that co-evolves problems and solvers. Open-ended exploration aims for a single generalist agent rather than a population of specialists.
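To make the AIXI entry's complexity weighting concrete, here is a toy calculation in the style of the Legg and Hutter measure, a sum of 2^(−K) times the agent's value in each environment. Kolmogorov complexity is not computable, so the bit counts and value scores below are made-up numbers chosen purely for illustration.

```python
def universal_intelligence(environments):
    """Sum of 2^(-K) * V over (complexity_bits, value) pairs."""
    return sum(2.0 ** (-k) * v for k, v in environments)


# (hypothetical complexity in bits, hypothetical agent value in [0, 1])
environments = [(1, 0.9), (3, 0.5), (10, 0.1)]

upsilon = universal_intelligence(environments)

# Fraction of the total score contributed by the simplest environment.
simplest_share = (2.0 ** -1 * 0.9) / upsilon
print(upsilon, simplest_share)
```

Even in this three-environment toy, the simplest environment contributes most of the total, which shows how strongly the weighting favors simple environments. It also hints at why the definition is hard to use directly: the real sum ranges over all computable environments and needs a quantity that cannot be computed.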

Key Equations Cheat Sheet

Paper Equations at a Glance

Equation	Name	What It Says
J(θ) = E_π[Σ_t γ^t r_t]	RL Objective	Maximize expected discounted return
θ* = argmin_θ E_{Z∼D(θ*)}[L(Z, θ)]	Performative Fixed Point	Bootstrap trap: model determines its own data
m* = argmax_m C(m, π)	Basic Search	Find environments with highest learning potential
m* = argmax_m [C + α_D Σ Δ_D − Σ_k α_k Δ_G]	Open-Ended Criterion (Eq. 4)	Learning potential + diversity − distance from seed tasks
Υ = Σ_μ 2^(−K(μ)) V_μ^π	Universal Intelligence (Legg & Hutter)	AIXI's measure: sum over all MDPs, weighted by simplicity
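The performative fixed point row can be illustrated with a small simulation of repeated retraining. The setup is an assumption made for this sketch and is not taken from the paper: deploying a model with parameter θ shifts the mean of the data to `base_mean + feedback * theta`, and the model is retrained under squared loss, whose minimizer is the mean of the current data distribution.

```python
def data_mean(theta, base_mean=1.0, feedback=0.5):
    """Mean of the data distribution D(theta) induced by deploying theta (assumed linear)."""
    return base_mean + feedback * theta


def retrain(theta):
    """Minimize E[(Z - theta')^2] under D(theta); the minimizer is the mean of D(theta)."""
    return data_mean(theta)


def repeated_retraining(theta0=0.0, steps=50):
    """Deploy, observe the induced data, retrain, and repeat."""
    trajectory = [theta0]
    for _ in range(steps):
        trajectory.append(retrain(trajectory[-1]))
    return trajectory


trajectory = repeated_retraining()
theta_star = trajectory[-1]
print(trajectory[:4], theta_star)
```

With these assumed numbers the iterates settle at `base_mean / (1 - feedback) = 2.0`, a point where retraining on the data the model itself induces returns the same model. That value differs from the mean of 1.0 the data would have if the model had no influence on it, which is the bootstrap trap in miniature.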

Related Lessons

Policy Gradient methods — the inner loop of RL training that this paper's outer loop wraps around
RLHF / DPO — preference learning as a specific form of active collection from human feedback
Reward Learning — learning what to optimize, complementary to learning what data to train on
Sim-to-Real Transfer — the domain gap that motivates exploring beyond fixed simulators
RL² / Meta-RL — learning to learn, which can be viewed as learning to explore

Why This Paper Matters (in 2026)

Written in November 2022, the same month ChatGPT launched, this paper was remarkably prescient. The subsequent explosion of work on RLHF, constitutional AI, synthetic data, self-play, and test-time compute represents a move toward the data-centric paradigm the authors predicted. The core insight, that the bottleneck is data rather than models, has only become more relevant as model architectures converge and training data becomes the key differentiator between frontier labs.

The paper's framework gives us a language for thinking about these developments: every new training trick is either improving the inner loop (better sampling from existing data) or the outer loop (finding new data). The open-ended exploration criterion tells us what good data looks like: high learning potential, diverse, and grounded to real tasks.
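The three properties of good data can be combined into a single score, in the spirit of the open-ended criterion: learning potential, plus a diversity bonus, minus a penalty for drifting from seed tasks. The sketch below is a minimal illustration under assumptions: the `Candidate` class, the two-dimensional `features`, the Euclidean `distance`, and the weights `alpha_div` and `alpha_ground` are hypothetical choices, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    features: tuple            # hypothetical embedding of a task or environment
    learning_potential: float  # assumed estimate of how much the agent could improve


def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def score(c, replay, seeds, alpha_div=0.5, alpha_ground=0.5):
    """Learning potential + diversity bonus - grounding penalty."""
    diversity = sum(distance(c.features, r) for r in replay)
    grounding_gap = sum(distance(c.features, s) for s in seeds)
    return c.learning_potential + alpha_div * diversity - alpha_ground * grounding_gap


def select(candidates, replay, seeds, **weights):
    """Return the highest-scoring candidate."""
    return max(candidates, key=lambda c: score(c, replay, seeds, **weights))


replay = [(1.0, 0.0)]  # tasks already trained on
seeds = [(0.0, 0.0)]   # real tasks we care about
candidates = [
    Candidate("seen_before", (1.0, 0.0), 0.1),
    Candidate("nearby_new", (0.0, 1.0), 0.6),
    Candidate("far_out", (10.0, 10.0), 0.9),
]

print(select(candidates, replay, seeds).name)
print(select(candidates, replay, seeds, alpha_ground=0.0).name)
```

With the grounding term active, the novel but relevant task wins. Setting `alpha_ground` to zero lets the distant task win on novelty alone, which is the "too little grounding = irrelevant" failure mode listed among the open problems.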

The meta-lesson: The history of AI progress looks like a sequence of paradigm shifts: from hand-coded rules → learned features → learned architectures → learned objectives → learned data-generation processes. Each shift moves one more layer of the system from "human-designed" to "learned." Open-ended exploration is the next layer.

What distinguishes the paper's open-ended exploration approach from AIXI's top-down definition of general intelligence?

AIXI is non-computable and defines optimality over all possible MDPs with a fixed prior; open-ended exploration is a practical, bottom-up process that incrementally expands capabilities through active data collection adapted to the agent's current state
AIXI uses neural networks while open-ended exploration uses evolutionary algorithms
There is no difference; they solve the same problem