A hierarchy of environment classes — from Bernoulli schemes to POMDPs — and which ones admit self-optimising agents.
In the previous chapter, we showed that AIXI is self-optimising in any class of environments where self-optimisation is possible. But what exactly are these classes? Where does AIXI succeed, and where are the limits?
This chapter creates a taxonomy: a hierarchy of environment classes, from the simplest (coin flips) to the most general (arbitrary computable processes). We prove which ones admit self-optimising agents. The diversity of classes where AIXI learns optimally adds weight to the claim that AIXI is super intelligent.
The simplest possible environment: each perception is independent and identically distributed (i.i.d.). What you observe now has nothing to do with what you observed before or what actions you took. Think of rolling a die forever.
Bernoulli schemes are trivially passive — the agent's actions have no effect. They are the simplest environments, but they establish the foundation: even here, a learning agent must figure out the distribution from observations.
A natural generalisation: let the next observation depend on the previous observation. This gives us a Markov chain.
More general still: a Totally Passive Environment lets the next observation depend on the full history of observations (but not actions). And a Passive Environment allows the environment to freely set rewards based on actions, while keeping future observations independent of actions.
A special case worth naming: Sequence Prediction Problems, where the agent is rewarded for correctly predicting a sequence it cannot influence. This is precisely the setting of Solomonoff induction.
In passive environments, the agent is a spectator. Active environments let the agent affect what happens next. This is where intelligence becomes meaningful.
The simplest active environment is a Bandit: the next perception depends only on the last action. Think of a row of slot machines (the "multi-armed bandit"). Pull an arm, get a reward. The next reward depends on which arm you pull, not on any history.
A Markov Decision Process (MDP) is a Bernoulli scheme for the first cycle, then for all subsequent cycles the perception depends only on the previous observation and the current action. This is the workhorse of reinforcement learning.
MDPs generalise Bernoulli schemes, bandits, and Markov chains. nth order MDPs allow the next state to depend on the last n observations and actions. Any nth order MDP can be reduced to a first-order MDP by enlarging the state space.
A Partially Observable MDP (POMDP) adds an observation function: the agent sees a noisy version of the true state. POMDPs are much harder than MDPs because the agent must infer the hidden state from partial observations.
POMDPs can model non-stationary MDPs (by hiding a time-varying parameter) and encompass almost all environments in this chapter, including all practical applications. They are also the most general class that can be expressed with Markov-like structure.
Many AI problems fit neatly into the taxonomy:
Function maximisation: The agent's actions are inputs to a function, and the observation is the function value. Reward is given for the best value found so far. This covers optimisation problems.
Repeated strategic games: Chess, Go, card games — any game played repeatedly. Each match is an episode of length l. The environment repeats the game structure in each episode.
Classification: The environment presents examples (attribute vectors) from some distribution and asks the agent to guess the class. Reward for correct guesses. This covers supervised learning.
| Problem Class | Environment Type |
|---|---|
| Function maximisation | Active (not MDP) |
| Repeated strategic game | Reducible to ergodic MDP |
| Classification | Passive MDP (ergodic) |
| Sequence prediction | Passive environment |
A Markov chain is ergodic if the current observation cannot impose any long-term constraints on future observations. Formally, it must be irreducible (every observation can reach every other) and aperiodic (no fixed cycling pattern).
The fix: add a small probability of self-transition. If state A has a 0.1 chance of staying in A, the deterministic pattern breaks and the chain becomes ergodic.
For MDPs, ergodicity means: there exists some agent such that the resulting Markov chain is ergodic. In an ergodic MDP, no matter what mistakes the agent makes, there is always a way to recover. This is crucial for learning: the agent can explore freely without permanently ruining its chances.
Now the payoff: we prove that many of our environment classes are actually ergodic MDPs (or reducible to them), and therefore admit self-optimising agents.
Bernoulli schemes are trivially ergodic MDPs. Observations don't depend on history or actions, so every observation is always accessible.
Classification problems are passive MDPs. The data distribution is independent of history, and the observation depends only on the current cycle. Ergodic.
Bandits are ergodic MDPs. Every arm is always available, so no action permanently constrains the future.
Repeated strategic games are reducible to ergodic MDPs. Each episode can be viewed as a single "super-action" in a bandit-like framework.
Sequence prediction is the interesting exception. No agent can be self-optimising over all computable sequences, because for any predictor there exists a sequence designed to fool it. However, Solomonoff's predictor has bounded total error, so a universal agent over computable sequences is self-optimising in a weaker sense.
This chapter provides the foundation for claiming that AIXI is genuinely intelligent in the sense of Chapter 1: it performs well in a wide range of environments. Next, Chapter 4 will turn this observation into a formal measure of intelligence.