Legg, Chapter 3

Taxonomy of Environments

A hierarchy of environment classes — from Bernoulli schemes to POMDPs — and which ones admit self-optimising agents.

Prerequisites: Chapter 2 (agent-environment model, AIXI).
9
Chapters
2
Simulations
9
Quizzes

Chapter 0: Why Classify?

In the previous chapter, we showed that AIXI is self-optimising in any class of environments where self-optimisation is possible. But what exactly are these classes? Where does AIXI succeed, and where are the limits?

This chapter creates a taxonomy: a hierarchy of environment classes, from the simplest (coin flips) to the most general (arbitrary computable processes). We prove which ones admit self-optimising agents. The diversity of classes where AIXI learns optimally adds weight to the claim that AIXI is super intelligent.

The punchline: Many well-known environment classes — Bernoulli schemes, classification problems, bandits, repeated games, and all ergodic MDPs — are either ergodic MDPs themselves or reducible to them. Ergodic MDPs admit self-optimising agents. Therefore, AIXI learns to behave optimally in all of these settings.
Check: Why do we need to classify environments?

Chapter 1: Bernoulli Schemes

The simplest possible environment: each perception is independent and identically distributed (i.i.d.). What you observe now has nothing to do with what you observed before or what actions you took. Think of rolling a die forever.

Example: A 6-sided die is thrown repeatedly. The agent receives reward 1 when a 6 appears, 0 otherwise. There are no actions to take. Formally, A = {ε}, O = {1,2,3,4,5,6}, and μ(xk) = 1/6 for each outcome. The measure is completely independent of history.

Bernoulli schemes are trivially passive — the agent's actions have no effect. They are the simplest environments, but they establish the foundation: even here, a learning agent must figure out the distribution from observations.

Check: What defines a Bernoulli scheme?

Chapter 2: Markov Chains

A natural generalisation: let the next observation depend on the previous observation. This gives us a Markov chain.

Example — Ring game: A board with 20 cells arranged in a ring. A pebble starts at cell 1. Each turn, a die determines how many cells the pebble moves clockwise. Reward 1 when landing on cell 5 or 15. The next state depends only on the current state and the die roll — not on any earlier history.

More general still: a Totally Passive Environment lets the next observation depend on the full history of observations (but not actions). And a Passive Environment allows the environment to freely set rewards based on actions, while keeping future observations independent of actions.

A special case worth naming: Sequence Prediction Problems, where the agent is rewarded for correctly predicting a sequence it cannot influence. This is precisely the setting of Solomonoff induction.

Check: How does a Markov chain differ from a Bernoulli scheme?

Chapter 3: Active Environments

In passive environments, the agent is a spectator. Active environments let the agent affect what happens next. This is where intelligence becomes meaningful.

The simplest active environment is a Bandit: the next perception depends only on the last action. Think of a row of slot machines (the "multi-armed bandit"). Pull an arm, get a reward. The next reward depends on which arm you pull, not on any history.

Bandit example: n levers, each with a different but fixed probability of reward. The agent must figure out which lever is best by trying them. This requires balancing exploration (trying different levers) and exploitation (pulling the best known lever). Even this simple setting is surprisingly hard to solve optimally.
Check: What distinguishes active from passive environments?

Chapter 4: MDPs and POMDPs

A Markov Decision Process (MDP) is a Bernoulli scheme for the first cycle, then for all subsequent cycles the perception depends only on the previous observation and the current action. This is the workhorse of reinforcement learning.

MDPs generalise Bernoulli schemes, bandits, and Markov chains. nth order MDPs allow the next state to depend on the last n observations and actions. Any nth order MDP can be reduced to a first-order MDP by enlarging the state space.

A Partially Observable MDP (POMDP) adds an observation function: the agent sees a noisy version of the true state. POMDPs are much harder than MDPs because the agent must infer the hidden state from partial observations.

POMDPs can model non-stationary MDPs (by hiding a time-varying parameter) and encompass almost all environments in this chapter, including all practical applications. They are also the most general class that can be expressed with Markov-like structure.

Check: How does a POMDP differ from a standard MDP?

Chapter 5: Problem Classes

Many AI problems fit neatly into the taxonomy:

Function maximisation: The agent's actions are inputs to a function, and the observation is the function value. Reward is given for the best value found so far. This covers optimisation problems.

Repeated strategic games: Chess, Go, card games — any game played repeatedly. Each match is an episode of length l. The environment repeats the game structure in each episode.

Classification: The environment presents examples (attribute vectors) from some distribution and asks the agent to guess the class. Reward for correct guesses. This covers supervised learning.

Problem ClassEnvironment Type
Function maximisationActive (not MDP)
Repeated strategic gameReducible to ergodic MDP
ClassificationPassive MDP (ergodic)
Sequence predictionPassive environment
Check: In the environment taxonomy, what kind of environment is a classification problem?

Chapter 6: Ergodicity

A Markov chain is ergodic if the current observation cannot impose any long-term constraints on future observations. Formally, it must be irreducible (every observation can reach every other) and aperiodic (no fixed cycling pattern).

Non-ergodic example (Heaven and Hell): Two states, A and B. Starting in A, the chain transitions to B or C. Once in B or C, it stays there forever. This is not ergodic because the first transition permanently constrains the future.
Non-ergodic example (Periodic): Two states, A and B, that always alternate: A → B → A → B. Even though both states are visited, the pattern is deterministic — knowing the current state tells you all future states at even/odd times. Not ergodic.

The fix: add a small probability of self-transition. If state A has a 0.1 chance of staying in A, the deterministic pattern breaks and the chain becomes ergodic.

For MDPs, ergodicity means: there exists some agent such that the resulting Markov chain is ergodic. In an ergodic MDP, no matter what mistakes the agent makes, there is always a way to recover. This is crucial for learning: the agent can explore freely without permanently ruining its chances.

The key theorem: Ergodic MDPs admit self-optimising agents. This means that in any ergodic MDP, it is possible for an agent to converge to optimal performance. And by AIXI's self-optimising property, AIXI will converge to optimal in these environments too.
Check: Why is ergodicity important for learning agents?

Chapter 7: Self-Optimising Agents

Now the payoff: we prove that many of our environment classes are actually ergodic MDPs (or reducible to them), and therefore admit self-optimising agents.

Bernoulli schemes are trivially ergodic MDPs. Observations don't depend on history or actions, so every observation is always accessible.

Classification problems are passive MDPs. The data distribution is independent of history, and the observation depends only on the current cycle. Ergodic.

Bandits are ergodic MDPs. Every arm is always available, so no action permanently constrains the future.

Repeated strategic games are reducible to ergodic MDPs. Each episode can be viewed as a single "super-action" in a bandit-like framework.

Sequence prediction is the interesting exception. No agent can be self-optimising over all computable sequences, because for any predictor there exists a sequence designed to fool it. However, Solomonoff's predictor has bounded total error, so a universal agent over computable sequences is self-optimising in a weaker sense.

The hierarchy: The taxonomy diagram shows that the most general environments are at the top (Chronological Environments, POMDPs), and the most concrete at the bottom (Bernoulli schemes, bandits). All the concrete classes at the bottom admit self-optimising agents. This means AIXI learns optimally in all of them.
Check: Which environment class is the notable exception that does NOT admit a fully self-optimising agent?

Chapter 8: Summary

Chronological Environments
Most general: any computable process
↓ specialise
POMDPs ⊃ MDPs ⊃ Bandits ⊃ Bernoulli
Increasingly structured, increasingly tractable
↓ key property
Ergodic MDPs
All concrete classes reduce here → self-optimisation possible
↓ therefore
AIXI learns optimally
In bandits, games, classification, and all ergodic MDPs

This chapter provides the foundation for claiming that AIXI is genuinely intelligent in the sense of Chapter 1: it performs well in a wide range of environments. Next, Chapter 4 will turn this observation into a formal measure of intelligence.

Check: What is the key insight that connects the taxonomy to AIXI's intelligence?