Legg, Chapter 2

Universal Artificial Intelligence

From Bayes' rule to Solomonoff induction to AIXI: building the most powerful theoretical agent possible.

Prerequisites: Chapter 1. Basic probability. Turing machines (helpful but not required).
11
Chapters
3
Simulations
11
Quizzes

Chapter 0: The Dream

Imagine an agent that can learn to perform optimally in any computable environment. Give it a chess board, it learns to play chess. Put it in a stock market, it learns to trade. Drop it in a maze, it finds the exit. Not because it was programmed for any of these tasks, but because it has a single, universal learning algorithm powerful enough to handle anything.

This is the dream of universal artificial intelligence. It sounds impossible, but among theoretical models free of computational resource limits, such agents can be defined and analysed. The most powerful of these is AIXI, created by Marcus Hutter in the late 1990s.

The roadmap: We will build up to AIXI step by step. First, we need the theory of inductive inference (how to learn from data). Then Bayes' rule (how to update beliefs). Then Solomonoff's prior (how to assign probabilities to hypotheses using compression). Then the agent-environment model (how to formalise goal-seeking). Finally, AIXI itself: plug the universal prior into the optimal agent equation.

The catch? AIXI requires infinite computational resources. It is not a practical algorithm. But it serves as a theoretical gold standard — the upper bound of what any agent could possibly achieve. Understanding AIXI tells us what intelligence is, even if we cannot build it.

Check: Why is AIXI important even though it cannot be computed?

Chapter 1: Inductive Inference

Inductive inference is the process of observing the world and inferring the causes behind what you see. Science is largely inductive inference: observe particles in a gas chamber, notice patterns, infer physical laws. Watch stock prices, build a model, predict the market.

Formally: an agent has observed data D = x1, x2, ... xt and has a set of hypotheses H = h1, h2, ... that might explain the data. The task is to find which hypothesis best reflects the unknown process generating D.

Consider the sequence: 1, 3, 5, 7. What comes next? An obvious hypothesis: these are positive odd numbers, so 9. But another hypothesis fits too: the equation 2n - 1 + (n-1)(n-2)(n-3)(n-4), which gives 33 for n=5. Both are consistent with the data.

Epicurus' principle of multiple explanations: Keep all hypotheses that are consistent with the data. You cannot logically rule any out.

But intuitively, the simple hypothesis (odd numbers) seems much more likely. The philosophical principle behind this was stated by William of Ockham (1285-1349): do not include in the explanation anything not strictly required by the observations.

Occam's razor: Among all hypotheses consistent with the observations, the simplest is the most likely.

These two principles — Epicurus (keep all consistent hypotheses) and Occam (prefer simpler ones) — will be formalised mathematically in what follows. Epicurus tells us which hypotheses to consider; Occam tells us how to weight them.

Check: What do Epicurus' principle and Occam's razor together tell us about inductive inference?

Chapter 2: Bayes' Rule

To go from philosophy to mathematics, we need a way to compute the probability of each hypothesis given the data. This is Bayes' rule.

P(h|D) = P(D|h) P(h) / P(D)

Where:

SymbolNameMeaning
P(h|D)PosteriorProbability of hypothesis h after seeing data D
P(D|h)LikelihoodProbability of seeing data D if hypothesis h is true
P(h)PriorProbability of h before seeing any data
P(D)EvidenceProbability of the data under all hypotheses

Bayes' rule takes beliefs about the world (the prior), updates them with observed data (the likelihood), and produces refined beliefs (the posterior). It is the mathematically correct way to update beliefs.

The prior P(h) is controversial. How can you assign probability to a hypothesis before seeing any data? Bayesians respond: with enough data, the posterior depends almost entirely on the data and very little on the prior. In practice, the choice of prior usually doesn't matter much for large datasets.

Key insight: The deep problem is choosing the prior. Ideally, we would solve this once and for all with a universal prior. Epicurus says P(h) > 0 for all consistent hypotheses. Occam says P(h) should decrease with the complexity of h. We need a way to measure complexity. That is exactly what Kolmogorov complexity will provide.
Check: In Bayes' rule, what does the prior P(h) represent?

Chapter 3: Binary Sequence Prediction

Inductive inference can be cast as binary sequence prediction: there is an unknown distribution μ over binary sequences. A sequence ω is drawn from μ one bit at a time. Given ω1:t (the first t bits), predict ωt+1.

This is useful because binary sequences can encode anything: medical data, music, weather, stock prices. And it connects directly to computability theory.

Given a set of possible models H and a prior P over H, the Bayes mixture model predictor weights each model's prediction by how confident we are in that model:

P(ω1:t1) = ∑ν∈H P(ν|ω1:t) ν(ω1:t1)

This can be shown to reduce to the conditional probability from the mixture prior over sequences. Elegant: the unknown prior over hypotheses and the prior over sequences are two perspectives on the same problem.

Example — coin flipping: Suppose we have three models: always tails (p=0), fair coin (p=0.5), or always heads (p=1). With uniform prior (1/3 each), after seeing 4 heads in a row, Bayes' rule gives P(fair) = 1/17 and P(all heads) = 16/17. The mixture model predicts the next flip is heads with probability 33/34. The data rapidly concentrates belief on the correct model.
Check: Why is binary sequence prediction a useful framing for inductive inference?

Chapter 4: Kolmogorov Complexity

In the 1960s, Ray Solomonoff asked: can we build a universal prior over binary sequences? One that works for any prediction problem, without domain-specific knowledge?

His key idea: the prior probability that a sequence begins with string x should be the probability that a universal Turing machine running a randomly generated program outputs a sequence starting with x.

M(x) := ∑p : U(p) = x* 2-ℓ(p)

Where U is a prefix universal Turing machine, p ranges over all programs, and ℓ(p) is the length of p in bits. The 2-ℓ(p) term comes from each bit being chosen by a fair coin flip.

Consider a C program that prints all 1s: main(){while(1)printf("1");}. This is very short, so a randomly generated program has relatively high probability of producing it. A program that outputs the digits of π would be at least ten times longer. Thus, the sequence 111... gets much higher prior probability than 314159... This naturally implements Occam's razor!

Kolmogorov complexity: The complexity of a sequence ω is the length of the shortest program p that produces it: K(ω) := min { ℓ(p) : U(p) = ω }. Simple sequences have short programs and thus high prior probability. Complex sequences need long programs and get low prior probability. Occam's razor, formalised.

A crucial property: Kolmogorov complexity depends on the choice of universal Turing machine U only up to a constant. If we switch to a different machine U', the complexity K changes by at most a bounded amount. This invariance property makes K a fundamental, objective measure of complexity.

The biggest problem: K is not computable. We can never determine the exact shortest program for a given sequence (by the halting problem). But we can approximate it from above, and for theoretical analysis this is sufficient.

Check: What does Kolmogorov complexity measure?

Chapter 5: The Universal Prior

Solomonoff also suggested a slightly different approach: instead of summing over programs, take a mixture over the set of all enumerable semi-measures.

A semi-measure is like a probability distribution that doesn't quite add up — it is allowed to sum to less than 1. The set of all enumerable semi-measures, Me, contains all computable probability measures and more. We define the Solomonoff-Levin prior as:

ξ(x) := ∑ν ∈ Me PMe(ν) ν(x)

where PMe(ν) := 2-K(ν) is the algorithmic prior probability of each semi-measure. Simple models (short programs) get high prior probability. Complex models get low prior probability. This is Occam's razor made precise.

Key property — Dominance: The Solomonoff-Levin prior ξ is dominant over all enumerable semi-measures. For any ν in Me there exists a constant cν such that ξ(x) ≥ cν ν(x) for all strings x. In other words, ξ assigns probability at least proportional to every computable model. It never completely dismisses any hypothesis. Epicurus satisfied.

A fundamental theorem: the Solomonoff prior M and the Solomonoff-Levin prior ξ lie within a multiplicative constant of each other. For theoretical purposes, the differences between them are unimportant — both are universal priors with the key property of dominance.

Check: What does it mean for the universal prior to be "dominant"?

Chapter 6: Solomonoff Induction

With the universal prior ξ in hand, prediction becomes straightforward. Given an observed initial string ω1:t from an unknown computable distribution μ, our prediction that the next bit is 0:

ξ(ω1:t0) / ξ(ω1:t)

How good is this? Remarkably good. Solomonoff (1978) proved the following convergence theorem:

Solomonoff's convergence theorem: For any computable probability measure μ, the total squared prediction error is bounded: ∑t=1 St ≤ (ln 2 / 2) K(μ). That is, the total of all prediction errors over the entire infinite sequence is bounded by a constant proportional to the complexity of μ.

This is extraordinary. It implies rapid convergence for any computable distribution. If the true distribution has a short program, convergence is extremely fast. If it has a long program, it takes longer, but is still bounded.

The proof's key step uses the dominance property: since ξ(x) ≥ 2-K(μ) μ(x) for all x, the KL divergence between μ and ξ can be bounded, giving the result.

Although Solomonoff induction is incomputable, it connects to practical methods. Setting a uniform prior recovers maximum likelihood estimation. Other special cases recover Minimum Description Length (MDL), Minimum Message Length (MML), and maximum entropy prediction. Solomonoff induction is the ideal — all practical methods are approximations of it.

Check: What does Solomonoff's convergence theorem guarantee?

Chapter 7: The Agent-Environment Model

Solomonoff induction is a passive predictor. It observes a sequence but cannot influence it. For intelligence as we defined it, we need active agents that take actions to achieve goals.

The agent-environment model consists of two entities exchanging signals. The agent sends actions (from an alphabet A); the environment returns perceptions (from an alphabet X). Each perception consists of an observation and a reward. The reward signal tells the agent how well it is doing relative to its goal.

The agent is a function π that takes the complete interaction history and outputs the next action. The environment is described by a conditional probability measure μ that specifies how perceptions depend on the history. Together, π and μ define a measure over interaction sequences.

The key framework: This is exactly the reinforcement learning setup. The agent observes, acts, receives reward. Its goal is to maximise cumulative reward. This framework is incredibly general — chess, stock trading, robot navigation, writing a novel, all fit within it. The environment defines the task; the reward defines success.

To formalise "maximise reward over time," we define the expected future discounted reward:

Vγπμ := E [ ∑i=t γi ri ]

where γi are discount weights that prevent the sum from diverging. Geometric discounting (γi = αi for α ∈ (0,1)) is the most common choice. The parameter α controls how much the agent values future vs. immediate reward.

Check: In the agent-environment model, what is the purpose of the reward signal?

Chapter 8: Optimal Informed Agents

Given a known environment μ, the optimal agent πμ is the one that maximises expected future discounted reward. Its action at time t:

atπμ := argmaxatxt maxat+1xt+1 ... maxamxmtrt + ... + γmrm] μ(axt:m)

This is a brute-force search through all possible futures: consider every action, for each action consider every possible response, for each response consider every next action, and so on. Pick the action that leads to the highest expected reward.

This is essentially the Bellman equation from dynamic programming, generalised to arbitrary (non-Markovian) environments with full history dependence. The optimal agent does not assume the environment is Markov — it considers the entire interaction history.

Key insight: The optimal informed agent πμ knows exactly which environment it is in. It is optimal for that specific environment. But this does not meet our definition of intelligence — intelligence requires performing well in a wide range of environments. An agent constructed specifically for chess knows nothing about trading. We need something more general.
Check: Why doesn't the optimal informed agent πμ satisfy our definition of intelligence?

Chapter 9: The AIXI Agent

Here is Hutter's brilliant move: take the optimal informed agent πμ and replace the known environment μ with the universal prior ξ. The result is AIXI (πξ).

atξ := argmaxatxt maxat+1 ... ∑xmtrt + ... + γmrm] ξ(axt:m)

The unknown environment μ has been replaced with ξ, the universal mixture over all computable environments. AIXI does not know which environment it is in, but it considers all possible environments, weighted by their complexity. Simpler environments get more weight (Occam's razor). As AIXI interacts and collects data, the mixture concentrates on environments consistent with observations (Bayes' rule).

Pareto optimality: AIXI is Pareto optimal — no other agent exists that performs at least as well in all environments and strictly better in at least one. It is also balanced Pareto optimal — any improvement in one environment is offset by an equal or greater decrease elsewhere.

An even stronger result: AIXI is self-optimising. In any class of environments that admits a self-optimising agent, AIXI's performance converges to that of the optimal informed agent. Essentially, AIXI learns to behave optimally wherever it is theoretically possible to do so.

The Heaven and Hell example: Imagine two doors. One leads to "heaven" (endless reward), the other to "hell" (no reward), and the choice is irreversible. The informed agent πμ knows which door is which and always chooses heaven. AIXI must guess. In this environment, no general agent can match the informed agent — whatever AIXI chooses, there is a mirror environment where the other door was correct. This does not expose a flaw in AIXI; it shows that matching the informed agent is sometimes impossible for any general agent.

The convergence speed is exponentially worse for active environments (O(2K(μ)) incorrect choices) compared to passive prediction (O(K(μ)) total squared error). Actions have consequences that compound, making active learning fundamentally harder than passive prediction.

Check: What is the key idea behind AIXI?

Chapter 10: Summary

Induction + Bayes
Update beliefs about hypotheses given data
↓ need a universal prior
Kolmogorov + Solomonoff
Weight hypotheses by program length → Occam's razor formalised
↓ extend from prediction to action
Agent-Environment Model
Agent takes actions, environment returns observations + rewards
↓ plug in universal prior
AIXI
Pareto optimal, self-optimising universal agent

AIXI is the theoretical gold standard for intelligence. It is incomputable, but it tells us what intelligence looks like mathematically. The next chapters ask: in which environments does AIXI actually learn optimally? And what are the fundamental limits of computable approximations?

Check: What are the two key ingredients that make AIXI powerful?