Kochenderfer et al., Chapter 16

Model-Based Methods

Build a map of the world from scratch. Learn the transition and reward functions through interaction, then plan with them. The first complete approach to reinforcement learning in MDPs.

Prerequisites: MDPs & value iteration (Ch 7), Exploration & bandits (Ch 15), Dirichlet distributions.
10
Chapters
5
Simulations
10
Quizzes

Chapter 0: The Problem with Unknown MDPs

Imagine you wake up in a strange building. You know you can walk left, right, up, or down. You know some rooms reward you with candy, others with pain. But you have no idea which rooms connect to which, or where the candy is. What do you do?

In a standard MDP, the transition function T(s'|s,a) and reward function R(s,a) are given. Value iteration can then compute the optimal policy exactly. But in reinforcement learning, you don't know T or R. You have to discover them by acting.

Model-based methods address this by learning explicit estimates T̃(s'|s,a) and R̃(s,a) from experience. Once learned, any MDP solver can derive a policy. The cycle is: act, observe, update model, re-plan, repeat.

Act
Take action a in state s, observe r and next state s'
Update Model
Increment count N(s,a,s'). Update reward estimate ρ(s,a).
Re-plan
Solve the estimated MDP to get an updated value function and policy
Explore or Exploit
Use the policy to act, but balance gaining new information
↻ repeat
Model-based vs. model-free: Model-based methods explicitly learn T̃ and R̃, then plan. Model-free methods (Chapter 17) learn value functions directly from experience without ever representing the model. Model-based methods tend to be more sample efficient — they extract more information from each transition — but carry a higher per-step computation cost.

This chapter covers the full spectrum from maximum likelihood models (count transitions, take the mode) through Bayesian methods (maintain full posteriors over model parameters) to posterior sampling (an elegant approximation that naturally balances exploration and exploitation).

What is the core loop of model-based reinforcement learning?

Chapter 1: Maximum Likelihood Models

The simplest and most intuitive approach: count what you see, then normalize. We maintain a transition count N(s,a,s') that records how many times we transitioned from state s to s' after taking action a. The maximum likelihood estimate is:

T̃(s'|s,a) = N(s,a,s') / N(s,a)    where N(s,a) = ∑s'' N(s,a,s'')

If N(s,a) = 0 (we've never tried action a from state s), we have no data. The standard fix is to initialize with a small pseudocount (e.g., N(s,a,s') = 1/|S| for all s') or to default to a uniform distribution over successor states. An alternative: defer computing T̃ for never-visited pairs and treat them as "unknown" — which is exactly the approach R-MAX formalizes in Chapter 5.

A key property: T̃(s'|s,a) is the maximum likelihood estimator of the categorical distribution over successor states. This estimator is unbiased (its expectation equals the true probability) and consistent (it converges to the true probability as N(s,a) → ∞). However, with finite data, it can be severely wrong — especially for transitions with small probability. A transition that truly occurs 1% of the time requires ~100 visits to estimate reliably.

Similarly, we accumulate total reward ρ(s,a) and estimate:

R̃(s,a) = ρ(s,a) / N(s,a)

This yields an estimated MDP (S, A, T̃, R̃, γ) that we can hand to any planning algorithm. As we collect more experience, the estimates converge to the true model.

Comparing maximum likelihood to Bayesian estimation highlights a key engineering tradeoff. ML is simpler and has lower computational overhead: just increment a counter and divide. Bayesian estimation requires maintaining a full Dirichlet posterior and sampling from it. However, Bayesian estimation provides several advantages that often justify the overhead in practice: it naturally encodes uncertainty, prevents zero-probability estimates for unseen transitions, and enables principled posterior sampling for exploration.

The convergence rate of the ML estimate to the true model depends on the mixing properties of the MDP. For an ergodic Markov chain (every state is reachable from every other state under the optimal policy), the ML estimate T̃ converges at rate O(1/√N(s,a)) for each state-action pair — the standard concentration rate for multinomial distributions. For non-ergodic chains, some state-action pairs may never be visited under the greedy policy, requiring deliberate exploration to estimate them.

The hex world example from the text: After 10 episodes of 10 steps, the estimated transition matrices are sparse and noisy — most entries are zero because we haven't visited most transitions yet. After 1000 episodes, the estimates closely approximate the true probabilities. The rate of convergence depends on how much the policy explores.
ExperienceN(s,a,s')T̃(s'|s,a)Accuracy
1 transition1 entry = 11.0 for observed s', 0 elsewhereVery poor
10 transitionsSparse countsOnly seen transitions non-zeroPoor
100 transitionsMost entries filledRough approximationModerate
1000+ transitionsDense countsClose to true probabilitiesGood
Transition Count Accumulation

A 4-state MDP with 2 actions. The left heatmap shows the learned T̃(s'|s,a=0); the right heatmap shows the true T. Watch them converge as you accumulate experience. Rows=current state s, Columns=next state s'.

Steps: 0
Why this works mathematically: For a fixed state-action pair (s,a), the transition N(s,a,·) is a multinomial distribution. The maximum likelihood estimator for a multinomial is exactly the observed frequencies. By the law of large numbers, T̃(s'|s,a) → T(s'|s,a) as N(s,a) → ∞.

Here is the complete data structure for a maximum likelihood model, and how to update it:

python
import numpy as np

class MaxLikelihoodModel:
    def __init__(self, n_states, n_actions):
        self.n_states = n_states
        self.n_actions = n_actions
        # N[s, a, s'] = transition count from s to s' via action a
        self.N = np.zeros((n_states, n_actions, n_states))
        # rho[s, a] = cumulative reward from (s, a)
        self.rho = np.zeros((n_states, n_actions))

    def update(self, s, a, r, s_next):
        # One new (s, a, r, s') transition observed
        self.N[s, a, s_next] += 1
        self.rho[s, a] += r

    def T_hat(self, s, a):
        # Estimated T(s'|s,a) as probability vector
        total = self.N[s, a].sum()
        if total == 0:
            return np.ones(self.n_states) / self.n_states  # uniform prior
        return self.N[s, a] / total

    def R_hat(self, s, a):
        # Estimated R(s,a)
        total = self.N[s, a].sum()
        if total == 0:
            return 0.0
        return self.rho[s, a] / total

    def value_iteration(self, gamma=0.95, tol=1e-6):
        # Solve estimated MDP to get value function and policy
        U = np.zeros(self.n_states)
        while True:
            U_new = np.zeros(self.n_states)
            for s in range(self.n_states):
                Q_sa = [self.R_hat(s, a) + gamma * self.T_hat(s, a) @ U
                        for a in range(self.n_actions)]
                U_new[s] = max(Q_sa)
            if np.max(np.abs(U_new - U)) < tol:
                break
            U = U_new
        pi = [np.argmax([self.R_hat(s, a) + gamma * self.T_hat(s, a) @ U
                          for a in range(self.n_actions)])
               for s in range(self.n_states)]
        return U, pi
Convergence guarantee: Under mild conditions (all states reachable, exploration covers all state-action pairs), the maximum likelihood model T̃ converges to the true T almost surely as the number of visits to each (s,a) pair tends to infinity. The rate is O(1/√N(s,a)) per entry — the standard Monte Carlo concentration rate. Planning with a converged model then gives the optimal policy, exactly as if T were known.
If N(s,a) = 0 for some state-action pair, the maximum likelihood estimate is undefined. What is the recommended fix?

Chapter 2: Update Schemes — When to Re-Plan

Every time we observe a new transition, our model changes. But re-solving the MDP from scratch is expensive. How often should we re-plan, and how should we do it?

Three schemes cover the spectrum from costly-but-accurate to cheap-but-approximate:

SchemeIdeaCost per stepConvergence
Full UpdateRe-solve the entire MDP after every new transition using value iteration or an LPO(|S|2|A|) × iterationsAlways current, optimal policy
Randomized UpdateBellman backup at the visited state, plus m randomly chosen additional statesO(m · |S| · |A|)Slower propagation, cheap
Prioritized SweepingUpdate the states most affected by the new information, using a priority queueO(k · |S| · |A|) for k high-priority statesFast propagation on affected states

Randomized updates (the basis of Sutton's Dyna architecture) perform a Bellman update at the visited state plus m additional randomly chosen states. Each Bellman update is:

U(s) ← maxa [ R̃(s,a) + γ ∑s' T̃(s'|s,a) U(s') ]

This spreads value information without the cost of a full solve. When m = 0, we only update the visited state. When m = |S| × |A|, we do a full update. Typical values of m are 5–50.

Dyna-Q intuition: Think of the m random updates as "daydreaming." After taking an action, the agent mentally replays m random (s,a) pairs from its model, updating value estimates as if those transitions had just occurred. These imaginary experiences propagate value information throughout the state space without any additional real-world interaction.

Here is a concrete Dyna-Q worked example with m = 5 random backups. We have a 3-state chain (s0 → s1 → s2). The goal s2 gives reward +1. All actions move right with prob 0.9. After real step (s0, right, +0, s1), we perform 5 random backups:

python
# Dyna-Q: 3-state chain, gamma=0.9
# After real step (s0, right, 0, s1):
U = {0: 0.0, 1: 0.0, 2: 1.0}   # s2 just discovered, reward=1
N = {(0,0): 5, (1,0): 3}         # visit counts so far
model_T = {(0,0): {1: 1.0},        # s0->s1 every time
           (1,0): {2: 0.9, 1: 0.1}}  # s1->s2 with prob 0.9
model_R = {(0,0): 0.0, (1,0): 0.0}  # no reward until s2

# Real step: update (s1, right) Bellman backup
U[1] = 0.0 + 0.9 * (0.9*U[2] + 0.1*U[1]) # = 0.9*(0.9+0) = 0.81
print(f"After real step: U[1] = {U[1]:.3f}")   # U[1] = 0.81

# m=5 random backups from model
import random
for _ in range(5):
    sa = random.choice(list(model_T.keys()))  # random (s,a) pair
    s, a = sa
    r = model_R[sa]
    future = sum(p*U[sp] for sp,p in model_T[sa].items())
    U[s] = r + 0.9 * future

# After random backups, U[0] gets updated whenever (s0,a) is sampled:
# U[0] = 0 + 0.9 * 1.0 * U[1] = 0.9 * 0.81 = 0.729
print(f"After backups: U[0] = {U[0]:.3f}, U[1] = {U[1]:.3f}")
# Output: U[0] = 0.729 (value propagated 2 steps in 1 real step + 5 imaginary)
What m controls: With m = 0 (just the real step), only U[s1] gets updated per real step. With m = 5, U[s0] also gets updated if (s0, right) is sampled in the random backups — propagating value 2 steps in one real step. The optimal value U[s0] = γ2 = 0.81 is reached much faster with higher m, at the cost of more computation per step.
In the randomized update scheme, what happens when m = 0 (no random updates beyond the visited state)?

Chapter 3: Prioritized Sweeping

Random updates treat all states equally. But when we discover a large change in the value of some state (e.g., we finally reach the goal and find it has a reward of +100), only a few states are directly affected: those that can reach the changed state. Prioritized sweeping focuses work exactly there.

The priority of a predecessor state-action pair (s, a) is proportional to the expected magnitude of its value change:

priority(s, a) = T̃(s | s, a) · |Unew(s) − Uold(s)|

The larger the change in U(s) and the more likely the transition from (s, a) to s, the higher the priority. We maintain a priority queue ordered by this score and always update the highest-priority state next.

A concrete worked example. 3-state chain s0 → s1 → s2. We've just discovered s2 gives reward +10, changing U(s2) from 0 to 10. How does this propagate?

StepState updatedWhyΔUPriority pushed
1 (real step)s2Just visited, reward discovered0 → 10Push s1 with priority T(s2|s1,right)×|ΔU| = 0.9×10 = 9
2 (queue pop)s1Highest priority in queue0 → 8.91Push s0 with priority 0.9×8.91 ≈ 8.02
3 (queue pop)s0Next in queue0 → 7.20No predecessors of s0

In 3 updates (1 real + 2 queue), value has propagated from goal all the way back to s0. Random updates would have needed many more steps to accomplish the same propagation.

Step 1: Take a real step
Observe (s, a, r, s'). Update model. Compute |ΔU(s')| and enqueue predecessors of s'.
Step 2: Pop highest-priority state
Perform Bellman backup. Update U(s). Enqueue all predecessors of s with new priorities.
Step 3: Repeat k times
Continue popping and updating until queue is empty or k steps are reached
↻ return to real environment interaction
Why prioritized sweeping works: Value changes in reinforcement learning propagate backward through the state space. When the goal is discovered, predecessors of the goal get high values, then their predecessors, and so on. Prioritized sweeping follows this propagation wave efficiently, updating only the states in the "blast radius" of the discovery.

Here is a Python implementation that highlights the priority queue structure. Notice how the same model data structure from the maximum likelihood chapter is reused — prioritized sweeping is a drop-in improvement to the update scheme:

python
import heapq

class PrioritizedSweepingAgent:
    def __init__(self, n_states, n_actions, gamma=0.9, threshold=0.01, k=10):
        self.n_states = n_states
        self.n_actions = n_actions
        self.gamma = gamma
        self.threshold = threshold  # min priority to enter queue
        self.k = k                  # max updates per real step
        self.N = {}                 # {(s,a,s'): count}
        self.rho = {}               # {(s,a): cumulative reward}
        self.U = [0.0] * n_states  # value function
        self.predecessors = {s: [] for s in range(n_states)}  # s -> [(s-, a-)]
        self.pq = []                # max-heap (negate priority for heapq)

    def update_model(self, s, a, r, sp):
        key = (s, a, sp)
        self.N[key] = self.N.get(key, 0) + 1
        self.rho[(s,a)] = self.rho.get((s,a), 0.0) + r
        if (s, a) not in self.predecessors.get(sp, []):
            self.predecessors.setdefault(sp, []).append((s, a))
        # Compute Bellman error at sp
        new_U_sp = self._bellman(sp)
        delta = abs(new_U_sp - self.U[sp])
        self.U[sp] = new_U_sp
        # Push predecessors with their expected priority
        for (ps, pa) in self.predecessors.get(sp, []):
            cnt = sum(self.N.get((ps,pa,spp),0) for spp in range(self.n_states))
            p = (self.N.get((ps,pa,sp),0)/cnt if cnt>0 else 0) * delta
            if p > self.threshold:
                heapq.heappush(self.pq, (-p, ps, pa))
        # k synchronous updates from priority queue
        for _ in range(self.k):
            if not self.pq: break
            _, qs, _ = heapq.heappop(self.pq)
            self.U[qs] = self._bellman(qs)

    def _bellman(self, s):
        best = -1e9
        for a in range(self.n_actions):
            cnt = sum(self.N.get((s,a,sp),0) for sp in range(self.n_states))
            if cnt == 0: continue
            r = self.rho.get((s,a),0)/cnt
            future = sum(self.N.get((s,a,sp),0)/cnt*self.U[sp]
                         for sp in range(self.n_states))
            best = max(best, r + self.gamma*future)
        return best if best > -1e9 else 0.0
Prioritized Sweeping vs. Random Updates

A 5×5 grid. The goal (bottom-right, yellow) gives +10 reward. The value function V(s) starts at 0. Compare how quickly value propagates backward from the goal under each scheme. Brighter = higher value.

Updates: 0
In prioritized sweeping, a state s gets high priority when updated. What two factors determine its priority?

Chapter 4: Exploration in Full MDPs

The exploration strategies from Chapter 15 (bandit algorithms) can be adapted to full MDPs, but with an important caveat: in a multi-state MDP, exploration is not just about trying unknown actions from the current state. It also means navigating to unexplored parts of the state space.

The simplest approach is ε-greedy: with probability ε, take a random action; otherwise, act greedily with respect to the current estimated value function. This provides local exploration but doesn't deliberately navigate to underexplored regions.

The key insight: A bandit algorithm asks "which arm should I pull now?" An MDP explorer asks "which state-action pair is most worth visiting, and how do I get there?" These are fundamentally different questions. The best exploration strategies in MDPs plan a path to unexplored regions, not just try random things at the current state.

More sophisticated approaches provide Probably Approximately Correct (PAC) guarantees: after at most some polynomial number of steps, the agent's policy is ε-optimal with probability at least 1−δ. Such bounds give rigorous sample complexity guarantees.

The PAC-MDP framework formalizes what it means to "learn efficiently." For an MDP with |S| states, |A| actions, discount γ, and desired accuracy ε with confidence 1−δ, the R-MAX algorithm achieves ε-optimality after at most:

O( |S|2|A| / ((1-γ)6ε4) · log(|S||A|/δ) )

steps. While this polynomial bound looks large, it is dramatically better than exponential-time exploration strategies. Key takeaway: the bound scales polynomially in all relevant parameters. No algorithm can do better than polynomial in the worst case (information-theoretically), so PAC-MDP algorithms like R-MAX are optimal in this sense.

To put the bound in context: with |S|=10, |A|=4, γ=0.9, ε=0.1, δ=0.05, R-MAX is guaranteed to find an 0.1-optimal policy after at most ~106 steps. This sounds large, but it's a worst-case guarantee. On typical problems, convergence happens orders of magnitude faster — as the showcase simulation in Chapter 9 demonstrates. The PAC bound matters most for safety-critical applications where you need a provable guarantee on worst-case performance, not just average-case empirical results.

Regret vs. PAC: R-MAX satisfies a PAC-MDP bound (after K steps, the policy is ε-optimal). PSRL satisfies a Bayesian regret bound (the expected cumulative suboptimality over T steps is O(√T)). These are different notions of learning quality. The PAC bound is more conservative (applies to each individual run); the regret bound averages over random instances and gives a tighter characterization of typical behavior. For domains where we care about per-episode performance from the very first episode, regret bounds are more relevant.
When to use each method in practice:
  • Small MDP, safety-critical: R-MAX. PAC-MDP guarantee means a provable bound on how badly things go during learning.
  • Small MDP, sample efficiency matters: PSRL. Near-optimal regret, no exploration hyperparameter to tune.
  • Large MDP, limited computation: ML + randomized updates (Dyna-Q style). Simple to implement and tune; good empirical performance.
  • Large MDP, good planning available: ML + prioritized sweeping. Same model quality as ML, faster value propagation.
StrategyMechanismLimitation
ε-GreedyRandom action with prob εExplores locally, may not reach distant underexplored states
Optimistic InitInitialize Q(s,a) = QmaxNo principled bound on sample complexity
UCB-style (per state)Bonus ~ 1/√N(s,a)Does not account for navigating to new states
R-MAX (next chapter)Inflated rewards for unknown state-action pairsConservative, but has PAC guarantee
Why is applying bandit exploration strategies independently at each state insufficient for MDP exploration?

Chapter 5: R-MAX — Optimism in the Face of Uncertainty

R-MAX solves the exploration problem by turning it into a planning problem. The key idea: if we have visited (s,a) fewer than m times, we don't know its true value. So we pretend it's the best possible state-action pair — we assign it the maximum possible reward rmax and model it as a self-loop (it returns to s).

R̃(s,a) = rmax   if N(s,a) < m,    else  ρ(s,a)/N(s,a)
T̃(s'|s,a) = 1[s'=s]   if N(s,a) < m,    else   N(s,a,s')/N(s,a)

With these inflated rewards, the value of an unknown (s,a) pair is rmax/(1−γ), which dominates any achievable value in the real environment. So the greedy policy (with no separate exploration mechanism) will naturally navigate to unexplored state-action pairs first.

The elegance of R-MAX: There is no separate "exploration strategy." Exploration emerges from greedy behavior with respect to the optimistic model. The agent is not randomly wandering — it is efficiently planning the path to the most valuable unknown state-action pair. Once a pair has been visited m times, its real estimate is used and it loses its artificial appeal.

R-MAX has a provable PAC-MDP guarantee: after a polynomial number of steps in |S|, |A|, 1/ε, 1/δ, and 1/(1−γ), the agent's policy is ε-optimal with probability at least 1−δ. This is a remarkably strong guarantee compared to heuristic exploration.

A concrete R-MAX worked example. 4-state corridor: s0→s1→s2→s3(goal, +100). Deterministic transitions. γ=0.9, rmax=100, m=3. Agent starts at s0.

StepStateAction (greedy)N(s,right)R̃(s,right)Reason
1–3s0right0→3100 (optimistic)N<m: unknown, optimistic reward appeals to greedy
4–6s1right0→3100 (optimistic)s0 now "known"; s1 still optimistic
7–9s2right0→3100 (optimistic)Systematically exploring the corridor
10s3True +100Goal reached! Now R̃ real, plan toward s3
11+s0right (exploit)N≥m allReal estimatesSystematic exploitation to goal

Total steps to reach the goal: ~12 (3 per state × 4 states, rounded up). Compare to ε=0.15 greedy: with probability 0.853≈0.61, the agent walks straight to s3 without exploring side transitions. But with probability 0.39, it wastes steps taking random left actions that don't advance. R-MAX reaches the goal deterministically in O(|S|×m) steps — no randomness required.

python
class RMAXAgent:
    def __init__(self, n_states, n_actions, r_max, m, gamma):
        self.n_states = n_states
        self.n_actions = n_actions
        self.r_max = r_max      # maximum possible reward
        self.m = m              # threshold: "known" after m visits
        self.gamma = gamma
        self.N = np.zeros((n_states, n_actions, n_states))
        self.rho = np.zeros((n_states, n_actions))
        self.U = np.full(n_states, r_max / (1 - gamma))  # optimistic init

    def T_hat(self, s, a):
        total = self.N[s, a].sum()
        if total < self.m:
            # Unknown: self-loop (agent "stays" in s with reward r_max)
            T = np.zeros(self.n_states)
            T[s] = 1.0
            return T
        return self.N[s, a] / total

    def R_hat(self, s, a):
        total = self.N[s, a].sum()
        if total < self.m:
            return self.r_max  # optimistic reward for unknown (s,a)
        return self.rho[s, a] / total

    def act(self, s):
        # Greedy w.r.t. optimistic model — no separate exploration needed
        Q = [self.R_hat(s, a) + self.gamma * self.T_hat(s, a) @ self.U
             for a in range(self.n_actions)]
        return np.argmax(Q)

    def update(self, s, a, r, s_next):
        self.N[s, a, s_next] += 1
        self.rho[s, a] += r
        # Re-solve the optimistic MDP (here: one Bellman update)
        Q = [self.R_hat(s, a) + self.gamma * self.T_hat(s, a) @ self.U
             for a in range(self.n_actions)]
        self.U[s] = max(Q)
R-MAX vs. ε-Greedy: Exploration Coverage

A 5×5 grid world with goal at bottom-right. Run each explorer for 200 steps and compare coverage. Brighter = more visits. R-MAX systematically covers the grid; ε-greedy wanders.

m (R-MAX threshold) 3
Why does R-MAX not need a separate exploration strategy on top of greedy action selection?

Chapter 6: Bayesian Model Learning

Maximum likelihood gives us a point estimate for the model. But a point estimate discards crucial information: how uncertain we are. After 2 observations, T̃ = 0.5 is very uncertain. After 2000 observations, T̃ = 0.5 is very reliable. The ML estimate looks the same; our uncertainty doesn't.

Bayesian model learning maintains a full posterior distribution over model parameters. For discrete MDPs, each state-action pair's transition probabilities are a categorical distribution, and its Bayesian conjugate prior is a Dirichlet distribution:

s,a) ~ Dir(N(s,a,1)+α0, N(s,a,2)+α0, …, N(s,a,|S|)+α0)

where α0 is the prior pseudocount (typically 1 for a uniform prior). After observing a transition from (s,a) to s', the update is beautifully simple:

N(s,a,s') ← N(s,a,s') + 1

That's it. The Dirichlet is its own conjugate prior: a Dirichlet prior plus multinomial observations gives a Dirichlet posterior. The posterior mean — the expected value of T(s'|s,a) — is:

E[T(s'|s,a)] = N(s,a,s') / N(s,a)    where N(s,a) = ∑s'' N(s,a,s'')

For the complete MDP, the full belief is the product of |S|·|A| independent Dirichlet distributions (one per state-action pair). Uncertainty is encoded in the concentration of the Dirichlet: low counts mean a flat, spread distribution (high uncertainty); high counts mean a concentrated distribution (low uncertainty).

Example from the text: Starting with uniform priors (all counts = 1), after observing 5 transitions from (s1,a1) — 1 self-transition and 4 transitions to s3 — the posterior is Dir([2,1,5]). The mean is [2/8, 1/8, 5/8] = [0.25, 0.125, 0.625]. Compare to the ML estimate from 5 samples: [0.2, 0, 0.8]. The Bayesian posterior never assigns zero probability to unseen successors, thanks to the prior pseudocounts.

The variance of the Dirichlet posterior gives a direct measure of how uncertain we are about each transition probability. For Dir([α1, α2, ..., αk]) with sum α0 = ∑αi:

Var[T(s'|s,a)] = αs'0−αs') / (α020+1))

With only 1 observation each (uniform prior, N=3 states): α0=3, Var = 1×2/(9×4) = 2/36 ≈ 0.056. After 100 observations (50-30-20 split): α0=103, Var(T(s'=1)) = 51×52/(1032×104) ≈ 0.0024. The variance shrinks by a factor of ~23 with 100 more observations. This variance is what PSRL exploits — high variance Dirichlets produce diverse samples, driving exploration.

python
import numpy as np
from scipy.stats import dirichlet

class BayesianModel:
    def __init__(self, n_states, n_actions, alpha0=1.0):
        # Dirichlet prior counts: N[s, a, s'] starts at alpha0
        self.N = np.full((n_states, n_actions, n_states), alpha0)
        self.rho = np.zeros((n_states, n_actions))  # reward accumulator
        self.n_states = n_states
        self.n_actions = n_actions

    def update(self, s, a, r, s_next):
        # Bayesian update = just increment the count (conjugate prior!)
        self.N[s, a, s_next] += 1
        self.rho[s, a] += r

    def sample_model(self):
        # Sample T[s,a] ~ Dirichlet(N[s,a]) for each (s,a)
        T_sample = np.zeros_like(self.N)
        for s in range(self.n_states):
            for a in range(self.n_actions):
                T_sample[s, a] = dirichlet.rvs(self.N[s, a])[0]
        return T_sample

    def posterior_mean(self):
        # E[T(s'|s,a)] = N[s,a,s'] / sum_s'' N[s,a,s'']
        counts_sum = self.N.sum(axis=-1, keepdims=True)
        return self.N / counts_sum

# Posterior Sampling RL (PSRL) main loop
def psrl(env, n_episodes, gamma=0.9):
    model = BayesianModel(env.n_states, env.n_actions)
    for episode in range(n_episodes):
        # Step 1: Sample a complete MDP from the posterior
        T_sample = model.sample_model()
        R_hat = model.rho / model.N.sum(axis=-1)  # mean reward estimate

        # Step 2: Solve the sampled MDP
        U = np.zeros(env.n_states)
        for _ in range(100):  # value iteration
            Q = R_hat + gamma * np.einsum('san,n->sa', T_sample, U)
            U = Q.max(axis=1)
        pi = Q.argmax(axis=1)

        # Step 3: Act according to the sampled policy for one episode
        s = env.reset()
        while not done:
            a = pi[s]
            s_next, r, done = env.step(a)
            # Step 4: Update Dirichlet posteriors
            model.update(s, a, r, s_next)
            s = s_next
Dirichlet Posterior for a 3-State Transition

Adjust transition counts for three successor states. The triangle shows the Dirichlet distribution over (T(s1), T(s2), T(s3)). Orange dots are samples; yellow dot is the posterior mean. More data = tighter cluster.

Count N(s,a,s1) 2
Count N(s,a,s2) 1
Count N(s,a,s3) 5
Bayesian vs. ML summary: Bayesian methods never assign zero probability to unseen transitions (the prior prevents it). They encode uncertainty through the spread of the Dirichlet. ML methods assign zero to unseen transitions and are undefined for never-visited state-action pairs. Both converge to the truth as data accumulates, but Bayesian methods make more principled use of small-data information.

Here is a concrete numeric comparison. True T(s'=1|s=0,a=0) = 0.3. We observe 5 transitions from (s=0,a=0): results are [0,0,0,1,0] (4 zeros, 1 one). Compare estimates:

MethodP(s'=1|s=0,a=0)P(s'=0|s=0,a=0)Uncertainty quantified?
ML (5 obs)1/5 = 0.2004/5 = 0.800No — point estimate only
Bayes, prior α0=1 (5 obs)2/7 ≈ 0.2865/7 ≈ 0.714Yes — Dir([5,2]) spread captures uncertainty
Bayes, prior α0=1 (0 obs)1/2 = 0.5001/2 = 0.500Yes — maximum uncertainty
True value0.3000.700

The Bayesian estimate (0.286) is slightly closer to the truth (0.300) than the ML estimate (0.200) with only 5 samples, because the prior "regularizes" the estimate. More importantly, the Dirichlet Dir([5,2]) encodes our uncertainty: the 95% credible interval for P(s'=1) is approximately [0.08, 0.60] — we know our estimate is unreliable. The ML estimate of 0.200 comes with no such warning.

After observing N(s,a,s') = 0 transitions to s', what does the Bayesian posterior predict for T(s'|s,a) (with a uniform Dirichlet prior with pseudocount α0 = 1)?

Chapter 7: Bayes-Adaptive MDPs

The Bayesian posterior over models opens a profound question: if our state of knowledge about the model is captured by a distribution, why not plan optimally over that distribution? This is the idea behind Bayes-Adaptive MDPs (BAMDPs).

A BAMDP augments the original state space with the current model belief. The hyperstate is the pair (s, φ) where s is the MDP state and φ is the belief over model parameters (the vector of Dirichlet count vectors). Actions change both the underlying state (via the true transition) and the belief (by revealing a new observation).

Original MDP state s
What state are we in? (observable)
+
Belief over model φ
What do we know about T and R? (all Dirichlet count vectors)
↓ combined into
Hyperstate (s, φ)
The BAMDP state. Policy: π(s, φ) → action

Solving the BAMDP exactly yields the Bayes-optimal policy — the policy that optimally trades off exploration (reducing φ uncertainty) and exploitation (maximizing immediate reward). This is the theoretical gold standard for model-based RL.

The curse of the BAMDP: The hyperstate space is enormous. Each Dirichlet has |S| parameters, and there are |S|×|A| Dirichlet distributions, giving a total of |S|2|A| continuous parameters in φ. For a modest MDP with 10 states and 4 actions, φ has 400 dimensions. The BAMDP is solvable exactly only for tiny toy problems.

Despite intractability in general, BAMDPs provide a theoretical framework. Approximate methods (like posterior sampling in the next chapter) can be understood as tractable approximations to the Bayes-optimal BAMDP solution.

The BAMDP horizon trick: For episodic tasks, BAMDPs with finite horizon H can sometimes be solved exactly by backward induction. The key is that φ only changes when new transitions are observed, and for a T-step episode we only observe T transitions. For small MDPs (e.g., 2 states, 2 actions) the hyperstate space has |S|2|A| = 8 Dirichlet parameters, each taking integer values bounded by T — making the problem finite but still exponentially large in T.

A worked example with numbers. Consider a tiny 2-state, 2-action MDP. True T = [[0.7,0.3],[0.4,0.6]]. We have a Dirichlet prior and after 3 observations of action a=0 from state s=0, we saw [s'=0 twice, s'=1 once]. The hyperstate φ(s=0,a=0) = [3, 2] (prior + counts). Bayes-optimal planning in the BAMDP accounts for the information value of taking action a=0 at s=0: learning that it transitions to s'=1 changes future planning. This is the exploration bonus the Bayes-optimal policy exploits automatically.

ObservationN(s=0,a=0,s'=0)N(s=0,a=0,s'=1)E[T(s'=1|s=0,a=0)]95% CI
Prior only110.50[0.03, 0.97]
3 obs (2x s'=0, 1x s'=1)320.40[0.12, 0.74]
10 obs (7x s'=0, 3x s'=1)840.33[0.15, 0.56]
100 obs (70x s'=0, 30x s'=1)71310.30[0.22, 0.39]
What is the "hyperstate" in a Bayes-Adaptive MDP?

Chapter 8: Posterior Sampling for RL (PSRL)

The BAMDP is intractable. But there is an elegant practical approximation: posterior sampling RL (PSRL), also called Thompson sampling for MDPs. The algorithm is surprisingly simple:

1. Sample a model
Draw T̃ ~ Dir(N(s,a,·)) independently for each (s,a). Draw R̃ from its posterior.
2. Solve the sampled MDP
Run value iteration or LP on (S, A, T̃, R̃, γ) to get policy π
3. Act for one episode
Follow π until the episode ends, collecting transitions
4. Update belief
Increment Dirichlet counts N(s,a,s') for each observed transition. Return to step 1.

Why does this work? When the Dirichlet posterior is diffuse (high uncertainty), samples vary widely across MDPs. Different samples lead to different policies, driving diverse exploration. As the posterior concentrates (more data), samples converge to the true MDP, and the policy stabilizes on the optimal behavior.

From the text: Posterior sampling produces models with significantly more non-zero transition probabilities than ML models, because sampling from the Dirichlet spreads probability mass. In the hex world example, the sampled model has many more non-zero transitions than the ML model, leading to more robust planning under uncertainty.

PSRL has strong theoretical guarantees: its Bayesian regret (expected cumulative suboptimality relative to the Bayes-optimal BAMDP policy) is near-optimal. Empirically, it often matches or exceeds more complex model-based algorithms while being simple to implement.

Here is the critical intuition with numbers. Suppose we have a 2-state MDP and are currently uncertain whether s1 is the high-reward state or s2. Our Dirichlet posterior concentrates roughly 50/50 on each. In a single episode of PSRL:

Both paths provide information. The true MDP is explored at exactly the rate our uncertainty warrants. As data accumulates, the posterior sharpens, and sampling becomes more consistent, steering the agent toward the true optimum. Compare this to UCB-style bonuses which require careful manual tuning of the exploration coefficient.

Metricε-GreedyR-MAXPSRL
Exploration mechanismRandom actions (prob ε)Optimistic rewardsSampled MDPs
Hyperparametersε (manually tuned)m, rmaxα0 (prior strength)
Sample complexity boundNone (heuristic)PAC-MDP (polynomial)Near-optimal Bayes regret
Handles stochastic rewards?YesYes (with extension)Yes (Bayesian prior on R)
Implementation complexityVery lowLowLow (Dirichlet + VI)
The exploration-exploitation miracle: Notice that PSRL never explicitly adds an exploration bonus or uses ε-greedy. Exploration emerges naturally: an uncertain posterior produces diverse samples, each leading to different behavior. There is no hyperparameter controlling the exploration-exploitation tradeoff — the posterior handles it automatically.
How does PSRL (posterior sampling RL) handle the exploration-exploitation tradeoff?

Chapter 9: Showcase — Model-Based RL in a Grid World

Watch model-based RL in action on a 5×5 grid world. The agent starts at the top-left and must reach the goal (bottom-right, +10 reward). Each step costs −0.1. The transition model is unknown — the agent must learn it from scratch.

Choose between ML + random updates, ML + prioritized sweeping, and posterior sampling. Watch the model being learned, the value function evolving, and the agent's path improve over episodes.

Full Model-Based RL Simulation

Left: value function V(s) (brightness = value). Right: visit count heatmap. The agent (orange dot) explores and learns. Goal = yellow star.

Algorithm
Episode speed Medium
Ep: 0 | Steps: 0 | Reward: 0
DP on hyperstate
MethodModel TypeUpdate SchemeExplorationGuarantee
ML + Full UpdatePoint estimateRe-solve entire MDPSeparate (ε-greedy)None
ML + RandomizedPoint estimateDyna-styleSeparate (ε-greedy)None
ML + PrioritizedPoint estimateValue wave propagationSeparate (ε-greedy)None
R-MAXOptimistic point est.AnyBuilt-in (optimism)PAC-MDP
BAMDP (exact)Dirichlet posteriorOptimal (intractable)Bayes-optimal
PSRLDirichlet posteriorSample + solveAutomatic via samplingNear-optimal regret
Looking ahead: Chapter 17 takes the opposite approach — model-free methods like Q-learning and SARSA learn value functions directly from (s,a,r,s') tuples without ever estimating T or R. Model-free methods have lower per-step cost but typically require more experience to converge.

Benchmarking the four methods in the 5×5 grid world simulation above reveals the following typical convergence characteristics (these are empirical observations, not theoretical bounds):

MethodEpisodes to convergenceFinal avg reward/episodeNotes
ML + Random (m=10)~80–120−0.5 to 0.2Simple, fast per step, slower convergence
ML + Prioritized~50–80−0.2 to 0.5Faster propagation, same model quality
R-MAX (m=3)~30–500.1 to 0.8Systematic exploration pays off early
PSRL~40–700.0 to 0.6Automatic exploration; variance between runs is higher
Sample complexity intuition: Why does PSRL need fewer samples than ε-greedy? Consider a 4-state MDP where one state-action pair has reward 100 but is hard to reach. ε-greedy wastes many random actions exploring states we've already characterized well. PSRL automatically focuses on undercharacterized state-action pairs: uncertain Dirichlet posteriors sample diverse models, and some of those sampled models will predict high value for the underexplored state-action pair, driving the policy toward it efficiently. This is the Thompson sampling principle: exploration proportional to uncertainty.

Practical model selection guide. The choice among these methods often comes down to three practical considerations: (1) how long does the agent have to explore before exploitation matters, (2) is there a budget on compute per step, and (3) do you need formal guarantees or are empirical convergence properties sufficient?

ConsiderationChooseRationale
Hard time budget (must converge in N steps)R-MAXPAC guarantee: polynomial sample complexity regardless of initial conditions
Unknown state space, partial observabilityPSRLBayesian exploration adapts naturally; no need to prespecify m threshold
Small discrete MDP, fast compute per stepPrioritized sweeping + ε-greedySimple, interpretable, low overhead per step
Reward shaping possible (domain knowledge)ML + Dyna-QShaped rewards guide exploration; model-based planning amplifies each real step
Continuous state with known parametric structureGaussian process model-based RLBeyond this chapter; but Bayesian principles from PSRL extend naturally
In terms of sample complexity, which guarantee is strongest?
← Ch 15: Exploration Ch 17: Model-Free Methods →