Model-Based Methods — Kochenderfer et al., Chapter 16

Chapter 0: The Problem with Unknown MDPs

Imagine you wake up in a strange building. You know you can walk left, right, up, or down. You know some rooms reward you with candy, others with pain. But you have no idea which rooms connect to which, or where the candy is. What do you do?

In a standard MDP, the transition function T(s'|s,a) and reward function R(s,a) are given. Value iteration can then compute the optimal policy exactly. But in reinforcement learning, you don't know T or R. You have to discover them by acting.

Model-based methods address this by learning explicit estimates T̃(s'|s,a) and R̃(s,a) from experience. Once learned, any MDP solver can derive a policy. The cycle is: act, observe, update model, re-plan, repeat.

Act

Take action a in state s, observe r and next state s'

↓

Update Model

Increment count N(s,a,s'). Update reward estimate ρ(s,a).

↓

Re-plan

Solve the estimated MDP to get an updated value function and policy

↓

Explore or Exploit

Use the policy to act, but balance gaining new information

↻ repeat

Model-based vs. model-free: Model-based methods explicitly learn T̃ and R̃, then plan. Model-free methods (Chapter 17) learn value functions directly from experience without ever representing the model. Model-based methods tend to be more sample efficient — they extract more information from each transition — but carry a higher per-step computation cost.

This chapter covers the full spectrum from maximum likelihood models (count transitions, take the mode) through Bayesian methods (maintain full posteriors over model parameters) to posterior sampling (an elegant approximation that naturally balances exploration and exploitation).

What is the core loop of model-based reinforcement learning?

Act, update model from observations, re-plan with the estimated MDP Run value iteration directly on the true environment Learn a value function by repeatedly sampling random actions

Chapter 1: Maximum Likelihood Models

The simplest and most intuitive approach: count what you see, then normalize. We maintain a transition count N(s,a,s') that records how many times we transitioned from state s to s' after taking action a. The maximum likelihood estimate is:

T̃(s'|s,a) = N(s,a,s') / N(s,a) where N(s,a) = ∑_s'' N(s,a,s'')

If N(s,a) = 0 (we've never tried action a from state s), we have no data. The standard fix is to initialize with a small pseudocount (e.g., N(s,a,s') = 1/|S| for all s') or to default to a uniform distribution over successor states. An alternative: defer computing T̃ for never-visited pairs and treat them as "unknown" — which is exactly the approach R-MAX formalizes in Chapter 5.

A key property: T̃(s'|s,a) is the maximum likelihood estimator of the categorical distribution over successor states. This estimator is unbiased (its expectation equals the true probability) and consistent (it converges to the true probability as N(s,a) → ∞). However, with finite data, it can be severely wrong — especially for transitions with small probability. A transition that truly occurs 1% of the time requires ~100 visits to estimate reliably.

Similarly, we accumulate total reward ρ(s,a) and estimate:

R̃(s,a) = ρ(s,a) / N(s,a)

This yields an estimated MDP (S, A, T̃, R̃, γ) that we can hand to any planning algorithm. As we collect more experience, the estimates converge to the true model.

Comparing maximum likelihood to Bayesian estimation highlights a key engineering tradeoff. ML is simpler and has lower computational overhead: just increment a counter and divide. Bayesian estimation requires maintaining a full Dirichlet posterior and sampling from it. However, Bayesian estimation provides several advantages that often justify the overhead in practice: it naturally encodes uncertainty, prevents zero-probability estimates for unseen transitions, and enables principled posterior sampling for exploration.

The convergence rate of the ML estimate to the true model depends on the mixing properties of the MDP. For an ergodic Markov chain (every state is reachable from every other state under the optimal policy), the ML estimate T̃ converges at rate O(1/√N(s,a)) for each state-action pair — the standard concentration rate for multinomial distributions. For non-ergodic chains, some state-action pairs may never be visited under the greedy policy, requiring deliberate exploration to estimate them.

The hex world example from the text: After 10 episodes of 10 steps, the estimated transition matrices are sparse and noisy — most entries are zero because we haven't visited most transitions yet. After 1000 episodes, the estimates closely approximate the true probabilities. The rate of convergence depends on how much the policy explores.

Experience	N(s,a,s')	T̃(s'\|s,a)	Accuracy
1 transition	1 entry = 1	1.0 for observed s', 0 elsewhere	Very poor
10 transitions	Sparse counts	Only seen transitions non-zero	Poor
100 transitions	Most entries filled	Rough approximation	Moderate
1000+ transitions	Dense counts	Close to true probabilities	Good

Transition Count Accumulation

A 4-state MDP with 2 actions. The left heatmap shows the learned T̃(s'|s,a=0); the right heatmap shows the true T. Watch them converge as you accumulate experience. Rows=current state s, Columns=next state s'.

Steps: 0

Why this works mathematically: For a fixed state-action pair (s,a), the transition N(s,a,·) is a multinomial distribution. The maximum likelihood estimator for a multinomial is exactly the observed frequencies. By the law of large numbers, T̃(s'|s,a) → T(s'|s,a) as N(s,a) → ∞.

Here is the complete data structure for a maximum likelihood model, and how to update it:

python
import numpy as np

class MaxLikelihoodModel:
    def __init__(self, n_states, n_actions):
        self.n_states = n_states
        self.n_actions = n_actions
        # N[s, a, s'] = transition count from s to s' via action a
        self.N = np.zeros((n_states, n_actions, n_states))
        # rho[s, a] = cumulative reward from (s, a)
        self.rho = np.zeros((n_states, n_actions))

    def update(self, s, a, r, s_next):
        # One new (s, a, r, s') transition observed
        self.N[s, a, s_next] += 1
        self.rho[s, a] += r

    def T_hat(self, s, a):
        # Estimated T(s'|s,a) as probability vector
        total = self.N[s, a].sum()
        if total == 0:
            return np.ones(self.n_states) / self.n_states  # uniform prior
        return self.N[s, a] / total

    def R_hat(self, s, a):
        # Estimated R(s,a)
        total = self.N[s, a].sum()
        if total == 0:
            return 0.0
        return self.rho[s, a] / total

    def value_iteration(self, gamma=0.95, tol=1e-6):
        # Solve estimated MDP to get value function and policy
        U = np.zeros(self.n_states)
        while True:
            U_new = np.zeros(self.n_states)
            for s in range(self.n_states):
                Q_sa = [self.R_hat(s, a) + gamma * self.T_hat(s, a) @ U
                        for a in range(self.n_actions)]
                U_new[s] = max(Q_sa)
            if np.max(np.abs(U_new - U)) < tol:
                break
            U = U_new
        pi = [np.argmax([self.R_hat(s, a) + gamma * self.T_hat(s, a) @ U
                          for a in range(self.n_actions)])
               for s in range(self.n_states)]
        return U, pi

Convergence guarantee: Under mild conditions (all states reachable, exploration covers all state-action pairs), the maximum likelihood model T̃ converges to the true T almost surely as the number of visits to each (s,a) pair tends to infinity. The rate is O(1/√N(s,a)) per entry — the standard Monte Carlo concentration rate. Planning with a converged model then gives the optimal policy, exactly as if T were known.

If N(s,a) = 0 for some state-action pair, the maximum likelihood estimate is undefined. What is the recommended fix?

Set the reward to −∞ to discourage visiting that state Default to a uniform distribution over successor states (or add pseudocounts to avoid division by zero) Remove that state-action pair from the MDP

Chapter 2: Update Schemes — When to Re-Plan

Every time we observe a new transition, our model changes. But re-solving the MDP from scratch is expensive. How often should we re-plan, and how should we do it?

Three schemes cover the spectrum from costly-but-accurate to cheap-but-approximate:

Scheme	Idea	Cost per step	Convergence
Full Update	Re-solve the entire MDP after every new transition using value iteration or an LP	O(\|S\|²\|A\|) × iterations	Always current, optimal policy
Randomized Update	Bellman backup at the visited state, plus m randomly chosen additional states	O(m · \|S\| · \|A\|)	Slower propagation, cheap
Prioritized Sweeping	Update the states most affected by the new information, using a priority queue	O(k · \|S\| · \|A\|) for k high-priority states	Fast propagation on affected states

Randomized updates (the basis of Sutton's Dyna architecture) perform a Bellman update at the visited state plus m additional randomly chosen states. Each Bellman update is:

U(s) ← max_a [ R̃(s,a) + γ ∑_s' T̃(s'|s,a) U(s') ]

This spreads value information without the cost of a full solve. When m = 0, we only update the visited state. When m = |S| × |A|, we do a full update. Typical values of m are 5–50.

Dyna-Q intuition: Think of the m random updates as "daydreaming." After taking an action, the agent mentally replays m random (s,a) pairs from its model, updating value estimates as if those transitions had just occurred. These imaginary experiences propagate value information throughout the state space without any additional real-world interaction.

Here is a concrete Dyna-Q worked example with m = 5 random backups. We have a 3-state chain (s₀ → s₁ → s₂). The goal s₂ gives reward +1. All actions move right with prob 0.9. After real step (s₀, right, +0, s₁), we perform 5 random backups:

python
# Dyna-Q: 3-state chain, gamma=0.9
# After real step (s0, right, 0, s1):
U = {0: 0.0, 1: 0.0, 2: 1.0}   # s2 just discovered, reward=1
N = {(0,0): 5, (1,0): 3}         # visit counts so far
model_T = {(0,0): {1: 1.0},        # s0->s1 every time
           (1,0): {2: 0.9, 1: 0.1}}  # s1->s2 with prob 0.9
model_R = {(0,0): 0.0, (1,0): 0.0}  # no reward until s2

# Real step: update (s1, right) Bellman backup
U[1] = 0.0 + 0.9 * (0.9*U[2] + 0.1*U[1]) # = 0.9*(0.9+0) = 0.81
print(f"After real step: U[1] = {U[1]:.3f}")   # U[1] = 0.81

# m=5 random backups from model
import random
for _ in range(5):
    sa = random.choice(list(model_T.keys()))  # random (s,a) pair
    s, a = sa
    r = model_R[sa]
    future = sum(p*U[sp] for sp,p in model_T[sa].items())
    U[s] = r + 0.9 * future

# After random backups, U[0] gets updated whenever (s0,a) is sampled:
# U[0] = 0 + 0.9 * 1.0 * U[1] = 0.9 * 0.81 = 0.729
print(f"After backups: U[0] = {U[0]:.3f}, U[1] = {U[1]:.3f}")
# Output: U[0] = 0.729 (value propagated 2 steps in 1 real step + 5 imaginary)

What m controls: With m = 0 (just the real step), only U[s₁] gets updated per real step. With m = 5, U[s₀] also gets updated if (s₀, right) is sampled in the random backups — propagating value 2 steps in one real step. The optimal value U[s₀] = γ² = 0.81 is reached much faster with higher m, at the cost of more computation per step.

In the randomized update scheme, what happens when m = 0 (no random updates beyond the visited state)?

Value information only propagates at visited states, slowing down convergence compared to full updates The algorithm converges faster because there are fewer updates It is equivalent to full update because the visited state is always updated

Chapter 3: Prioritized Sweeping

Random updates treat all states equally. But when we discover a large change in the value of some state (e.g., we finally reach the goal and find it has a reward of +100), only a few states are directly affected: those that can reach the changed state. Prioritized sweeping focuses work exactly there.

The priority of a predecessor state-action pair (s⁻, a⁻) is proportional to the expected magnitude of its value change:

priority(s⁻, a⁻) = T̃(s | s⁻, a⁻) · |U_new(s) − U_old(s)|

The larger the change in U(s) and the more likely the transition from (s⁻, a⁻) to s, the higher the priority. We maintain a priority queue ordered by this score and always update the highest-priority state next.

A concrete worked example. 3-state chain s₀ → s₁ → s₂. We've just discovered s₂ gives reward +10, changing U(s₂) from 0 to 10. How does this propagate?

Step	State updated	Why	ΔU	Priority pushed
1 (real step)	s₂	Just visited, reward discovered	0 → 10	Push s₁ with priority T(s₂\|s₁,right)×\|ΔU\| = 0.9×10 = 9
2 (queue pop)	s₁	Highest priority in queue	0 → 8.91	Push s₀ with priority 0.9×8.91 ≈ 8.02
3 (queue pop)	s₀	Next in queue	0 → 7.20	No predecessors of s₀

In 3 updates (1 real + 2 queue), value has propagated from goal all the way back to s₀. Random updates would have needed many more steps to accomplish the same propagation.

Step 1: Take a real step

Observe (s, a, r, s'). Update model. Compute |ΔU(s')| and enqueue predecessors of s'.

↓

Step 2: Pop highest-priority state

Perform Bellman backup. Update U(s). Enqueue all predecessors of s with new priorities.

↓

Step 3: Repeat k times

Continue popping and updating until queue is empty or k steps are reached

↻ return to real environment interaction

Why prioritized sweeping works: Value changes in reinforcement learning propagate backward through the state space. When the goal is discovered, predecessors of the goal get high values, then their predecessors, and so on. Prioritized sweeping follows this propagation wave efficiently, updating only the states in the "blast radius" of the discovery.

Here is a Python implementation that highlights the priority queue structure. Notice how the same model data structure from the maximum likelihood chapter is reused — prioritized sweeping is a drop-in improvement to the update scheme:

python
import heapq

class PrioritizedSweepingAgent:
    def __init__(self, n_states, n_actions, gamma=0.9, threshold=0.01, k=10):
        self.n_states = n_states
        self.n_actions = n_actions
        self.gamma = gamma
        self.threshold = threshold  # min priority to enter queue
        self.k = k                  # max updates per real step
        self.N = {}                 # {(s,a,s'): count}
        self.rho = {}               # {(s,a): cumulative reward}
        self.U = [0.0] * n_states  # value function
        self.predecessors = {s: [] for s in range(n_states)}  # s -> [(s-, a-)]
        self.pq = []                # max-heap (negate priority for heapq)

    def update_model(self, s, a, r, sp):
        key = (s, a, sp)
        self.N[key] = self.N.get(key, 0) + 1
        self.rho[(s,a)] = self.rho.get((s,a), 0.0) + r
        if (s, a) not in self.predecessors.get(sp, []):
            self.predecessors.setdefault(sp, []).append((s, a))
        # Compute Bellman error at sp
        new_U_sp = self._bellman(sp)
        delta = abs(new_U_sp - self.U[sp])
        self.U[sp] = new_U_sp
        # Push predecessors with their expected priority
        for (ps, pa) in self.predecessors.get(sp, []):
            cnt = sum(self.N.get((ps,pa,spp),0) for spp in range(self.n_states))
            p = (self.N.get((ps,pa,sp),0)/cnt if cnt>0 else 0) * delta
            if p > self.threshold:
                heapq.heappush(self.pq, (-p, ps, pa))
        # k synchronous updates from priority queue
        for _ in range(self.k):
            if not self.pq: break
            _, qs, _ = heapq.heappop(self.pq)
            self.U[qs] = self._bellman(qs)

    def _bellman(self, s):
        best = -1e9
        for a in range(self.n_actions):
            cnt = sum(self.N.get((s,a,sp),0) for sp in range(self.n_states))
            if cnt == 0: continue
            r = self.rho.get((s,a),0)/cnt
            future = sum(self.N.get((s,a,sp),0)/cnt*self.U[sp]
                         for sp in range(self.n_states))
            best = max(best, r + self.gamma*future)
        return best if best > -1e9 else 0.0

Prioritized Sweeping vs. Random Updates

A 5×5 grid. The goal (bottom-right, yellow) gives +10 reward. The value function V(s) starts at 0. Compare how quickly value propagates backward from the goal under each scheme. Brighter = higher value.

Updates: 0

In prioritized sweeping, a state s⁻ gets high priority when updated. What two factors determine its priority?

Its distance from the agent's current position and its reward Its age in the queue and its state index The transition probability to the changed successor and the magnitude of the successor's value change

Chapter 4: Exploration in Full MDPs

The exploration strategies from Chapter 15 (bandit algorithms) can be adapted to full MDPs, but with an important caveat: in a multi-state MDP, exploration is not just about trying unknown actions from the current state. It also means navigating to unexplored parts of the state space.

The simplest approach is ε-greedy: with probability ε, take a random action; otherwise, act greedily with respect to the current estimated value function. This provides local exploration but doesn't deliberately navigate to underexplored regions.

The key insight: A bandit algorithm asks "which arm should I pull now?" An MDP explorer asks "which state-action pair is most worth visiting, and how do I get there?" These are fundamentally different questions. The best exploration strategies in MDPs plan a path to unexplored regions, not just try random things at the current state.

More sophisticated approaches provide Probably Approximately Correct (PAC) guarantees: after at most some polynomial number of steps, the agent's policy is ε-optimal with probability at least 1−δ. Such bounds give rigorous sample complexity guarantees.

The PAC-MDP framework formalizes what it means to "learn efficiently." For an MDP with |S| states, |A| actions, discount γ, and desired accuracy ε with confidence 1−δ, the R-MAX algorithm achieves ε-optimality after at most:

O( |S|²|A| / ((1-γ)⁶ε⁴) · log(|S||A|/δ) )

steps. While this polynomial bound looks large, it is dramatically better than exponential-time exploration strategies. Key takeaway: the bound scales polynomially in all relevant parameters. No algorithm can do better than polynomial in the worst case (information-theoretically), so PAC-MDP algorithms like R-MAX are optimal in this sense.

To put the bound in context: with |S|=10, |A|=4, γ=0.9, ε=0.1, δ=0.05, R-MAX is guaranteed to find an 0.1-optimal policy after at most ~10⁶ steps. This sounds large, but it's a worst-case guarantee. On typical problems, convergence happens orders of magnitude faster — as the showcase simulation in Chapter 9 demonstrates. The PAC bound matters most for safety-critical applications where you need a provable guarantee on worst-case performance, not just average-case empirical results.

Regret vs. PAC: R-MAX satisfies a PAC-MDP bound (after K steps, the policy is ε-optimal). PSRL satisfies a Bayesian regret bound (the expected cumulative suboptimality over T steps is O(√T)). These are different notions of learning quality. The PAC bound is more conservative (applies to each individual run); the regret bound averages over random instances and gives a tighter characterization of typical behavior. For domains where we care about per-episode performance from the very first episode, regret bounds are more relevant.

When to use each method in practice:

Small MDP, safety-critical: R-MAX. PAC-MDP guarantee means a provable bound on how badly things go during learning.
Small MDP, sample efficiency matters: PSRL. Near-optimal regret, no exploration hyperparameter to tune.
Large MDP, limited computation: ML + randomized updates (Dyna-Q style). Simple to implement and tune; good empirical performance.
Large MDP, good planning available: ML + prioritized sweeping. Same model quality as ML, faster value propagation.

Strategy	Mechanism	Limitation
ε-Greedy	Random action with prob ε	Explores locally, may not reach distant underexplored states
Optimistic Init	Initialize Q(s,a) = Q_max	No principled bound on sample complexity
UCB-style (per state)	Bonus ~ 1/√N(s,a)	Does not account for navigating to new states
R-MAX (next chapter)	Inflated rewards for unknown state-action pairs	Conservative, but has PAC guarantee

Why is applying bandit exploration strategies independently at each state insufficient for MDP exploration?

They only reason about the current state's unknown actions, not about navigating to completely unvisited regions of the state space Bandit algorithms cannot handle more than two actions They require knowing the reward function in advance

Chapter 5: R-MAX — Optimism in the Face of Uncertainty

R-MAX solves the exploration problem by turning it into a planning problem. The key idea: if we have visited (s,a) fewer than m times, we don't know its true value. So we pretend it's the best possible state-action pair — we assign it the maximum possible reward r_max and model it as a self-loop (it returns to s).

R̃(s,a) = r_max if N(s,a) < m, else ρ(s,a)/N(s,a)

T̃(s'|s,a) = 1[s'=s] if N(s,a) < m, else N(s,a,s')/N(s,a)

With these inflated rewards, the value of an unknown (s,a) pair is r_max/(1−γ), which dominates any achievable value in the real environment. So the greedy policy (with no separate exploration mechanism) will naturally navigate to unexplored state-action pairs first.

The elegance of R-MAX: There is no separate "exploration strategy." Exploration emerges from greedy behavior with respect to the optimistic model. The agent is not randomly wandering — it is efficiently planning the path to the most valuable unknown state-action pair. Once a pair has been visited m times, its real estimate is used and it loses its artificial appeal.

R-MAX has a provable PAC-MDP guarantee: after a polynomial number of steps in |S|, |A|, 1/ε, 1/δ, and 1/(1−γ), the agent's policy is ε-optimal with probability at least 1−δ. This is a remarkably strong guarantee compared to heuristic exploration.

A concrete R-MAX worked example. 4-state corridor: s₀→s₁→s₂→s₃(goal, +100). Deterministic transitions. γ=0.9, r_max=100, m=3. Agent starts at s₀.

Step	State	Action (greedy)	N(s,right)	R̃(s,right)	Reason
1–3	s₀	right	0→3	100 (optimistic)	N<m: unknown, optimistic reward appeals to greedy
4–6	s₁	right	0→3	100 (optimistic)	s₀ now "known"; s₁ still optimistic
7–9	s₂	right	0→3	100 (optimistic)	Systematically exploring the corridor
10	s₃	—	—	True +100	Goal reached! Now R̃ real, plan toward s₃
11+	s₀	right (exploit)	N≥m all	Real estimates	Systematic exploitation to goal

Total steps to reach the goal: ~12 (3 per state × 4 states, rounded up). Compare to ε=0.15 greedy: with probability 0.85³≈0.61, the agent walks straight to s₃ without exploring side transitions. But with probability 0.39, it wastes steps taking random left actions that don't advance. R-MAX reaches the goal deterministically in O(|S|×m) steps — no randomness required.

python
class RMAXAgent:
    def __init__(self, n_states, n_actions, r_max, m, gamma):
        self.n_states = n_states
        self.n_actions = n_actions
        self.r_max = r_max      # maximum possible reward
        self.m = m              # threshold: "known" after m visits
        self.gamma = gamma
        self.N = np.zeros((n_states, n_actions, n_states))
        self.rho = np.zeros((n_states, n_actions))
        self.U = np.full(n_states, r_max / (1 - gamma))  # optimistic init

    def T_hat(self, s, a):
        total = self.N[s, a].sum()
        if total < self.m:
            # Unknown: self-loop (agent "stays" in s with reward r_max)
            T = np.zeros(self.n_states)
            T[s] = 1.0
            return T
        return self.N[s, a] / total

    def R_hat(self, s, a):
        total = self.N[s, a].sum()
        if total < self.m:
            return self.r_max  # optimistic reward for unknown (s,a)
        return self.rho[s, a] / total

    def act(self, s):
        # Greedy w.r.t. optimistic model — no separate exploration needed
        Q = [self.R_hat(s, a) + self.gamma * self.T_hat(s, a) @ self.U
             for a in range(self.n_actions)]
        return np.argmax(Q)

    def update(self, s, a, r, s_next):
        self.N[s, a, s_next] += 1
        self.rho[s, a] += r
        # Re-solve the optimistic MDP (here: one Bellman update)
        Q = [self.R_hat(s, a) + self.gamma * self.T_hat(s, a) @ self.U
             for a in range(self.n_actions)]
        self.U[s] = max(Q)

R-MAX vs. ε-Greedy: Exploration Coverage

A 5×5 grid world with goal at bottom-right. Run each explorer for 200 steps and compare coverage. Brighter = more visits. R-MAX systematically covers the grid; ε-greedy wanders.

m (R-MAX threshold) 3

Why does R-MAX not need a separate exploration strategy on top of greedy action selection?

It initializes all Q-values to zero Unknown state-action pairs are given the maximum possible value, making them always preferred by the greedy policy It samples actions randomly for the first T steps

Chapter 6: Bayesian Model Learning

Maximum likelihood gives us a point estimate for the model. But a point estimate discards crucial information: how uncertain we are. After 2 observations, T̃ = 0.5 is very uncertain. After 2000 observations, T̃ = 0.5 is very reliable. The ML estimate looks the same; our uncertainty doesn't.

Bayesian model learning maintains a full posterior distribution over model parameters. For discrete MDPs, each state-action pair's transition probabilities are a categorical distribution, and its Bayesian conjugate prior is a Dirichlet distribution:

(θ_s,a) ~ Dir(N(s,a,1)+α₀, N(s,a,2)+α₀, …, N(s,a,|S|)+α₀)

where α₀ is the prior pseudocount (typically 1 for a uniform prior). After observing a transition from (s,a) to s', the update is beautifully simple:

N(s,a,s') ← N(s,a,s') + 1

That's it. The Dirichlet is its own conjugate prior: a Dirichlet prior plus multinomial observations gives a Dirichlet posterior. The posterior mean — the expected value of T(s'|s,a) — is:

E[T(s'|s,a)] = N(s,a,s') / N(s,a) where N(s,a) = ∑_s'' N(s,a,s'')

For the complete MDP, the full belief is the product of |S|·|A| independent Dirichlet distributions (one per state-action pair). Uncertainty is encoded in the concentration of the Dirichlet: low counts mean a flat, spread distribution (high uncertainty); high counts mean a concentrated distribution (low uncertainty).

Example from the text: Starting with uniform priors (all counts = 1), after observing 5 transitions from (s₁,a₁) — 1 self-transition and 4 transitions to s₃ — the posterior is Dir([2,1,5]). The mean is [2/8, 1/8, 5/8] = [0.25, 0.125, 0.625]. Compare to the ML estimate from 5 samples: [0.2, 0, 0.8]. The Bayesian posterior never assigns zero probability to unseen successors, thanks to the prior pseudocounts.

The variance of the Dirichlet posterior gives a direct measure of how uncertain we are about each transition probability. For Dir([α₁, α₂, ..., α_k]) with sum α₀ = ∑α_i:

Var[T(s'|s,a)] = α_s'(α₀−α_s') / (α₀²(α₀+1))

With only 1 observation each (uniform prior, N=3 states): α₀=3, Var = 1×2/(9×4) = 2/36 ≈ 0.056. After 100 observations (50-30-20 split): α₀=103, Var(T(s'=1)) = 51×52/(103²×104) ≈ 0.0024. The variance shrinks by a factor of ~23 with 100 more observations. This variance is what PSRL exploits — high variance Dirichlets produce diverse samples, driving exploration.

python
import numpy as np
from scipy.stats import dirichlet

class BayesianModel:
    def __init__(self, n_states, n_actions, alpha0=1.0):
        # Dirichlet prior counts: N[s, a, s'] starts at alpha0
        self.N = np.full((n_states, n_actions, n_states), alpha0)
        self.rho = np.zeros((n_states, n_actions))  # reward accumulator
        self.n_states = n_states
        self.n_actions = n_actions

    def update(self, s, a, r, s_next):
        # Bayesian update = just increment the count (conjugate prior!)
        self.N[s, a, s_next] += 1
        self.rho[s, a] += r

    def sample_model(self):
        # Sample T[s,a] ~ Dirichlet(N[s,a]) for each (s,a)
        T_sample = np.zeros_like(self.N)
        for s in range(self.n_states):
            for a in range(self.n_actions):
                T_sample[s, a] = dirichlet.rvs(self.N[s, a])[0]
        return T_sample

    def posterior_mean(self):
        # E[T(s'|s,a)] = N[s,a,s'] / sum_s'' N[s,a,s'']
        counts_sum = self.N.sum(axis=-1, keepdims=True)
        return self.N / counts_sum

# Posterior Sampling RL (PSRL) main loop
def psrl(env, n_episodes, gamma=0.9):
    model = BayesianModel(env.n_states, env.n_actions)
    for episode in range(n_episodes):
        # Step 1: Sample a complete MDP from the posterior
        T_sample = model.sample_model()
        R_hat = model.rho / model.N.sum(axis=-1)  # mean reward estimate

        # Step 2: Solve the sampled MDP
        U = np.zeros(env.n_states)
        for _ in range(100):  # value iteration
            Q = R_hat + gamma * np.einsum('san,n->sa', T_sample, U)
            U = Q.max(axis=1)
        pi = Q.argmax(axis=1)

        # Step 3: Act according to the sampled policy for one episode
        s = env.reset()
        while not done:
            a = pi[s]
            s_next, r, done = env.step(a)
            # Step 4: Update Dirichlet posteriors
            model.update(s, a, r, s_next)
            s = s_next

Dirichlet Posterior for a 3-State Transition

Adjust transition counts for three successor states. The triangle shows the Dirichlet distribution over (T(s₁), T(s₂), T(s₃)). Orange dots are samples; yellow dot is the posterior mean. More data = tighter cluster.

Count N(s,a,s₁) 2

Count N(s,a,s₂) 1

Count N(s,a,s₃) 5

Bayesian vs. ML summary: Bayesian methods never assign zero probability to unseen transitions (the prior prevents it). They encode uncertainty through the spread of the Dirichlet. ML methods assign zero to unseen transitions and are undefined for never-visited state-action pairs. Both converge to the truth as data accumulates, but Bayesian methods make more principled use of small-data information.

Here is a concrete numeric comparison. True T(s'=1|s=0,a=0) = 0.3. We observe 5 transitions from (s=0,a=0): results are [0,0,0,1,0] (4 zeros, 1 one). Compare estimates:

Method	P(s'=1\|s=0,a=0)	P(s'=0\|s=0,a=0)	Uncertainty quantified?
ML (5 obs)	1/5 = 0.200	4/5 = 0.800	No — point estimate only
Bayes, prior α₀=1 (5 obs)	2/7 ≈ 0.286	5/7 ≈ 0.714	Yes — Dir([5,2]) spread captures uncertainty
Bayes, prior α₀=1 (0 obs)	1/2 = 0.500	1/2 = 0.500	Yes — maximum uncertainty
True value	0.300	0.700	—

The Bayesian estimate (0.286) is slightly closer to the truth (0.300) than the ML estimate (0.200) with only 5 samples, because the prior "regularizes" the estimate. More importantly, the Dirichlet Dir([5,2]) encodes our uncertainty: the 95% credible interval for P(s'=1) is approximately [0.08, 0.60] — we know our estimate is unreliable. The ML estimate of 0.200 comes with no such warning.

After observing N(s,a,s') = 0 transitions to s', what does the Bayesian posterior predict for T(s'|s,a) (with a uniform Dirichlet prior with pseudocount α₀ = 1)?

Exactly 0 (since we've never seen it) 1/|S| + small correction A small positive value 1/(|S|+N(s,a)), because the prior pseudocount prevents zero probability

Chapter 7: Bayes-Adaptive MDPs

The Bayesian posterior over models opens a profound question: if our state of knowledge about the model is captured by a distribution, why not plan optimally over that distribution? This is the idea behind Bayes-Adaptive MDPs (BAMDPs).

A BAMDP augments the original state space with the current model belief. The hyperstate is the pair (s, φ) where s is the MDP state and φ is the belief over model parameters (the vector of Dirichlet count vectors). Actions change both the underlying state (via the true transition) and the belief (by revealing a new observation).

Original MDP state s

What state are we in? (observable)

Belief over model φ

What do we know about T and R? (all Dirichlet count vectors)

↓ combined into

Hyperstate (s, φ)

The BAMDP state. Policy: π(s, φ) → action

Solving the BAMDP exactly yields the Bayes-optimal policy — the policy that optimally trades off exploration (reducing φ uncertainty) and exploitation (maximizing immediate reward). This is the theoretical gold standard for model-based RL.

The curse of the BAMDP: The hyperstate space is enormous. Each Dirichlet has |S| parameters, and there are |S|×|A| Dirichlet distributions, giving a total of |S|²|A| continuous parameters in φ. For a modest MDP with 10 states and 4 actions, φ has 400 dimensions. The BAMDP is solvable exactly only for tiny toy problems.

Despite intractability in general, BAMDPs provide a theoretical framework. Approximate methods (like posterior sampling in the next chapter) can be understood as tractable approximations to the Bayes-optimal BAMDP solution.

The BAMDP horizon trick: For episodic tasks, BAMDPs with finite horizon H can sometimes be solved exactly by backward induction. The key is that φ only changes when new transitions are observed, and for a T-step episode we only observe T transitions. For small MDPs (e.g., 2 states, 2 actions) the hyperstate space has |S|²|A| = 8 Dirichlet parameters, each taking integer values bounded by T — making the problem finite but still exponentially large in T.

A worked example with numbers. Consider a tiny 2-state, 2-action MDP. True T = [[0.7,0.3],[0.4,0.6]]. We have a Dirichlet prior and after 3 observations of action a=0 from state s=0, we saw [s'=0 twice, s'=1 once]. The hyperstate φ(s=0,a=0) = [3, 2] (prior + counts). Bayes-optimal planning in the BAMDP accounts for the information value of taking action a=0 at s=0: learning that it transitions to s'=1 changes future planning. This is the exploration bonus the Bayes-optimal policy exploits automatically.

Observation	N(s=0,a=0,s'=0)	N(s=0,a=0,s'=1)	E[T(s'=1\|s=0,a=0)]	95% CI
Prior only	1	1	0.50	[0.03, 0.97]
3 obs (2x s'=0, 1x s'=1)	3	2	0.40	[0.12, 0.74]
10 obs (7x s'=0, 3x s'=1)	8	4	0.33	[0.15, 0.56]
100 obs (70x s'=0, 30x s'=1)	71	31	0.30	[0.22, 0.39]

What is the "hyperstate" in a Bayes-Adaptive MDP?

The state with the highest value in the estimated MDP A compressed summary of the transition history The pair (s, φ): the current MDP state augmented with the full belief distribution over model parameters

Chapter 8: Posterior Sampling for RL (PSRL)

The BAMDP is intractable. But there is an elegant practical approximation: posterior sampling RL (PSRL), also called Thompson sampling for MDPs. The algorithm is surprisingly simple:

1. Sample a model

Draw T̃ ~ Dir(N(s,a,·)) independently for each (s,a). Draw R̃ from its posterior.

↓

2. Solve the sampled MDP

Run value iteration or LP on (S, A, T̃, R̃, γ) to get policy π

↓

3. Act for one episode

Follow π until the episode ends, collecting transitions

↓

4. Update belief

Increment Dirichlet counts N(s,a,s') for each observed transition. Return to step 1.

↻

Why does this work? When the Dirichlet posterior is diffuse (high uncertainty), samples vary widely across MDPs. Different samples lead to different policies, driving diverse exploration. As the posterior concentrates (more data), samples converge to the true MDP, and the policy stabilizes on the optimal behavior.

From the text: Posterior sampling produces models with significantly more non-zero transition probabilities than ML models, because sampling from the Dirichlet spreads probability mass. In the hex world example, the sampled model has many more non-zero transitions than the ML model, leading to more robust planning under uncertainty.

PSRL has strong theoretical guarantees: its Bayesian regret (expected cumulative suboptimality relative to the Bayes-optimal BAMDP policy) is near-optimal. Empirically, it often matches or exceeds more complex model-based algorithms while being simple to implement.

Here is the critical intuition with numbers. Suppose we have a 2-state MDP and are currently uncertain whether s₁ is the high-reward state or s₂. Our Dirichlet posterior concentrates roughly 50/50 on each. In a single episode of PSRL:

With ~50% probability, the sampled T suggests s₁ is better → agent heads to s₁.
With ~50% probability, the sampled T suggests s₂ is better → agent heads to s₂.

Both paths provide information. The true MDP is explored at exactly the rate our uncertainty warrants. As data accumulates, the posterior sharpens, and sampling becomes more consistent, steering the agent toward the true optimum. Compare this to UCB-style bonuses which require careful manual tuning of the exploration coefficient.

Metric	ε-Greedy	R-MAX	PSRL
Exploration mechanism	Random actions (prob ε)	Optimistic rewards	Sampled MDPs
Hyperparameters	ε (manually tuned)	m, r_max	α₀ (prior strength)
Sample complexity bound	None (heuristic)	PAC-MDP (polynomial)	Near-optimal Bayes regret
Handles stochastic rewards?	Yes	Yes (with extension)	Yes (Bayesian prior on R)
Implementation complexity	Very low	Low	Low (Dirichlet + VI)

The exploration-exploitation miracle: Notice that PSRL never explicitly adds an exploration bonus or uses ε-greedy. Exploration emerges naturally: an uncertain posterior produces diverse samples, each leading to different behavior. There is no hyperparameter controlling the exploration-exploitation tradeoff — the posterior handles it automatically.

How does PSRL (posterior sampling RL) handle the exploration-exploitation tradeoff?

Uncertain posteriors produce diverse sampled MDPs leading to varied exploration; confident posteriors produce consistent MDPs enabling exploitation — no explicit tradeoff parameter needed It uses ε-greedy with the sampled policy It alternates between pure exploration and pure exploitation episodes

Chapter 9: Showcase — Model-Based RL in a Grid World

Watch model-based RL in action on a 5×5 grid world. The agent starts at the top-left and must reach the goal (bottom-right, +10 reward). Each step costs −0.1. The transition model is unknown — the agent must learn it from scratch.

Choose between ML + random updates, ML + prioritized sweeping, and posterior sampling. Watch the model being learned, the value function evolving, and the agent's path improve over episodes.

Full Model-Based RL Simulation

Left: value function V(s) (brightness = value). Right: visit count heatmap. The agent (orange dot) explores and learns. Goal = yellow star.

Algorithm

Episode speed Medium

Ep: 0 | Steps: 0 | Reward: 0

DP on hyperstate

Method	Model Type	Update Scheme	Exploration	Guarantee
ML + Full Update	Point estimate	Re-solve entire MDP	Separate (ε-greedy)	None
ML + Randomized	Point estimate	Dyna-style	Separate (ε-greedy)	None
ML + Prioritized	Point estimate	Value wave propagation	Separate (ε-greedy)	None
R-MAX	Optimistic point est.	Any	Built-in (optimism)	PAC-MDP
BAMDP (exact)	Dirichlet posterior	Optimal (intractable)	Bayes-optimal
PSRL	Dirichlet posterior	Sample + solve	Automatic via sampling	Near-optimal regret

Looking ahead: Chapter 17 takes the opposite approach — model-free methods like Q-learning and SARSA learn value functions directly from (s,a,r,s') tuples without ever estimating T or R. Model-free methods have lower per-step cost but typically require more experience to converge.

Benchmarking the four methods in the 5×5 grid world simulation above reveals the following typical convergence characteristics (these are empirical observations, not theoretical bounds):

Method	Episodes to convergence	Final avg reward/episode	Notes
ML + Random (m=10)	~80–120	−0.5 to 0.2	Simple, fast per step, slower convergence
ML + Prioritized	~50–80	−0.2 to 0.5	Faster propagation, same model quality
R-MAX (m=3)	~30–50	0.1 to 0.8	Systematic exploration pays off early
PSRL	~40–70	0.0 to 0.6	Automatic exploration; variance between runs is higher

Sample complexity intuition: Why does PSRL need fewer samples than ε-greedy? Consider a 4-state MDP where one state-action pair has reward 100 but is hard to reach. ε-greedy wastes many random actions exploring states we've already characterized well. PSRL automatically focuses on undercharacterized state-action pairs: uncertain Dirichlet posteriors sample diverse models, and some of those sampled models will predict high value for the underexplored state-action pair, driving the policy toward it efficiently. This is the Thompson sampling principle: exploration proportional to uncertainty.

Practical model selection guide. The choice among these methods often comes down to three practical considerations: (1) how long does the agent have to explore before exploitation matters, (2) is there a budget on compute per step, and (3) do you need formal guarantees or are empirical convergence properties sufficient?

Consideration	Choose	Rationale
Hard time budget (must converge in N steps)	R-MAX	PAC guarantee: polynomial sample complexity regardless of initial conditions
Unknown state space, partial observability	PSRL	Bayesian exploration adapts naturally; no need to prespecify m threshold
Small discrete MDP, fast compute per step	Prioritized sweeping + ε-greedy	Simple, interpretable, low overhead per step
Reward shaping possible (domain knowledge)	ML + Dyna-Q	Shaped rewards guide exploration; model-based planning amplifies each real step
Continuous state with known parametric structure	Gaussian process model-based RL	Beyond this chapter; but Bayesian principles from PSRL extend naturally

In terms of sample complexity, which guarantee is strongest?

ε-greedy, because it explores most states ML + randomized updates, because it uses all collected data R-MAX, which provides a polynomial PAC-MDP bound; PSRL has a near-optimal Bayesian regret bound but not the same PAC-MDP guarantee

← Ch 15: Exploration Ch 17: Model-Free Methods →