Build a map of the world from scratch. Learn the transition and reward functions through interaction, then plan with them. The first complete approach to reinforcement learning in MDPs.
Imagine you wake up in a strange building. You know you can walk left, right, up, or down. You know some rooms reward you with candy, others with pain. But you have no idea which rooms connect to which, or where the candy is. What do you do?
In a standard MDP, the transition function T(s'|s,a) and reward function R(s,a) are given. Value iteration can then compute the optimal policy exactly. But in reinforcement learning, you don't know T or R. You have to discover them by acting.
Model-based methods address this by learning explicit estimates T̃(s'|s,a) and R̃(s,a) from experience. Once learned, any MDP solver can derive a policy. The cycle is: act, observe, update model, re-plan, repeat.
This chapter covers the full spectrum from maximum likelihood models (count transitions, take the mode) through Bayesian methods (maintain full posteriors over model parameters) to posterior sampling (an elegant approximation that naturally balances exploration and exploitation).
The simplest and most intuitive approach: count what you see, then normalize. We maintain a transition count N(s,a,s') that records how many times we transitioned from state s to s' after taking action a. The maximum likelihood estimate is:
If N(s,a) = 0 (we've never tried action a from state s), we have no data. The standard fix is to initialize with a small pseudocount (e.g., N(s,a,s') = 1/|S| for all s') or to default to a uniform distribution over successor states. An alternative: defer computing T̃ for never-visited pairs and treat them as "unknown" — which is exactly the approach R-MAX formalizes in Chapter 5.
A key property: T̃(s'|s,a) is the maximum likelihood estimator of the categorical distribution over successor states. This estimator is unbiased (its expectation equals the true probability) and consistent (it converges to the true probability as N(s,a) → ∞). However, with finite data, it can be severely wrong — especially for transitions with small probability. A transition that truly occurs 1% of the time requires ~100 visits to estimate reliably.
Similarly, we accumulate total reward ρ(s,a) and estimate:
This yields an estimated MDP (S, A, T̃, R̃, γ) that we can hand to any planning algorithm. As we collect more experience, the estimates converge to the true model.
Comparing maximum likelihood to Bayesian estimation highlights a key engineering tradeoff. ML is simpler and has lower computational overhead: just increment a counter and divide. Bayesian estimation requires maintaining a full Dirichlet posterior and sampling from it. However, Bayesian estimation provides several advantages that often justify the overhead in practice: it naturally encodes uncertainty, prevents zero-probability estimates for unseen transitions, and enables principled posterior sampling for exploration.
The convergence rate of the ML estimate to the true model depends on the mixing properties of the MDP. For an ergodic Markov chain (every state is reachable from every other state under the optimal policy), the ML estimate T̃ converges at rate O(1/√N(s,a)) for each state-action pair — the standard concentration rate for multinomial distributions. For non-ergodic chains, some state-action pairs may never be visited under the greedy policy, requiring deliberate exploration to estimate them.
| Experience | N(s,a,s') | T̃(s'|s,a) | Accuracy |
|---|---|---|---|
| 1 transition | 1 entry = 1 | 1.0 for observed s', 0 elsewhere | Very poor |
| 10 transitions | Sparse counts | Only seen transitions non-zero | Poor |
| 100 transitions | Most entries filled | Rough approximation | Moderate |
| 1000+ transitions | Dense counts | Close to true probabilities | Good |
A 4-state MDP with 2 actions. The left heatmap shows the learned T̃(s'|s,a=0); the right heatmap shows the true T. Watch them converge as you accumulate experience. Rows=current state s, Columns=next state s'.
Here is the complete data structure for a maximum likelihood model, and how to update it:
python import numpy as np class MaxLikelihoodModel: def __init__(self, n_states, n_actions): self.n_states = n_states self.n_actions = n_actions # N[s, a, s'] = transition count from s to s' via action a self.N = np.zeros((n_states, n_actions, n_states)) # rho[s, a] = cumulative reward from (s, a) self.rho = np.zeros((n_states, n_actions)) def update(self, s, a, r, s_next): # One new (s, a, r, s') transition observed self.N[s, a, s_next] += 1 self.rho[s, a] += r def T_hat(self, s, a): # Estimated T(s'|s,a) as probability vector total = self.N[s, a].sum() if total == 0: return np.ones(self.n_states) / self.n_states # uniform prior return self.N[s, a] / total def R_hat(self, s, a): # Estimated R(s,a) total = self.N[s, a].sum() if total == 0: return 0.0 return self.rho[s, a] / total def value_iteration(self, gamma=0.95, tol=1e-6): # Solve estimated MDP to get value function and policy U = np.zeros(self.n_states) while True: U_new = np.zeros(self.n_states) for s in range(self.n_states): Q_sa = [self.R_hat(s, a) + gamma * self.T_hat(s, a) @ U for a in range(self.n_actions)] U_new[s] = max(Q_sa) if np.max(np.abs(U_new - U)) < tol: break U = U_new pi = [np.argmax([self.R_hat(s, a) + gamma * self.T_hat(s, a) @ U for a in range(self.n_actions)]) for s in range(self.n_states)] return U, pi
Every time we observe a new transition, our model changes. But re-solving the MDP from scratch is expensive. How often should we re-plan, and how should we do it?
Three schemes cover the spectrum from costly-but-accurate to cheap-but-approximate:
| Scheme | Idea | Cost per step | Convergence |
|---|---|---|---|
| Full Update | Re-solve the entire MDP after every new transition using value iteration or an LP | O(|S|2|A|) × iterations | Always current, optimal policy |
| Randomized Update | Bellman backup at the visited state, plus m randomly chosen additional states | O(m · |S| · |A|) | Slower propagation, cheap |
| Prioritized Sweeping | Update the states most affected by the new information, using a priority queue | O(k · |S| · |A|) for k high-priority states | Fast propagation on affected states |
Randomized updates (the basis of Sutton's Dyna architecture) perform a Bellman update at the visited state plus m additional randomly chosen states. Each Bellman update is:
This spreads value information without the cost of a full solve. When m = 0, we only update the visited state. When m = |S| × |A|, we do a full update. Typical values of m are 5–50.
Here is a concrete Dyna-Q worked example with m = 5 random backups. We have a 3-state chain (s0 → s1 → s2). The goal s2 gives reward +1. All actions move right with prob 0.9. After real step (s0, right, +0, s1), we perform 5 random backups:
python # Dyna-Q: 3-state chain, gamma=0.9 # After real step (s0, right, 0, s1): U = {0: 0.0, 1: 0.0, 2: 1.0} # s2 just discovered, reward=1 N = {(0,0): 5, (1,0): 3} # visit counts so far model_T = {(0,0): {1: 1.0}, # s0->s1 every time (1,0): {2: 0.9, 1: 0.1}} # s1->s2 with prob 0.9 model_R = {(0,0): 0.0, (1,0): 0.0} # no reward until s2 # Real step: update (s1, right) Bellman backup U[1] = 0.0 + 0.9 * (0.9*U[2] + 0.1*U[1]) # = 0.9*(0.9+0) = 0.81 print(f"After real step: U[1] = {U[1]:.3f}") # U[1] = 0.81 # m=5 random backups from model import random for _ in range(5): sa = random.choice(list(model_T.keys())) # random (s,a) pair s, a = sa r = model_R[sa] future = sum(p*U[sp] for sp,p in model_T[sa].items()) U[s] = r + 0.9 * future # After random backups, U[0] gets updated whenever (s0,a) is sampled: # U[0] = 0 + 0.9 * 1.0 * U[1] = 0.9 * 0.81 = 0.729 print(f"After backups: U[0] = {U[0]:.3f}, U[1] = {U[1]:.3f}") # Output: U[0] = 0.729 (value propagated 2 steps in 1 real step + 5 imaginary)
Random updates treat all states equally. But when we discover a large change in the value of some state (e.g., we finally reach the goal and find it has a reward of +100), only a few states are directly affected: those that can reach the changed state. Prioritized sweeping focuses work exactly there.
The priority of a predecessor state-action pair (s−, a−) is proportional to the expected magnitude of its value change:
The larger the change in U(s) and the more likely the transition from (s−, a−) to s, the higher the priority. We maintain a priority queue ordered by this score and always update the highest-priority state next.
A concrete worked example. 3-state chain s0 → s1 → s2. We've just discovered s2 gives reward +10, changing U(s2) from 0 to 10. How does this propagate?
| Step | State updated | Why | ΔU | Priority pushed |
|---|---|---|---|---|
| 1 (real step) | s2 | Just visited, reward discovered | 0 → 10 | Push s1 with priority T(s2|s1,right)×|ΔU| = 0.9×10 = 9 |
| 2 (queue pop) | s1 | Highest priority in queue | 0 → 8.91 | Push s0 with priority 0.9×8.91 ≈ 8.02 |
| 3 (queue pop) | s0 | Next in queue | 0 → 7.20 | No predecessors of s0 |
In 3 updates (1 real + 2 queue), value has propagated from goal all the way back to s0. Random updates would have needed many more steps to accomplish the same propagation.
Here is a Python implementation that highlights the priority queue structure. Notice how the same model data structure from the maximum likelihood chapter is reused — prioritized sweeping is a drop-in improvement to the update scheme:
python import heapq class PrioritizedSweepingAgent: def __init__(self, n_states, n_actions, gamma=0.9, threshold=0.01, k=10): self.n_states = n_states self.n_actions = n_actions self.gamma = gamma self.threshold = threshold # min priority to enter queue self.k = k # max updates per real step self.N = {} # {(s,a,s'): count} self.rho = {} # {(s,a): cumulative reward} self.U = [0.0] * n_states # value function self.predecessors = {s: [] for s in range(n_states)} # s -> [(s-, a-)] self.pq = [] # max-heap (negate priority for heapq) def update_model(self, s, a, r, sp): key = (s, a, sp) self.N[key] = self.N.get(key, 0) + 1 self.rho[(s,a)] = self.rho.get((s,a), 0.0) + r if (s, a) not in self.predecessors.get(sp, []): self.predecessors.setdefault(sp, []).append((s, a)) # Compute Bellman error at sp new_U_sp = self._bellman(sp) delta = abs(new_U_sp - self.U[sp]) self.U[sp] = new_U_sp # Push predecessors with their expected priority for (ps, pa) in self.predecessors.get(sp, []): cnt = sum(self.N.get((ps,pa,spp),0) for spp in range(self.n_states)) p = (self.N.get((ps,pa,sp),0)/cnt if cnt>0 else 0) * delta if p > self.threshold: heapq.heappush(self.pq, (-p, ps, pa)) # k synchronous updates from priority queue for _ in range(self.k): if not self.pq: break _, qs, _ = heapq.heappop(self.pq) self.U[qs] = self._bellman(qs) def _bellman(self, s): best = -1e9 for a in range(self.n_actions): cnt = sum(self.N.get((s,a,sp),0) for sp in range(self.n_states)) if cnt == 0: continue r = self.rho.get((s,a),0)/cnt future = sum(self.N.get((s,a,sp),0)/cnt*self.U[sp] for sp in range(self.n_states)) best = max(best, r + self.gamma*future) return best if best > -1e9 else 0.0
A 5×5 grid. The goal (bottom-right, yellow) gives +10 reward. The value function V(s) starts at 0. Compare how quickly value propagates backward from the goal under each scheme. Brighter = higher value.
The exploration strategies from Chapter 15 (bandit algorithms) can be adapted to full MDPs, but with an important caveat: in a multi-state MDP, exploration is not just about trying unknown actions from the current state. It also means navigating to unexplored parts of the state space.
The simplest approach is ε-greedy: with probability ε, take a random action; otherwise, act greedily with respect to the current estimated value function. This provides local exploration but doesn't deliberately navigate to underexplored regions.
More sophisticated approaches provide Probably Approximately Correct (PAC) guarantees: after at most some polynomial number of steps, the agent's policy is ε-optimal with probability at least 1−δ. Such bounds give rigorous sample complexity guarantees.
The PAC-MDP framework formalizes what it means to "learn efficiently." For an MDP with |S| states, |A| actions, discount γ, and desired accuracy ε with confidence 1−δ, the R-MAX algorithm achieves ε-optimality after at most:
steps. While this polynomial bound looks large, it is dramatically better than exponential-time exploration strategies. Key takeaway: the bound scales polynomially in all relevant parameters. No algorithm can do better than polynomial in the worst case (information-theoretically), so PAC-MDP algorithms like R-MAX are optimal in this sense.
To put the bound in context: with |S|=10, |A|=4, γ=0.9, ε=0.1, δ=0.05, R-MAX is guaranteed to find an 0.1-optimal policy after at most ~106 steps. This sounds large, but it's a worst-case guarantee. On typical problems, convergence happens orders of magnitude faster — as the showcase simulation in Chapter 9 demonstrates. The PAC bound matters most for safety-critical applications where you need a provable guarantee on worst-case performance, not just average-case empirical results.
| Strategy | Mechanism | Limitation |
|---|---|---|
| ε-Greedy | Random action with prob ε | Explores locally, may not reach distant underexplored states |
| Optimistic Init | Initialize Q(s,a) = Qmax | No principled bound on sample complexity |
| UCB-style (per state) | Bonus ~ 1/√N(s,a) | Does not account for navigating to new states |
| R-MAX (next chapter) | Inflated rewards for unknown state-action pairs | Conservative, but has PAC guarantee |
R-MAX solves the exploration problem by turning it into a planning problem. The key idea: if we have visited (s,a) fewer than m times, we don't know its true value. So we pretend it's the best possible state-action pair — we assign it the maximum possible reward rmax and model it as a self-loop (it returns to s).
With these inflated rewards, the value of an unknown (s,a) pair is rmax/(1−γ), which dominates any achievable value in the real environment. So the greedy policy (with no separate exploration mechanism) will naturally navigate to unexplored state-action pairs first.
R-MAX has a provable PAC-MDP guarantee: after a polynomial number of steps in |S|, |A|, 1/ε, 1/δ, and 1/(1−γ), the agent's policy is ε-optimal with probability at least 1−δ. This is a remarkably strong guarantee compared to heuristic exploration.
A concrete R-MAX worked example. 4-state corridor: s0→s1→s2→s3(goal, +100). Deterministic transitions. γ=0.9, rmax=100, m=3. Agent starts at s0.
| Step | State | Action (greedy) | N(s,right) | R̃(s,right) | Reason |
|---|---|---|---|---|---|
| 1–3 | s0 | right | 0→3 | 100 (optimistic) | N<m: unknown, optimistic reward appeals to greedy |
| 4–6 | s1 | right | 0→3 | 100 (optimistic) | s0 now "known"; s1 still optimistic |
| 7–9 | s2 | right | 0→3 | 100 (optimistic) | Systematically exploring the corridor |
| 10 | s3 | — | — | True +100 | Goal reached! Now R̃ real, plan toward s3 |
| 11+ | s0 | right (exploit) | N≥m all | Real estimates | Systematic exploitation to goal |
Total steps to reach the goal: ~12 (3 per state × 4 states, rounded up). Compare to ε=0.15 greedy: with probability 0.853≈0.61, the agent walks straight to s3 without exploring side transitions. But with probability 0.39, it wastes steps taking random left actions that don't advance. R-MAX reaches the goal deterministically in O(|S|×m) steps — no randomness required.
python class RMAXAgent: def __init__(self, n_states, n_actions, r_max, m, gamma): self.n_states = n_states self.n_actions = n_actions self.r_max = r_max # maximum possible reward self.m = m # threshold: "known" after m visits self.gamma = gamma self.N = np.zeros((n_states, n_actions, n_states)) self.rho = np.zeros((n_states, n_actions)) self.U = np.full(n_states, r_max / (1 - gamma)) # optimistic init def T_hat(self, s, a): total = self.N[s, a].sum() if total < self.m: # Unknown: self-loop (agent "stays" in s with reward r_max) T = np.zeros(self.n_states) T[s] = 1.0 return T return self.N[s, a] / total def R_hat(self, s, a): total = self.N[s, a].sum() if total < self.m: return self.r_max # optimistic reward for unknown (s,a) return self.rho[s, a] / total def act(self, s): # Greedy w.r.t. optimistic model — no separate exploration needed Q = [self.R_hat(s, a) + self.gamma * self.T_hat(s, a) @ self.U for a in range(self.n_actions)] return np.argmax(Q) def update(self, s, a, r, s_next): self.N[s, a, s_next] += 1 self.rho[s, a] += r # Re-solve the optimistic MDP (here: one Bellman update) Q = [self.R_hat(s, a) + self.gamma * self.T_hat(s, a) @ self.U for a in range(self.n_actions)] self.U[s] = max(Q)
A 5×5 grid world with goal at bottom-right. Run each explorer for 200 steps and compare coverage. Brighter = more visits. R-MAX systematically covers the grid; ε-greedy wanders.
Maximum likelihood gives us a point estimate for the model. But a point estimate discards crucial information: how uncertain we are. After 2 observations, T̃ = 0.5 is very uncertain. After 2000 observations, T̃ = 0.5 is very reliable. The ML estimate looks the same; our uncertainty doesn't.
Bayesian model learning maintains a full posterior distribution over model parameters. For discrete MDPs, each state-action pair's transition probabilities are a categorical distribution, and its Bayesian conjugate prior is a Dirichlet distribution:
where α0 is the prior pseudocount (typically 1 for a uniform prior). After observing a transition from (s,a) to s', the update is beautifully simple:
That's it. The Dirichlet is its own conjugate prior: a Dirichlet prior plus multinomial observations gives a Dirichlet posterior. The posterior mean — the expected value of T(s'|s,a) — is:
For the complete MDP, the full belief is the product of |S|·|A| independent Dirichlet distributions (one per state-action pair). Uncertainty is encoded in the concentration of the Dirichlet: low counts mean a flat, spread distribution (high uncertainty); high counts mean a concentrated distribution (low uncertainty).
The variance of the Dirichlet posterior gives a direct measure of how uncertain we are about each transition probability. For Dir([α1, α2, ..., αk]) with sum α0 = ∑αi:
With only 1 observation each (uniform prior, N=3 states): α0=3, Var = 1×2/(9×4) = 2/36 ≈ 0.056. After 100 observations (50-30-20 split): α0=103, Var(T(s'=1)) = 51×52/(1032×104) ≈ 0.0024. The variance shrinks by a factor of ~23 with 100 more observations. This variance is what PSRL exploits — high variance Dirichlets produce diverse samples, driving exploration.
python import numpy as np from scipy.stats import dirichlet class BayesianModel: def __init__(self, n_states, n_actions, alpha0=1.0): # Dirichlet prior counts: N[s, a, s'] starts at alpha0 self.N = np.full((n_states, n_actions, n_states), alpha0) self.rho = np.zeros((n_states, n_actions)) # reward accumulator self.n_states = n_states self.n_actions = n_actions def update(self, s, a, r, s_next): # Bayesian update = just increment the count (conjugate prior!) self.N[s, a, s_next] += 1 self.rho[s, a] += r def sample_model(self): # Sample T[s,a] ~ Dirichlet(N[s,a]) for each (s,a) T_sample = np.zeros_like(self.N) for s in range(self.n_states): for a in range(self.n_actions): T_sample[s, a] = dirichlet.rvs(self.N[s, a])[0] return T_sample def posterior_mean(self): # E[T(s'|s,a)] = N[s,a,s'] / sum_s'' N[s,a,s''] counts_sum = self.N.sum(axis=-1, keepdims=True) return self.N / counts_sum # Posterior Sampling RL (PSRL) main loop def psrl(env, n_episodes, gamma=0.9): model = BayesianModel(env.n_states, env.n_actions) for episode in range(n_episodes): # Step 1: Sample a complete MDP from the posterior T_sample = model.sample_model() R_hat = model.rho / model.N.sum(axis=-1) # mean reward estimate # Step 2: Solve the sampled MDP U = np.zeros(env.n_states) for _ in range(100): # value iteration Q = R_hat + gamma * np.einsum('san,n->sa', T_sample, U) U = Q.max(axis=1) pi = Q.argmax(axis=1) # Step 3: Act according to the sampled policy for one episode s = env.reset() while not done: a = pi[s] s_next, r, done = env.step(a) # Step 4: Update Dirichlet posteriors model.update(s, a, r, s_next) s = s_next
Adjust transition counts for three successor states. The triangle shows the Dirichlet distribution over (T(s1), T(s2), T(s3)). Orange dots are samples; yellow dot is the posterior mean. More data = tighter cluster.
Here is a concrete numeric comparison. True T(s'=1|s=0,a=0) = 0.3. We observe 5 transitions from (s=0,a=0): results are [0,0,0,1,0] (4 zeros, 1 one). Compare estimates:
| Method | P(s'=1|s=0,a=0) | P(s'=0|s=0,a=0) | Uncertainty quantified? |
|---|---|---|---|
| ML (5 obs) | 1/5 = 0.200 | 4/5 = 0.800 | No — point estimate only |
| Bayes, prior α0=1 (5 obs) | 2/7 ≈ 0.286 | 5/7 ≈ 0.714 | Yes — Dir([5,2]) spread captures uncertainty |
| Bayes, prior α0=1 (0 obs) | 1/2 = 0.500 | 1/2 = 0.500 | Yes — maximum uncertainty |
| True value | 0.300 | 0.700 | — |
The Bayesian estimate (0.286) is slightly closer to the truth (0.300) than the ML estimate (0.200) with only 5 samples, because the prior "regularizes" the estimate. More importantly, the Dirichlet Dir([5,2]) encodes our uncertainty: the 95% credible interval for P(s'=1) is approximately [0.08, 0.60] — we know our estimate is unreliable. The ML estimate of 0.200 comes with no such warning.
The Bayesian posterior over models opens a profound question: if our state of knowledge about the model is captured by a distribution, why not plan optimally over that distribution? This is the idea behind Bayes-Adaptive MDPs (BAMDPs).
A BAMDP augments the original state space with the current model belief. The hyperstate is the pair (s, φ) where s is the MDP state and φ is the belief over model parameters (the vector of Dirichlet count vectors). Actions change both the underlying state (via the true transition) and the belief (by revealing a new observation).
Solving the BAMDP exactly yields the Bayes-optimal policy — the policy that optimally trades off exploration (reducing φ uncertainty) and exploitation (maximizing immediate reward). This is the theoretical gold standard for model-based RL.
Despite intractability in general, BAMDPs provide a theoretical framework. Approximate methods (like posterior sampling in the next chapter) can be understood as tractable approximations to the Bayes-optimal BAMDP solution.
A worked example with numbers. Consider a tiny 2-state, 2-action MDP. True T = [[0.7,0.3],[0.4,0.6]]. We have a Dirichlet prior and after 3 observations of action a=0 from state s=0, we saw [s'=0 twice, s'=1 once]. The hyperstate φ(s=0,a=0) = [3, 2] (prior + counts). Bayes-optimal planning in the BAMDP accounts for the information value of taking action a=0 at s=0: learning that it transitions to s'=1 changes future planning. This is the exploration bonus the Bayes-optimal policy exploits automatically.
| Observation | N(s=0,a=0,s'=0) | N(s=0,a=0,s'=1) | E[T(s'=1|s=0,a=0)] | 95% CI |
|---|---|---|---|---|
| Prior only | 1 | 1 | 0.50 | [0.03, 0.97] |
| 3 obs (2x s'=0, 1x s'=1) | 3 | 2 | 0.40 | [0.12, 0.74] |
| 10 obs (7x s'=0, 3x s'=1) | 8 | 4 | 0.33 | [0.15, 0.56] |
| 100 obs (70x s'=0, 30x s'=1) | 71 | 31 | 0.30 | [0.22, 0.39] |
The BAMDP is intractable. But there is an elegant practical approximation: posterior sampling RL (PSRL), also called Thompson sampling for MDPs. The algorithm is surprisingly simple:
Why does this work? When the Dirichlet posterior is diffuse (high uncertainty), samples vary widely across MDPs. Different samples lead to different policies, driving diverse exploration. As the posterior concentrates (more data), samples converge to the true MDP, and the policy stabilizes on the optimal behavior.
PSRL has strong theoretical guarantees: its Bayesian regret (expected cumulative suboptimality relative to the Bayes-optimal BAMDP policy) is near-optimal. Empirically, it often matches or exceeds more complex model-based algorithms while being simple to implement.
Here is the critical intuition with numbers. Suppose we have a 2-state MDP and are currently uncertain whether s1 is the high-reward state or s2. Our Dirichlet posterior concentrates roughly 50/50 on each. In a single episode of PSRL:
Both paths provide information. The true MDP is explored at exactly the rate our uncertainty warrants. As data accumulates, the posterior sharpens, and sampling becomes more consistent, steering the agent toward the true optimum. Compare this to UCB-style bonuses which require careful manual tuning of the exploration coefficient.
| Metric | ε-Greedy | R-MAX | PSRL |
|---|---|---|---|
| Exploration mechanism | Random actions (prob ε) | Optimistic rewards | Sampled MDPs |
| Hyperparameters | ε (manually tuned) | m, rmax | α0 (prior strength) |
| Sample complexity bound | None (heuristic) | PAC-MDP (polynomial) | Near-optimal Bayes regret |
| Handles stochastic rewards? | Yes | Yes (with extension) | Yes (Bayesian prior on R) |
| Implementation complexity | Very low | Low | Low (Dirichlet + VI) |
Watch model-based RL in action on a 5×5 grid world. The agent starts at the top-left and must reach the goal (bottom-right, +10 reward). Each step costs −0.1. The transition model is unknown — the agent must learn it from scratch.
Choose between ML + random updates, ML + prioritized sweeping, and posterior sampling. Watch the model being learned, the value function evolving, and the agent's path improve over episodes.
Left: value function V(s) (brightness = value). Right: visit count heatmap. The agent (orange dot) explores and learns. Goal = yellow star.
| Method | Model Type | Update Scheme | Exploration | Guarantee |
|---|---|---|---|---|
| ML + Full Update | Point estimate | Re-solve entire MDP | Separate (ε-greedy) | None |
| ML + Randomized | Point estimate | Dyna-style | Separate (ε-greedy) | None |
| ML + Prioritized | Point estimate | Value wave propagation | Separate (ε-greedy) | None |
| R-MAX | Optimistic point est. | Any | Built-in (optimism) | PAC-MDP |
| BAMDP (exact) | Dirichlet posterior | DP on hyperstate | Optimal (intractable) | Bayes-optimal |
| PSRL | Dirichlet posterior | Sample + solve | Automatic via sampling | Near-optimal regret |
Benchmarking the four methods in the 5×5 grid world simulation above reveals the following typical convergence characteristics (these are empirical observations, not theoretical bounds):
| Method | Episodes to convergence | Final avg reward/episode | Notes |
|---|---|---|---|
| ML + Random (m=10) | ~80–120 | −0.5 to 0.2 | Simple, fast per step, slower convergence |
| ML + Prioritized | ~50–80 | −0.2 to 0.5 | Faster propagation, same model quality |
| R-MAX (m=3) | ~30–50 | 0.1 to 0.8 | Systematic exploration pays off early |
| PSRL | ~40–70 | 0.0 to 0.6 | Automatic exploration; variance between runs is higher |
Practical model selection guide. The choice among these methods often comes down to three practical considerations: (1) how long does the agent have to explore before exploitation matters, (2) is there a budget on compute per step, and (3) do you need formal guarantees or are empirical convergence properties sufficient?
| Consideration | Choose | Rationale |
|---|---|---|
| Hard time budget (must converge in N steps) | R-MAX | PAC guarantee: polynomial sample complexity regardless of initial conditions |
| Unknown state space, partial observability | PSRL | Bayesian exploration adapts naturally; no need to prespecify m threshold |
| Small discrete MDP, fast compute per step | Prioritized sweeping + ε-greedy | Simple, interpretable, low overhead per step |
| Reward shaping possible (domain knowledge) | ML + Dyna-Q | Shaped rewards guide exploration; model-based planning amplifies each real step |
| Continuous state with known parametric structure | Gaussian process model-based RL | Beyond this chapter; but Bayesian principles from PSRL extend naturally |