Ch 5: Monte Carlo Methods — Sutton & Barto RL

Chapter 0: Learning from Experience

In Chapter 4 you learned Dynamic Programming: sweep through every state, apply the Bellman equation, converge to optimal values. Beautiful — but it requires a complete model of the environment. You must know every transition probability and every reward. What if you don't?

Imagine you're learning to play blackjack. You don't know the exact probability of drawing each card from the shoe. But you can play thousands of hands and track your winnings. Over time, you learn which situations are favorable. That is the Monte Carlo idea: learn from complete episodes of experience.

Monte Carlo (MC) methods require only experience — sequences of states, actions, and rewards from actual or simulated episodes. They need no model of the environment's dynamics. The critical requirement is that episodes must terminate: MC learns from complete returns, so it must wait until an episode ends.

The core shift: DP computes expectations using the model: ∑_s',r p(s',r|s,a)[...]. MC estimates those expectations by averaging sample returns. No model, no bootstrapping — just play episodes and average the results.

DP (Ch 4)

Needs full model p(s',r|s,a). Sweeps all states.

↓ drop the model requirement

MC (Ch 5)

Needs only episodes. Averages returns from experience.

↓ drop the complete-episode requirement

TD (Ch 6)

Learns from single steps. Best of both worlds.

There are two properties that make MC methods distinctive. First, they do not bootstrap — unlike DP, they do not update value estimates based on other value estimates. They wait for the real return. Second, they can focus on states of interest rather than sweeping the entire state space. If you only care about a few starting positions, MC can estimate their values without touching the rest.

The term "Monte Carlo" broadly refers to any method that uses random sampling to estimate a quantity. In RL, it means averaging complete episode returns to estimate value functions. The randomness comes from the policy and the environment — each episode unfolds differently.

MC vs DP: Learning a Value

DP uses the model to compute expected values. MC generates episodes and averages returns. Click "Run Episodes" to see the MC estimate converge to the true value (dashed line).

Check: What does MC require that DP does not?

A complete model of the environment Complete episodes that terminate A neural network to approximate values

Chapter 1: MC Prediction

The prediction problem: given a policy π, estimate v_π(s). DP solved this with iterative sweeps using the model. MC solves it by a stunningly simple idea: play episodes under π, and average the returns that follow each visit to s.

Concretely, the return from time step t is the discounted sum of future rewards:

G_t = R_t+1 + γR_t+2 + γ²R_t+3 + … + γ^T-t-1R_T

After generating many episodes, the value of state s is estimated as the average of all the returns observed after visiting s:

V(s) ≈ average of { G_t : S_t = s }

By the law of large numbers, this average converges to v_π(s) = E_π[G_t | S_t = s] as the number of visits grows. No model needed. No bootstrapping. Just play and average.

Key insight: MC prediction is conceptually simpler than DP. It trades mathematical exactness (DP's full-model computation) for statistical estimation (averaging samples). The more episodes you run, the more accurate your estimate becomes. Each episode is an independent sample of the return.

There is a subtlety: a state might appear multiple times in a single episode. Do we average the return from every visit, or only the first? This gives rise to two variants — first-visit MC and every-visit MC — which we explore next.

pseudocode
# MC Prediction (first-visit)
Input: policy π
Initialize V(s) arbitrarily, Returns(s) ← empty list

repeat forever:
  Generate episode under π: S₀, A₀, R₁, S₁, …, S_T
  G ← 0
  for t = T−1, T−2, …, 0:
    G ← γ · G + R_t+1
    if S_t not in {S₀, …, S_t−1}:
      append G to Returns(S_t)
      V(S_t) ← average(Returns(S_t))

Check: Why does MC process the episode backward (from T-1 to 0)?

So it can compute G_t incrementally: G ← γG + R_t+1 Because later states are more important To avoid updating terminal states

Chapter 2: First-Visit vs Every-Visit

When a state s appears multiple times in an episode, we have a choice. First-visit MC averages only the return following the first time s is visited in each episode. Every-visit MC averages the returns following every time s is visited.

	First-Visit MC	Every-Visit MC
Which returns?	Only from first visit per episode	From every visit in every episode
Bias	Unbiased	Biased (but consistent)
Convergence	Converges to v_π(s)	Converges to v_π(s)
Variance	Higher (fewer samples)	Lower (more samples)
Common use	Most textbook algorithms	Sometimes preferred in practice

Both converge to the true value as the number of episodes grows. First-visit MC is simpler to analyze theoretically because each return sample is independent. Every-visit MC generates correlated samples (multiple returns from the same episode), but uses more data per episode.

In practice: The difference between first-visit and every-visit MC is often small. Sutton & Barto focus primarily on first-visit MC throughout Chapter 5, and it is the default when people say "Monte Carlo" in RL contexts.

First-Visit vs Every-Visit Comparison

A state is visited multiple times per episode. Watch how the two estimators converge. Orange = first-visit, Teal = every-visit, dashed = true value.

Check: What is the key difference between first-visit and every-visit MC?

First-visit uses only the return from the first occurrence of a state per episode First-visit is biased while every-visit is unbiased Every-visit requires a model of the environment

Chapter 3: The Blackjack Example

Blackjack is the classic MC example from Sutton & Barto. The state is your current sum (12–21), whether you have a usable ace, and the dealer's showing card (1–10). The policy is simple: hit on anything below 20, stick on 20 or 21. You play thousands of hands, record the returns (+1 win, -1 loss, 0 draw), and average them to build a value surface.

The result is the famous 3D plot from the book: a surface over (player sum, dealer showing) for each of the two cases (usable ace, no usable ace). The surface reveals which situations are favorable (+1) and which are hopeless (-1).

Why blackjack? It's a perfect MC domain. The transitions are stochastic (random cards), the state space is small enough to visualize, and there's no practical way to write down the full model p(s',r|s,a) because it depends on the composition of the remaining shoe. But you can easily simulate episodes. MC shines exactly in this scenario.

Blackjack: MC Value Surface

Play blackjack hands and watch the state-value surface build up. The surface shows V(player sum, dealer showing) for the stick-on-20 policy. Toggle between usable/no usable ace views.

View 0 episodes

After just a few hundred episodes, the surface is noisy — many states have been visited only once or twice. But after 10,000+ episodes, the characteristic shape from the textbook emerges: high values when the player has 20 or 21, a sharp cliff when the player is likely to bust, and interesting structure depending on the dealer's card.

Watch it build: The key insight is seeing the surface go from chaotic noise (few episodes) to the smooth, structured landscape (many episodes). This is MC prediction in action: each episode adds one data point to each visited state, and the law of large numbers does the rest.

Check: Why is blackjack a natural fit for Monte Carlo methods?

Because blackjack has only two actions Because the transition probabilities depend on the shoe composition, making a model impractical, but episodes are easy to simulate Because blackjack episodes are always very short

Chapter 4: Estimating Action Values

For control (finding optimal policies), we need action values q_π(s,a), not just state values v_π(s). Why? Because without a model, we can't do the one-step lookahead that DP uses to extract a policy from values. With action values, the policy is simply: pick the action with the highest q(s,a).

Estimating q(s,a) with MC works the same way as for states: average the returns following visits to the state-action pair (s,a). But there's a problem: if the policy is deterministic, many state-action pairs will never be visited. If the policy always takes action "stick" in some state, we'll never see returns for action "hit" in that state.

The exploration problem: To estimate q(s,a) for all actions, every state-action pair must be visited. A deterministic policy only visits one action per state. We need exploration. This is the same dilemma from Chapter 2 (bandits) — exploitation vs exploration — but now in a sequential setting.

One solution is exploring starts: guarantee that every state-action pair has a nonzero probability of being selected as the starting pair. Each episode begins at a randomly chosen (s,a), then follows the policy from there. This ensures all pairs are sampled eventually.

Exploring starts is sometimes practical (in simulation, you can start anywhere) and sometimes not (a physical robot can't teleport). Later in this chapter we'll see how epsilon-soft policies and off-policy methods solve exploration without needing magical starting conditions.

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

We need	The problem	Solutions
q(s,a) for all (s,a)	Deterministic π doesn't visit all actions	Exploring starts
		ε-soft policies
		Off-policy methods

Check: Why do we need action values q(s,a) instead of state values v(s) for model-free control?

State values are always inaccurate Without a model, we can't do the lookahead needed to extract a policy from v(s) alone Action values converge faster than state values

Chapter 5: Monte Carlo Control

MC control follows the same GPI (Generalized Policy Iteration) pattern from Chapter 4: alternate between evaluation (estimate q_π) and improvement (make the policy greedy with respect to q). The difference is that evaluation uses MC averaging instead of DP sweeps.

Evaluate

Run episodes under π, average returns → Q(s,a)

↓ make greedy

Improve

π(s) ← argmax_a Q(s,a)

↻ repeat (GPI)

In DP, we could evaluate a policy to convergence (many sweeps), then improve. But MC evaluation requires infinitely many episodes to converge exactly. Must we wait forever between improvements?

No. We can improve the policy after every episode. After each episode, update Q for the visited pairs, then make the policy greedy. This interleaving of partial evaluation and improvement still converges — GPI is robust to this. In the extreme, we evaluate for a single episode, improve, and repeat.

Key insight: GPI does not require exact evaluation between improvements. MC control works by evaluating for one episode, improving, evaluating for one episode, improving — and the whole system still converges to the optimal policy. This is the same principle that makes value iteration work in DP.

The remaining challenge is exploration. If the policy becomes deterministic and greedy too early, it might lock onto a suboptimal action and never try alternatives. The next two chapters address this with exploring starts and epsilon-soft policies.

Check: How does MC control differ from DP's policy iteration?

MC control doesn't use the improvement step MC control requires a model for the improvement step MC control estimates Q via episode returns instead of model-based sweeps

Chapter 6: MC with Exploring Starts

The Monte Carlo ES (Exploring Starts) algorithm is the simplest complete MC control method. It guarantees exploration by starting each episode at a random state-action pair, then following the current greedy policy. After each episode, it updates Q and improves the policy.

pseudocode
# Monte Carlo ES (Exploring Starts)
Initialize Q(s,a) arbitrarily, π(s) arbitrarily
Initialize Returns(s,a) ← empty list

repeat forever:
  Choose S₀, A₀ randomly (exploring starts)
  Generate episode from S₀, A₀ following π
  G ← 0
  for t = T−1, T−2, …, 0:
    G ← γ · G + R_t+1
    if (S_t, A_t) not in {(S₀,A₀), …, (S_t−1,A_t−1)}:
      append G to Returns(S_t, A_t)
      Q(S_t, A_t) ← average(Returns(S_t, A_t))
      π(S_t) ← argmax_a Q(S_t, a)

MC ES converges to the optimal policy and optimal action values under the assumption of exploring starts and infinite episodes. The proof relies on the same GPI argument: each evaluation step moves Q closer to q_π, and each improvement step makes π at least as good.

Limitation: Exploring starts assumes you can begin an episode in any state-action pair. In many real-world problems this is impossible — you can't ask a self-driving car to start in the middle of an intersection executing a left turn. The next section removes this assumption.

MC ES on a Simple Grid

Watch MC ES find the optimal policy on a 4x4 grid. Random exploring starts ensure all state-action pairs get sampled. Arrows show the current greedy policy.

      0 episodes
    

Check: What assumption does MC ES make about exploration?

Every episode starts with a randomly chosen state-action pair The agent uses an epsilon-greedy policy The agent has a model to plan with

Chapter 7: On-Policy Control

How do we ensure exploration without exploring starts? The answer is ε-soft policies: policies that assign at least ε/|A(s)| probability to every action. The simplest example is ε-greedy, which takes the greedy action with probability 1 − ε + ε/|A(s)| and each non-greedy action with probability ε/|A(s)|.

This is an on-policy approach: the policy being evaluated and improved is the same one generating behavior. We learn about the ε-greedy policy by following the ε-greedy policy.

π(a|s) =
1 − ε + ε/|A(s)| if a = argmax_a' Q(s,a')
ε/|A(s)| otherwise

The tradeoff: ε-soft policies guarantee exploration (every action has nonzero probability), so we never need exploring starts. The price is that the optimal policy within the class of ε-soft policies is not truly optimal — it still explores with probability ε, even when it knows the best action. We converge to the best ε-soft policy, not to π_*.

The on-policy MC control algorithm is identical to MC ES, except: (1) we don't need exploring starts, and (2) after computing argmax_a Q(s,a), we set the policy to ε-greedy rather than fully greedy. Convergence to the optimal ε-soft policy is guaranteed.

pseudocode
# On-Policy MC Control (ε-soft)
Initialize Q(s,a) arbitrarily, π ← ε-greedy w.r.t. Q

repeat forever:
  Generate episode following π
  G ← 0
  for t = T−1, T−2, …, 0:
    G ← γ · G + R_t+1
    if (S_t, A_t) first visit:
      append G to Returns(S_t, A_t)
      Q(S_t, A_t) ← average(Returns(S_t, A_t))
      π(S_t) ← ε-greedy w.r.t. Q(S_t, ·)

Check: What is the tradeoff of using ε-soft policies for on-policy MC?

We get exploration without exploring starts, but converge to the best ε-soft policy, not the truly optimal policy We converge to the optimal policy but much slower We need a model for the improvement step

Chapter 8: Off-Policy Learning

What if we want to learn about the optimal policy while still exploring? The idea is to use two policies: a behavior policy b that generates episodes (and explores), and a target policy π that we're trying to evaluate and improve (and can be greedy).

This separation is called off-policy learning. The behavior policy explores (so all state-action pairs get visited), while the target policy is free to be deterministic and greedy. We learn about π from data generated by b.

Why off-policy matters: On-policy methods are limited to learning about the policy they follow, which must explore. Off-policy methods break this coupling. They also enable learning from data generated by other agents, from human demonstrations, or from old policies — a form of experience reuse.

The challenge: episodes generated by b don't have the same distribution of returns as π would produce. A trajectory that is likely under b might be rare under π, and vice versa. We need a correction. That correction is importance sampling.

Behavior Policy b

Generates the data. Must explore: b(a|s) > 0 wherever π(a|s) > 0

↓ importance sampling correction

Target Policy π

The policy we want to evaluate/improve. Can be greedy.

The requirement for off-policy learning is coverage: b(a|s) > 0 whenever π(a|s) > 0. If the target policy would take some action, the behavior policy must also sometimes take it. Otherwise, some state-action pairs would never appear in the data, and we couldn't estimate their values.

Check: What is the "coverage" requirement for off-policy learning?

The target policy must explore The behavior policy must be optimal The behavior policy must have nonzero probability wherever the target policy does

Chapter 9: Importance Sampling

Given a trajectory generated by b, how do we estimate what the return would have been under π? We weight each trajectory by the importance sampling ratio — the relative probability of that trajectory under π vs b:

ρ_t:T-1 = ∏_k=t^T-1 π(A_k|S_k) / b(A_k|S_k)

If π would have taken the same actions more often than b did, ρ > 1 (upweight). If less often, ρ < 1 (downweight). If π would never take some action that b took, ρ = 0 (ignore the trajectory entirely).

Notice the transition probabilities p(s'|s,a) cancel out — they appear identically in numerator and denominator. The ratio depends only on the two policies, not on the dynamics. This is why importance sampling works for model-free RL.

There are two ways to use ρ to form an estimate. Ordinary importance sampling takes a simple average of the weighted returns:

V_OIS(s) = ( ∑ ρ_t:T-1 G_t ) / n

Weighted importance sampling normalizes by the sum of the weights:

V_WIS(s) = ( ∑ ρ_t:T-1 G_t ) / ( ∑ ρ_t:T-1 )

	Ordinary IS	Weighted IS
Bias	Unbiased	Biased (but consistent)
Variance	Can be unbounded!	Bounded, typically much lower
Single-sample	Could be wildly off	Equals the sample (ratio = 1)
Preferred?	Rarely in practice	Almost always preferred

Variance is the killer: Ordinary IS is unbiased but can have enormous (even infinite) variance because the ratio ρ is a product of many terms. If the trajectory is long, ρ can explode or collapse. Weighted IS dramatically reduces this by self-normalizing. In practice, weighted IS is almost always preferred.

Ordinary vs Weighted Importance Sampling

Both estimators converge to the true value, but watch the variance difference. Orange = ordinary IS (wild swings), Teal = weighted IS (stable). Dashed = true value.

Check: Why does ordinary IS have higher variance than weighted IS?

Ordinary IS uses fewer samples The importance ratio ρ can be very large, and ordinary IS divides by a fixed n rather than normalizing by the sum of weights Weighted IS uses a model to reduce variance

Chapter 10: Off-Policy MC Control

Now we assemble the pieces: off-policy MC control with weighted importance sampling. The target policy π is greedy with respect to Q. The behavior policy b is any soft policy (e.g., ε-greedy) that provides coverage. We use weighted IS to correct the returns.

pseudocode
# Off-Policy MC Control (weighted IS)
Initialize Q(s,a) arbitrarily, C(s,a) ← 0
π(s) ← argmax_a Q(s,a)

repeat forever:
  Generate episode using behavior policy b
  G ← 0, W ← 1
  for t = T−1, T−2, …, 0:
    G ← γ · G + R_t+1
    C(S_t, A_t) ← C(S_t, A_t) + W
    Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) · [G − Q(S_t, A_t)]
    π(S_t) ← argmax_a Q(S_t, a)
    if A_t ≠ π(S_t) then break
    W ← W · 1/b(A_t|S_t)

The key trick is the cumulative weight C(s,a), which replaces the explicit denominator in weighted IS. The update Q ← Q + (W/C)(G − Q) is an incremental weighted average. The variable W accumulates the importance sampling ratio as we go backward through the episode.

Early termination: The loop breaks when A_t ≠ π(S_t). Why? Because π is greedy and deterministic, so π(A_t|S_t) = 0 for any non-greedy action, making ρ = 0. All earlier time steps in the episode would get zero weight, so we can skip them. This is a huge efficiency gain — we often only process the tail end of each episode.

This algorithm converges to q_* (the optimal action values) and π_* (the optimal policy), assuming b provides coverage and we run enough episodes. The coverage requirement means b must sometimes take every action that π might ever want — an ε-greedy behavior policy handles this naturally.

Check: Why does the off-policy MC control loop break when A_t ≠ π(S_t)?

Because the importance ratio becomes 0 (the greedy target would never take that action), making all earlier weights zero Because the episode has ended To save computation when the policy has converged

Chapter 11: Summary

Monte Carlo methods learn from complete episodes of experience without requiring a model of the environment. They estimate value functions by averaging sample returns, and control policies by interleaving MC evaluation with greedy improvement (GPI). The three major approaches are exploring starts, on-policy (ε-soft), and off-policy (importance sampling).

	DP (Ch 4)	MC (Ch 5)
Model required?	Yes — full p(s',r\|s,a)	No — only episodes
Bootstraps?	Yes — V(s') in updates	No — uses complete returns G
Episode required?	No	Yes — must terminate
State space sweep?	All states every sweep	Only visited states
Update target	Expected value (model-based)	Sample return (experience-based)

Strengths of MC:

• No model needed
• No bootstrapping bias
• Can focus on states of interest
• Less sensitive to violated Markov property
• Simple to understand and implement

Limitations of MC:

• Must wait until episode ends
• Not for continuing (non-episodic) tasks
• High variance (especially off-policy)
• Slow convergence with long episodes
• Exploring starts often impractical

What comes next: MC must wait for the episode to end. What if we could update after every single step? Chapter 6 introduces Temporal-Difference (TD) learning, which combines MC's model-free sampling with DP's bootstrapping: update value estimates based on other estimates, without waiting for the final return. TD methods are the foundation of modern RL algorithms like Q-learning and SARSA.

The big picture: DP, MC, and TD are three different ways to estimate value functions, distinguished by what information they use. DP uses the model, MC uses complete returns, TD uses one-step returns plus current estimates. Chapter 5 showed that you can drop the model entirely and still find optimal policies. Chapter 6 will show you can also drop the requirement for complete episodes.

"Monte Carlo simulation is an incredibly powerful technique
because it can give answers when nothing else can."
— Sheldon Ross

Check: What single change distinguishes TD (Chapter 6) from MC?

TD requires a model TD bootstraps — it updates from estimated values rather than waiting for the complete return TD cannot handle stochastic policies