Learning value functions and optimal policies directly from episodes of experience — no model required.
In Chapter 4 you learned Dynamic Programming: sweep through every state, apply the Bellman equation, converge to optimal values. Beautiful — but it requires a complete model of the environment. You must know every transition probability and every reward. What if you don't?
Imagine you're learning to play blackjack. You don't know the exact probability of drawing each card from the shoe. But you can play thousands of hands and track your winnings. Over time, you learn which situations are favorable. That is the Monte Carlo idea: learn from complete episodes of experience.
Monte Carlo (MC) methods require only experience — sequences of states, actions, and rewards from actual or simulated episodes. They need no model of the environment's dynamics. The critical requirement is that episodes must terminate: MC learns from complete returns, so it must wait until an episode ends.
There are two properties that make MC methods distinctive. First, they do not bootstrap — unlike DP, they do not update value estimates based on other value estimates. They wait for the real return. Second, they can focus on states of interest rather than sweeping the entire state space. If you only care about a few starting positions, MC can estimate their values without touching the rest.
The term "Monte Carlo" broadly refers to any method that uses random sampling to estimate a quantity. In RL, it means averaging complete episode returns to estimate value functions. The randomness comes from the policy and the environment — each episode unfolds differently.
DP uses the model to compute expected values. MC generates episodes and averages returns. Click "Run Episodes" to see the MC estimate converge to the true value (dashed line).
The prediction problem: given a policy π, estimate vπ(s). DP solved this with iterative sweeps using the model. MC solves it by a stunningly simple idea: play episodes under π, and average the returns that follow each visit to s.
Concretely, the return from time step t is the discounted sum of future rewards:
After generating many episodes, the value of state s is estimated as the average of all the returns observed after visiting s:
By the law of large numbers, this average converges to vπ(s) = Eπ[Gt | St = s] as the number of visits grows. No model needed. No bootstrapping. Just play and average.
There is a subtlety: a state might appear multiple times in a single episode. Do we average the return from every visit, or only the first? This gives rise to two variants — first-visit MC and every-visit MC — which we explore next.
pseudocode # MC Prediction (first-visit) Input: policy π Initialize V(s) arbitrarily, Returns(s) ← empty list repeat forever: Generate episode under π: S0, A0, R1, S1, …, ST G ← 0 for t = T−1, T−2, …, 0: G ← γ · G + Rt+1 if St not in {S0, …, St−1}: append G to Returns(St) V(St) ← average(Returns(St))
When a state s appears multiple times in an episode, we have a choice. First-visit MC averages only the return following the first time s is visited in each episode. Every-visit MC averages the returns following every time s is visited.
| First-Visit MC | Every-Visit MC | |
|---|---|---|
| Which returns? | Only from first visit per episode | From every visit in every episode |
| Bias | Unbiased | Biased (but consistent) |
| Convergence | Converges to vπ(s) | Converges to vπ(s) |
| Variance | Higher (fewer samples) | Lower (more samples) |
| Common use | Most textbook algorithms | Sometimes preferred in practice |
Both converge to the true value as the number of episodes grows. First-visit MC is simpler to analyze theoretically because each return sample is independent. Every-visit MC generates correlated samples (multiple returns from the same episode), but uses more data per episode.
A state is visited multiple times per episode. Watch how the two estimators converge. Orange = first-visit, Teal = every-visit, dashed = true value.
Blackjack is the classic MC example from Sutton & Barto. The state is your current sum (12–21), whether you have a usable ace, and the dealer's showing card (1–10). The policy is simple: hit on anything below 20, stick on 20 or 21. You play thousands of hands, record the returns (+1 win, -1 loss, 0 draw), and average them to build a value surface.
The result is the famous 3D plot from the book: a surface over (player sum, dealer showing) for each of the two cases (usable ace, no usable ace). The surface reveals which situations are favorable (+1) and which are hopeless (-1).
Play blackjack hands and watch the state-value surface build up. The surface shows V(player sum, dealer showing) for the stick-on-20 policy. Toggle between usable/no usable ace views.
After just a few hundred episodes, the surface is noisy — many states have been visited only once or twice. But after 10,000+ episodes, the characteristic shape from the textbook emerges: high values when the player has 20 or 21, a sharp cliff when the player is likely to bust, and interesting structure depending on the dealer's card.
For control (finding optimal policies), we need action values qπ(s,a), not just state values vπ(s). Why? Because without a model, we can't do the one-step lookahead that DP uses to extract a policy from values. With action values, the policy is simply: pick the action with the highest q(s,a).
Estimating q(s,a) with MC works the same way as for states: average the returns following visits to the state-action pair (s,a). But there's a problem: if the policy is deterministic, many state-action pairs will never be visited. If the policy always takes action "stick" in some state, we'll never see returns for action "hit" in that state.
One solution is exploring starts: guarantee that every state-action pair has a nonzero probability of being selected as the starting pair. Each episode begins at a randomly chosen (s,a), then follows the policy from there. This ensures all pairs are sampled eventually.
Exploring starts is sometimes practical (in simulation, you can start anywhere) and sometimes not (a physical robot can't teleport). Later in this chapter we'll see how epsilon-soft policies and off-policy methods solve exploration without needing magical starting conditions.
| We need | The problem | Solutions |
|---|---|---|
| q(s,a) for all (s,a) | Deterministic π doesn't visit all actions | Exploring starts |
| ε-soft policies | ||
| Off-policy methods |
MC control follows the same GPI (Generalized Policy Iteration) pattern from Chapter 4: alternate between evaluation (estimate qπ) and improvement (make the policy greedy with respect to q). The difference is that evaluation uses MC averaging instead of DP sweeps.
In DP, we could evaluate a policy to convergence (many sweeps), then improve. But MC evaluation requires infinitely many episodes to converge exactly. Must we wait forever between improvements?
No. We can improve the policy after every episode. After each episode, update Q for the visited pairs, then make the policy greedy. This interleaving of partial evaluation and improvement still converges — GPI is robust to this. In the extreme, we evaluate for a single episode, improve, and repeat.
The remaining challenge is exploration. If the policy becomes deterministic and greedy too early, it might lock onto a suboptimal action and never try alternatives. The next two chapters address this with exploring starts and epsilon-soft policies.
The Monte Carlo ES (Exploring Starts) algorithm is the simplest complete MC control method. It guarantees exploration by starting each episode at a random state-action pair, then following the current greedy policy. After each episode, it updates Q and improves the policy.
pseudocode # Monte Carlo ES (Exploring Starts) Initialize Q(s,a) arbitrarily, π(s) arbitrarily Initialize Returns(s,a) ← empty list repeat forever: Choose S0, A0 randomly (exploring starts) Generate episode from S0, A0 following π G ← 0 for t = T−1, T−2, …, 0: G ← γ · G + Rt+1 if (St, At) not in {(S0,A0), …, (St−1,At−1)}: append G to Returns(St, At) Q(St, At) ← average(Returns(St, At)) π(St) ← argmaxa Q(St, a)
MC ES converges to the optimal policy and optimal action values under the assumption of exploring starts and infinite episodes. The proof relies on the same GPI argument: each evaluation step moves Q closer to qπ, and each improvement step makes π at least as good.
Watch MC ES find the optimal policy on a 4x4 grid. Random exploring starts ensure all state-action pairs get sampled. Arrows show the current greedy policy.
How do we ensure exploration without exploring starts? The answer is ε-soft policies: policies that assign at least ε/|A(s)| probability to every action. The simplest example is ε-greedy, which takes the greedy action with probability 1 − ε + ε/|A(s)| and each non-greedy action with probability ε/|A(s)|.
This is an on-policy approach: the policy being evaluated and improved is the same one generating behavior. We learn about the ε-greedy policy by following the ε-greedy policy.
The on-policy MC control algorithm is identical to MC ES, except: (1) we don't need exploring starts, and (2) after computing argmaxa Q(s,a), we set the policy to ε-greedy rather than fully greedy. Convergence to the optimal ε-soft policy is guaranteed.
pseudocode # On-Policy MC Control (ε-soft) Initialize Q(s,a) arbitrarily, π ← ε-greedy w.r.t. Q repeat forever: Generate episode following π G ← 0 for t = T−1, T−2, …, 0: G ← γ · G + Rt+1 if (St, At) first visit: append G to Returns(St, At) Q(St, At) ← average(Returns(St, At)) π(St) ← ε-greedy w.r.t. Q(St, ·)
What if we want to learn about the optimal policy while still exploring? The idea is to use two policies: a behavior policy b that generates episodes (and explores), and a target policy π that we're trying to evaluate and improve (and can be greedy).
This separation is called off-policy learning. The behavior policy explores (so all state-action pairs get visited), while the target policy is free to be deterministic and greedy. We learn about π from data generated by b.
The challenge: episodes generated by b don't have the same distribution of returns as π would produce. A trajectory that is likely under b might be rare under π, and vice versa. We need a correction. That correction is importance sampling.
The requirement for off-policy learning is coverage: b(a|s) > 0 whenever π(a|s) > 0. If the target policy would take some action, the behavior policy must also sometimes take it. Otherwise, some state-action pairs would never appear in the data, and we couldn't estimate their values.
Given a trajectory generated by b, how do we estimate what the return would have been under π? We weight each trajectory by the importance sampling ratio — the relative probability of that trajectory under π vs b:
If π would have taken the same actions more often than b did, ρ > 1 (upweight). If less often, ρ < 1 (downweight). If π would never take some action that b took, ρ = 0 (ignore the trajectory entirely).
Notice the transition probabilities p(s'|s,a) cancel out — they appear identically in numerator and denominator. The ratio depends only on the two policies, not on the dynamics. This is why importance sampling works for model-free RL.
There are two ways to use ρ to form an estimate. Ordinary importance sampling takes a simple average of the weighted returns:
Weighted importance sampling normalizes by the sum of the weights:
| Ordinary IS | Weighted IS | |
|---|---|---|
| Bias | Unbiased | Biased (but consistent) |
| Variance | Can be unbounded! | Bounded, typically much lower |
| Single-sample | Could be wildly off | Equals the sample (ratio = 1) |
| Preferred? | Rarely in practice | Almost always preferred |
Both estimators converge to the true value, but watch the variance difference. Orange = ordinary IS (wild swings), Teal = weighted IS (stable). Dashed = true value.
Now we assemble the pieces: off-policy MC control with weighted importance sampling. The target policy π is greedy with respect to Q. The behavior policy b is any soft policy (e.g., ε-greedy) that provides coverage. We use weighted IS to correct the returns.
pseudocode # Off-Policy MC Control (weighted IS) Initialize Q(s,a) arbitrarily, C(s,a) ← 0 π(s) ← argmaxa Q(s,a) repeat forever: Generate episode using behavior policy b G ← 0, W ← 1 for t = T−1, T−2, …, 0: G ← γ · G + Rt+1 C(St, At) ← C(St, At) + W Q(St, At) ← Q(St, At) + (W / C(St, At)) · [G − Q(St, At)] π(St) ← argmaxa Q(St, a) if At ≠ π(St) then break W ← W · 1/b(At|St)
The key trick is the cumulative weight C(s,a), which replaces the explicit denominator in weighted IS. The update Q ← Q + (W/C)(G − Q) is an incremental weighted average. The variable W accumulates the importance sampling ratio as we go backward through the episode.
This algorithm converges to q* (the optimal action values) and π* (the optimal policy), assuming b provides coverage and we run enough episodes. The coverage requirement means b must sometimes take every action that π might ever want — an ε-greedy behavior policy handles this naturally.
Monte Carlo methods learn from complete episodes of experience without requiring a model of the environment. They estimate value functions by averaging sample returns, and control policies by interleaving MC evaluation with greedy improvement (GPI). The three major approaches are exploring starts, on-policy (ε-soft), and off-policy (importance sampling).
| DP (Ch 4) | MC (Ch 5) | |
|---|---|---|
| Model required? | Yes — full p(s',r|s,a) | No — only episodes |
| Bootstraps? | Yes — V(s') in updates | No — uses complete returns G |
| Episode required? | No | Yes — must terminate |
| State space sweep? | All states every sweep | Only visited states |
| Update target | Expected value (model-based) | Sample return (experience-based) |
Strengths of MC:
• No model needed
• No bootstrapping bias
• Can focus on states of interest
• Less sensitive to violated Markov property
• Simple to understand and implement
Limitations of MC:
• Must wait until episode ends
• Not for continuing (non-episodic) tasks
• High variance (especially off-policy)
• Slow convergence with long episodes
• Exploring starts often impractical
What comes next: MC must wait for the episode to end. What if we could update after every single step? Chapter 6 introduces Temporal-Difference (TD) learning, which combines MC's model-free sampling with DP's bootstrapping: update value estimates based on other estimates, without waiting for the final return. TD methods are the foundation of modern RL algorithms like Q-learning and SARSA.