Sutton & Barto, Chapter 3

Finite MDPs

Formalizing the reinforcement learning problem: states, actions, rewards, and the equations that tie them together.

Prerequisites: Chapter 2 (Multi-armed Bandits). Bandits were the appetizer — now we add states.
11
Chapters
6+
Simulations
11
Quizzes

Chapter 0: The Agent-Environment Interface

In Chapter 2 you pulled slot machine arms and collected rewards. But nothing changed — the machines had no memory, no state. Now imagine something richer: a robot navigating a warehouse, a chess player pondering the board, a thermostat adjusting temperature. In each case, what the agent should do depends on where it is.

This is the fundamental loop of reinforcement learning. At each time step t, the agent observes a state St, takes an action At, and the environment responds with a reward Rt+1 and a new state St+1. Then the cycle repeats. Every RL problem — games, robotics, recommendation systems — fits this loop.

The key upgrade from bandits: Your action doesn't just earn a reward — it also changes where you are. The agent must consider not just immediate payoff but the consequences of moving to a new state.
The Agent-Environment Loop

Watch the interaction cycle. Click Step to advance one time step, or Auto to run continuously.

Agent observes
State St
Agent selects
Action At
Environment returns
Reward Rt+1 and next state St+1
↻ repeat
In the agent-environment loop, what does the agent receive after taking action At in state St?

Chapter 1: The MDP Framework

The loop from Chapter 0 is intuitive, but we need mathematics to reason about it precisely. A Markov Decision Process (MDP) is a formal description of that loop. It's defined by a set of states S, a set of actions A, and a single function that captures all of the environment's behavior:

p(s', r | s, a) = Pr{St+1 = s', Rt+1 = r | St = s, At = a}

This is the dynamics function. It gives the probability of transitioning to state s' and receiving reward r, given the current state s and action a. This one function specifies everything about the environment — the transition probabilities, the expected rewards, everything.

The word Markov means that the future depends only on the current state, not on the history of how you got there. The state St captures all relevant information. A chess position is Markov: you don't need to remember the sequence of moves that produced it, only the current board.

The Markov property: Given the present state, the future is independent of the past. All the information the agent needs is compressed into St. This is what makes MDPs tractable — you don't need to remember the entire history.

From p(s', r | s, a) we can derive anything we need. For example, the state-transition probabilities:

p(s' | s, a) = ∑r ∈ R p(s', r | s, a)

And the expected reward for a state-action pair:

r(s, a) = E[Rt+1 | St = s, At = a] = ∑r ∈ R r ∑s' ∈ S p(s', r | s, a)
One function to rule them all. The dynamics function p(s', r | s, a) is the MDP's DNA. If you know this function, you know the entire environment. Many RL methods exist precisely because in practice we don't know it and must learn from experience.
What does the Markov property guarantee?

Chapter 2: Goals & Rewards

How do we tell an agent what we want it to achieve? In RL, the answer is elegantly simple: through a single scalar number — the reward. At every time step, the environment sends the agent a reward signal. The agent's entire purpose is to maximize the total reward it accumulates over time.

This is the reward hypothesis: all goals and purposes can be expressed as the maximization of the expected value of a cumulative scalar reward signal. Want a robot to walk? Give +1 for each second standing, -100 for falling. Want a chess agent? +1 for a win, -1 for a loss, 0 otherwise.

The reward hypothesis: That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (called reward). This is a philosophical claim, not a theorem — but it's the foundation of RL.

Notice the subtle art of reward design. The reward should signal what you want achieved, not how to achieve it. Giving a chess agent +1 for capturing pieces might lead to strategies that capture pieces recklessly instead of winning. The reward for winning the game is cleaner — it lets the agent discover its own strategy.

Reward Signal Explorer

A simple 5-cell grid. The agent starts on the left. Click cells to set their reward, then watch how different rewards change behavior.

According to the reward hypothesis, how should we specify goals?

Chapter 3: Returns & Episodes

The agent wants to maximize reward, but reward over what? Not just the immediate reward — an agent that grabs the nearest cookie ignoring the cake behind it is short-sighted. We need to formalize the total reward over time. This is called the return.

The simplest case: the interaction naturally breaks into episodes — sequences that end in a terminal state. A game of chess is one episode. One maze traversal is one episode. For episodic tasks, the return is:

Gt = Rt+1 + Rt+2 + Rt+3 + … + RT

where T is the final time step. But what about tasks that never end? A thermostat runs forever. A stock-trading agent has no final day. These are continuing tasks, and the simple sum above could blow up to infinity. We need a way to keep the return finite — that's where discounting enters (next chapter).

Gt = Rt+1 + γRt+2 + γ²Rt+3 + … = ∑k=0 γk Rt+k+1
Episodic vs continuing: If the task has natural endings (games, maze runs, robot trials), it's episodic. If it goes on indefinitely (process control, portfolio management), it's continuing. The discounted return formula handles both cases — for episodic tasks, we can set γ = 1 since the sum is finite.

A crucial recursive property of the return that will power everything in later chapters:

Gt = Rt+1 + γ Gt+1

The return from now equals the immediate reward plus the discounted return from the next step. This simple recursion is the seed of the Bellman equation.

Why can't we simply sum all future rewards (without discounting) for continuing tasks?

Chapter 4: The Discount Rate

The discount factor γ (gamma) is a number between 0 and 1 that controls how far into the future the agent looks. When γ = 0, the agent is completely myopic — it only cares about the very next reward. When γ = 1, it cares about all future rewards equally. Values in between give a smooth tradeoff.

Think of γ as a "patience knob." A low γ means "I want my reward now." A high γ means "I'm willing to wait for a bigger payoff later." Most practical RL uses γ between 0.9 and 0.99.

γ as a patience knob: γ = 0 is a toddler grabbing the nearest toy. γ = 0.99 is an investor choosing long-term growth over quick profits. The discount rate is perhaps the most important hyperparameter in RL.
Discount Factor Explorer

Drag γ to see how future rewards get weighted. Each bar shows γk — the weight of the reward k steps in the future.

Gamma γ0.90

With γ = 0.9, a reward 10 steps away is worth 0.910 ≈ 0.35 of its face value. With γ = 0.5, that same reward is worth 0.510 ≈ 0.001 — nearly worthless. The choice of γ fundamentally shapes the agent's strategy.

γHorizonBehavior
01 stepCompletely myopic. Grabs nearest reward.
0.5~2 stepsShort-sighted. Ignores anything beyond a couple steps.
0.9~10 stepsModerate foresight. Common in practice.
0.99~100 stepsFar-sighted. Considers long-term consequences.
1.0Infinite horizon. Only works for episodic tasks.
An agent with γ = 0 will:

Chapter 5: Policies

We've defined states, actions, rewards, and returns. Now the central question: how does the agent decide what to do? The answer is a policy. A policy π is a mapping from states to actions (or to probabilities over actions). It completely specifies the agent's behavior.

A deterministic policy assigns a single action to each state: π(s) = a. "In state 3, always go left." A stochastic policy assigns probabilities: π(a|s) = probability of taking action a in state s. "In state 3, go left 70% of the time, right 30%."

π(a | s) = Pr{At = a | St = s}
Why stochastic? Deterministic policies seem simpler, but stochastic policies are more general. In poker, a deterministic bluffing strategy is exploitable — your opponent learns your pattern. A stochastic policy that bluffs randomly is much harder to beat. Randomness can be optimal.
Policy Visualizer

A 3x3 grid. Each cell shows the policy as arrows. Toggle between deterministic and stochastic to see the difference.

The entire goal of RL is to find a good policy — one that maximizes expected return. Everything we build from here (value functions, Bellman equations, optimization algorithms) serves this single purpose: finding the best policy.

What does π(a|s) represent?

Chapter 6: Value Functions

A policy tells the agent what to do. But how do we know if a policy is good? We need a way to evaluate states and actions under a given policy. This is what value functions do.

The state-value function vπ(s) tells us the expected return starting from state s and following policy π thereafter. Intuitively: "how good is it to be in state s if I follow this policy?"

vπ(s) = Eπ[Gt | St = s] = Eπ[∑k=0 γk Rt+k+1 | St = s]

The action-value function qπ(s, a) adds one more dimension — it tells us the expected return of taking action a in state s and then following π:

qπ(s, a) = Eπ[Gt | St = s, At = a]
v vs q: The state-value function vπ answers "how good is this state?" The action-value function qπ answers "how good is this action in this state?" Both are expectations of the return — they just condition on different things. The q-function is what Q-learning (Chapter 6) is named after.
Value Function on a Grid

A 4x4 grid with a goal (green) and a pit (red). Values show how good each cell is under a random policy. Brighter = higher value.

Gamma γ0.90

Notice how states near the goal have high values and states near the pit have low (or negative) values. The value function captures the long-term desirability of states — not just the immediate reward, but the entire future trajectory.

What does vπ(s) measure?

Chapter 7: The Bellman Equation

Here's the insight that makes everything computable. Remember the recursive property of returns: Gt = Rt+1 + γGt+1. If we take the expected value of both sides, we get a relationship between the value of a state and the values of its successor states. This is the Bellman equation:

vπ(s) = ∑a π(a|s) ∑s', r p(s', r | s, a) [r + γ vπ(s')]

Read it aloud: the value of state s equals the weighted sum over all actions (weighted by the policy) of the weighted sum over all possible next states and rewards (weighted by the dynamics) of the immediate reward plus the discounted value of the next state.

This equation says: "I don't need to simulate infinite trajectories to know a state's value. I just need to look one step ahead and use the values of my neighbors." It turns an infinite-horizon problem into a local computation.

The Bellman equation in plain English: "The value of where I am equals the average of what I get immediately plus the discounted value of where I end up." It's a consistency condition — if the values are correct, this equation must hold for every state simultaneously.

We can visualize this with a backup diagram. The open circle is a state, the filled circle is an action, and the arrows show the flow of the computation: from state to action (policy), from action to next state (dynamics), and back up (the "backup").

Interactive Bellman Backup

Click any state (circle) in the grid to see its Bellman backup computation. The selected state highlights, and the backup diagram shows how its value is computed from neighbors.

The Bellman equation is the foundation of almost every RL algorithm. Dynamic programming (Ch 4) solves it exactly. Monte Carlo (Ch 5) and TD learning (Ch 6) approximate it from experience. Every method in this book traces back to this one equation.

The Bellman equation relates the value of a state to:

Chapter 8: Optimal Value Functions

So far we've evaluated a given policy. But what we really want is the best policy. A policy π is better than or equal to policy π' if its expected return is greater or equal in every state. There is always at least one policy that is better than or equal to all others — the optimal policy π*.

The optimal state-value function v* gives the maximum value achievable in each state over all policies:

v*(s) = maxπ vπ(s)   for all s ∈ S

The optimal action-value function q* gives the maximum action-value:

q*(s, a) = maxπ qπ(s, a)   for all s ∈ S, a ∈ A(s)

The Bellman optimality equation for v* replaces the policy averaging with a max — the optimal policy always picks the best action:

v*(s) = maxas', r p(s', r | s, a) [r + γ v*(s')]

And for q*:

q*(s, a) = ∑s', r p(s', r | s, a) [r + γ maxa' q*(s', a')]
From sum to max: Compare the Bellman equation (sum over π) with the Bellman optimality equation (max over a). The only difference is that the optimal agent doesn't average over actions — it picks the best one. This tiny change is the difference between evaluation and optimization.

If we could solve the Bellman optimality equation, we'd have v* and could read off the optimal policy: in each state, pick the action that achieves the max. The problem? For large state spaces, solving it directly is intractable. That's what the rest of the book is about.

How does the Bellman optimality equation differ from the regular Bellman equation?

Chapter 9: Bellman Optimality in Action

Let's see the Bellman optimality equation at work. Below is a 4x4 gridworld. The agent starts in the top-left and wants to reach the goal in the bottom-right. Each step costs -1 (to encourage finding the shortest path). We can compute v* for every cell and then derive the optimal policy by following the greedy action in each state.

Optimal Gridworld

Click Iterate to run one sweep of value iteration (Bellman optimality updates). Watch the values converge and the optimal policy emerge. The arrows show the greedy policy: the best action in each state given current values.

Gamma γ0.90
Iteration: 0 | Max change: —
Observe: After convergence, the values form a gradient — increasing as you approach the goal. The greedy policy arrows point along the shortest path. This is exactly what the Bellman optimality equation guarantees: once the values are correct, acting greedily is optimal.
In the converged gridworld, the greedy policy (following arrows derived from v*) gives:

Chapter 10: Summary & Connections

This chapter formalized the reinforcement learning problem as a Markov Decision Process. We defined the agent-environment interface, the dynamics function, policies, value functions, and two sets of fundamental equations: the Bellman equations (for evaluating a policy) and the Bellman optimality equations (for finding the best policy).

MDP Definition
States, Actions, p(s',r|s,a)
Evaluation
Bellman eq. → vπ(s)
Optimization
Bellman optimality → v*(s), π*

The Bellman optimality equation gives us the target — what optimal values look like. But solving it directly requires knowing the dynamics p(s',r|s,a) and sweeping over all states. For anything beyond small toy problems, this is intractable. The rest of the book develops practical methods:

ApproachChapterKey Idea
Dynamic ProgrammingCh 4Iteratively apply Bellman updates (requires model)
Monte CarloCh 5Estimate values from sampled episodes
TD LearningCh 6Bootstrap from one step of experience
n-step MethodsCh 7Bridge between MC and TD
PlanningCh 8Learn a model, then plan with it
The big picture: The MDP framework is the common language. Every RL method in the book is a different strategy for solving or approximating the Bellman equations — either with a model (DP, planning) or without one (MC, TD). Chapter 4 starts with the cleanest approach: dynamic programming.
Why can't we solve the Bellman optimality equation directly for real-world problems?