Formalizing the reinforcement learning problem: states, actions, rewards, and the equations that tie them together.
In Chapter 2 you pulled slot machine arms and collected rewards. But nothing changed — the machines had no memory, no state. Now imagine something richer: a robot navigating a warehouse, a chess player pondering the board, a thermostat adjusting temperature. In each case, what the agent should do depends on where it is.
This is the fundamental loop of reinforcement learning. At each time step t, the agent observes a state St, takes an action At, and the environment responds with a reward Rt+1 and a new state St+1. Then the cycle repeats. Every RL problem — games, robotics, recommendation systems — fits this loop.
Watch the interaction cycle. Click Step to advance one time step, or Auto to run continuously.
The loop from Chapter 0 is intuitive, but we need mathematics to reason about it precisely. A Markov Decision Process (MDP) is a formal description of that loop. It's defined by a set of states S, a set of actions A, and a single function that captures all of the environment's behavior:
This is the dynamics function. It gives the probability of transitioning to state s' and receiving reward r, given the current state s and action a. This one function specifies everything about the environment — the transition probabilities, the expected rewards, everything.
The word Markov means that the future depends only on the current state, not on the history of how you got there. The state St captures all relevant information. A chess position is Markov: you don't need to remember the sequence of moves that produced it, only the current board.
From p(s', r | s, a) we can derive anything we need. For example, the state-transition probabilities:
And the expected reward for a state-action pair:
How do we tell an agent what we want it to achieve? In RL, the answer is elegantly simple: through a single scalar number — the reward. At every time step, the environment sends the agent a reward signal. The agent's entire purpose is to maximize the total reward it accumulates over time.
This is the reward hypothesis: all goals and purposes can be expressed as the maximization of the expected value of a cumulative scalar reward signal. Want a robot to walk? Give +1 for each second standing, -100 for falling. Want a chess agent? +1 for a win, -1 for a loss, 0 otherwise.
Notice the subtle art of reward design. The reward should signal what you want achieved, not how to achieve it. Giving a chess agent +1 for capturing pieces might lead to strategies that capture pieces recklessly instead of winning. The reward for winning the game is cleaner — it lets the agent discover its own strategy.
A simple 5-cell grid. The agent starts on the left. Click cells to set their reward, then watch how different rewards change behavior.
The agent wants to maximize reward, but reward over what? Not just the immediate reward — an agent that grabs the nearest cookie ignoring the cake behind it is short-sighted. We need to formalize the total reward over time. This is called the return.
The simplest case: the interaction naturally breaks into episodes — sequences that end in a terminal state. A game of chess is one episode. One maze traversal is one episode. For episodic tasks, the return is:
where T is the final time step. But what about tasks that never end? A thermostat runs forever. A stock-trading agent has no final day. These are continuing tasks, and the simple sum above could blow up to infinity. We need a way to keep the return finite — that's where discounting enters (next chapter).
A crucial recursive property of the return that will power everything in later chapters:
The return from now equals the immediate reward plus the discounted return from the next step. This simple recursion is the seed of the Bellman equation.
The discount factor γ (gamma) is a number between 0 and 1 that controls how far into the future the agent looks. When γ = 0, the agent is completely myopic — it only cares about the very next reward. When γ = 1, it cares about all future rewards equally. Values in between give a smooth tradeoff.
Think of γ as a "patience knob." A low γ means "I want my reward now." A high γ means "I'm willing to wait for a bigger payoff later." Most practical RL uses γ between 0.9 and 0.99.
Drag γ to see how future rewards get weighted. Each bar shows γk — the weight of the reward k steps in the future.
With γ = 0.9, a reward 10 steps away is worth 0.910 ≈ 0.35 of its face value. With γ = 0.5, that same reward is worth 0.510 ≈ 0.001 — nearly worthless. The choice of γ fundamentally shapes the agent's strategy.
| γ | Horizon | Behavior |
|---|---|---|
| 0 | 1 step | Completely myopic. Grabs nearest reward. |
| 0.5 | ~2 steps | Short-sighted. Ignores anything beyond a couple steps. |
| 0.9 | ~10 steps | Moderate foresight. Common in practice. |
| 0.99 | ~100 steps | Far-sighted. Considers long-term consequences. |
| 1.0 | ∞ | Infinite horizon. Only works for episodic tasks. |
We've defined states, actions, rewards, and returns. Now the central question: how does the agent decide what to do? The answer is a policy. A policy π is a mapping from states to actions (or to probabilities over actions). It completely specifies the agent's behavior.
A deterministic policy assigns a single action to each state: π(s) = a. "In state 3, always go left." A stochastic policy assigns probabilities: π(a|s) = probability of taking action a in state s. "In state 3, go left 70% of the time, right 30%."
A 3x3 grid. Each cell shows the policy as arrows. Toggle between deterministic and stochastic to see the difference.
The entire goal of RL is to find a good policy — one that maximizes expected return. Everything we build from here (value functions, Bellman equations, optimization algorithms) serves this single purpose: finding the best policy.
A policy tells the agent what to do. But how do we know if a policy is good? We need a way to evaluate states and actions under a given policy. This is what value functions do.
The state-value function vπ(s) tells us the expected return starting from state s and following policy π thereafter. Intuitively: "how good is it to be in state s if I follow this policy?"
The action-value function qπ(s, a) adds one more dimension — it tells us the expected return of taking action a in state s and then following π:
A 4x4 grid with a goal (green) and a pit (red). Values show how good each cell is under a random policy. Brighter = higher value.
Notice how states near the goal have high values and states near the pit have low (or negative) values. The value function captures the long-term desirability of states — not just the immediate reward, but the entire future trajectory.
Here's the insight that makes everything computable. Remember the recursive property of returns: Gt = Rt+1 + γGt+1. If we take the expected value of both sides, we get a relationship between the value of a state and the values of its successor states. This is the Bellman equation:
Read it aloud: the value of state s equals the weighted sum over all actions (weighted by the policy) of the weighted sum over all possible next states and rewards (weighted by the dynamics) of the immediate reward plus the discounted value of the next state.
This equation says: "I don't need to simulate infinite trajectories to know a state's value. I just need to look one step ahead and use the values of my neighbors." It turns an infinite-horizon problem into a local computation.
We can visualize this with a backup diagram. The open circle is a state, the filled circle is an action, and the arrows show the flow of the computation: from state to action (policy), from action to next state (dynamics), and back up (the "backup").
Click any state (circle) in the grid to see its Bellman backup computation. The selected state highlights, and the backup diagram shows how its value is computed from neighbors.
The Bellman equation is the foundation of almost every RL algorithm. Dynamic programming (Ch 4) solves it exactly. Monte Carlo (Ch 5) and TD learning (Ch 6) approximate it from experience. Every method in this book traces back to this one equation.
So far we've evaluated a given policy. But what we really want is the best policy. A policy π is better than or equal to policy π' if its expected return is greater or equal in every state. There is always at least one policy that is better than or equal to all others — the optimal policy π*.
The optimal state-value function v* gives the maximum value achievable in each state over all policies:
The optimal action-value function q* gives the maximum action-value:
The Bellman optimality equation for v* replaces the policy averaging with a max — the optimal policy always picks the best action:
And for q*:
If we could solve the Bellman optimality equation, we'd have v* and could read off the optimal policy: in each state, pick the action that achieves the max. The problem? For large state spaces, solving it directly is intractable. That's what the rest of the book is about.
Let's see the Bellman optimality equation at work. Below is a 4x4 gridworld. The agent starts in the top-left and wants to reach the goal in the bottom-right. Each step costs -1 (to encourage finding the shortest path). We can compute v* for every cell and then derive the optimal policy by following the greedy action in each state.
Click Iterate to run one sweep of value iteration (Bellman optimality updates). Watch the values converge and the optimal policy emerge. The arrows show the greedy policy: the best action in each state given current values.
This chapter formalized the reinforcement learning problem as a Markov Decision Process. We defined the agent-environment interface, the dynamics function, policies, value functions, and two sets of fundamental equations: the Bellman equations (for evaluating a policy) and the Bellman optimality equations (for finding the best policy).
The Bellman optimality equation gives us the target — what optimal values look like. But solving it directly requires knowing the dynamics p(s',r|s,a) and sweeping over all states. For anything beyond small toy problems, this is intractable. The rest of the book develops practical methods:
| Approach | Chapter | Key Idea |
|---|---|---|
| Dynamic Programming | Ch 4 | Iteratively apply Bellman updates (requires model) |
| Monte Carlo | Ch 5 | Estimate values from sampled episodes |
| TD Learning | Ch 6 | Bootstrap from one step of experience |
| n-step Methods | Ch 7 | Bridge between MC and TD |
| Planning | Ch 8 | Learn a model, then plan with it |