The mathematical framework behind every decision-making AI — from game-playing agents to robotic navigation to how LLMs choose their next token.
Imagine you're a robot in a grid world. Each cell has a reward: some cells are gold (+1), some are pits (-1), and every step you take costs a little energy (-0.04). Your goal: find the best path from any starting cell to the gold, while avoiding the pit.
This sounds simple, but it gets tricky fast. What if your wheels slip? What if the grid is enormous? What if rewards are far away and you need to plan many steps ahead? You need a framework for thinking about sequential decisions under uncertainty. That framework is the Markov Decision Process (MDP).
Click any empty cell to place the robot, then watch it navigate toward the goal while avoiding the pit. The robot follows the optimal policy.
An MDP starts with two ingredients. States are all the situations the agent can be in. In our grid world, each cell is a state. Actions are what the agent can do from each state. In the grid, the actions are: up, down, left, right.
The classic grid world is a 4×3 grid with 11 accessible cells (one cell is a wall). The agent starts somewhere and must reach the goal cell at (3,2) while avoiding the pit at (3,1). The wall at (1,1) is impassable.
Hover over cells to see their coordinates. The dark cell is the wall. Green = goal. Red = pit.
| Symbol | Meaning | Grid Example |
|---|---|---|
| S | Set of all states | 11 accessible grid cells |
| A | Set of all actions | {Up, Down, Left, Right} |
| s0 | Start state | Any non-terminal cell |
| S+ | Terminal states | Goal (3,2) and Pit (3,1) |
Here's where MDPs get interesting. When you tell the robot "go up," it doesn't always go up. Real robots have noisy actuators: wheels slip, joints wobble, wind blows. We model this with a transition function P(s'|s,a): the probability of ending up in state s' given that you're in state s and take action a.
In our grid world, the standard model is: 80% of the time, the action works as intended. 10% of the time, you slip left (perpendicular). 10% of the time, you slip right. If you'd hit a wall, you stay in place. This stochasticity is what makes planning hard — and what makes MDPs powerful.
[current_state, action, next_state]. For our 4×3 grid with 11 states and 4 actions: T is [11, 4, 11] = 484 entries. Each slice T[s, a, :] is a probability distribution over next states (sums to 1.0). For a robot with 1000 discretized states and 8 actions: T is [1000, 8, 1000] = 8 million entries. And that's just a coarse discretization — a continuous-state robot needs function approximation because you can't store the full tensor.Python # Transition tensor for our 4x3 grid world import numpy as np T = np.zeros((11, 4, 11)) # [S, A, S'] = 484 entries # Example: from state 0, action=Up (noise=0.1) T[0, UP, 4] = 0.8 # intended: move to (0,1) T[0, UP, 0] = 0.1 # slip left: hit west wall, stay T[0, UP, 1] = 0.1 # slip right: move to (1,0) # Every row sums to 1.0 assert np.allclose(T[0, UP, :].sum(), 1.0)
Click a cell, then click an action. Watch the probability fan: 80% intended, 10% slip left, 10% slip right. Adjust the noise to see how stochasticity changes things.
Let's trace one concrete transition. The agent is at state (2,0), which is the bottom row, third cell from the left. It chooses action Up with noise = 0.1:
Transition trace # Agent at (2,0), action = Up, noise = 10% # Intended (80%): move up to (2,1) T[(2,0), Up, (2,1)] = 0.80 # Slip left — perpendicular (10%): move left to (1,0) # But (1,1) is a WALL! So we try (1,0) — that's valid. T[(2,0), Up, (1,0)] = 0.10 # Slip right — perpendicular (10%): move right to (3,0) T[(2,0), Up, (3,0)] = 0.10 # Sum: 0.80 + 0.10 + 0.10 = 1.00 ✓ # Every T[s, a, :] row must sum to exactly 1.0
The reward function R(s) tells the agent what's good and what's bad. In our grid world: the goal cell gives +1, the pit gives -1, and every other step gives -0.04. That small negative reward for each step is crucial — it encourages the agent to reach the goal quickly rather than wandering forever.
The step penalty is a design choice with deep consequences. Make it too small, and the agent dawdles. Make it too large (close to -1), and the agent might prefer to jump into the pit immediately rather than risk a long journey. The reward function shapes behavior.
Adjust the step cost and watch how the reward landscape changes. The color encodes reward intensity: green = positive, red = negative.
An agent doesn't just care about the next reward — it cares about the total reward it will collect over its entire journey. We call this the return G. But should a reward 100 steps from now count the same as a reward right now?
Usually not. We introduce a discount factor γ (gamma), between 0 and 1, that shrinks future rewards. A reward k steps away is multiplied by γk. With γ = 0.9, a reward 10 steps away is worth only 0.910 ≈ 0.35 of its face value.
A fixed reward of +1 at each future step. Watch how γ determines what the agent "sees" into the future. Bars show the discounted value at each step.
| γ | Effective Horizon | Behavior |
|---|---|---|
| 0.5 | ~2 steps | Extremely myopic, grabs nearest reward |
| 0.9 | ~10 steps | Short-range planner |
| 0.99 | ~100 steps | Medium-range, good for most grid worlds |
| 0.999 | ~1000 steps | Long-range, needed for complex tasks |
| 1.0 | ∞ | May diverge for continuing tasks |
The value function V(s) answers a fundamental question: "if I'm in state s and I act optimally from here on, what total (discounted) return can I expect?" A high V(s) means the state is desirable. A low V(s) means it's dangerous or far from the goal.
The value function satisfies a beautiful recursive relationship called the Bellman equation. The value of a state equals the best action's immediate reward plus the discounted expected value of the next state. This is the foundation of all MDP solving algorithms.
Let's compute V* for cell (2,1) by hand. Assume γ = 0.9, step cost = −0.04, noise = 0.1, and that the neighboring values have already converged: V*(2,2) = 0.78, V*(3,1) = −1.0 (pit!), V*(1,0) = 0.40, V*(2,0) = 0.46.
Bellman backup for (2,1) # Action = Up → to (2,2) with prob 0.8 Q(Up) = 0.8 * (-0.04 + 0.9 * 0.78) # intended: (2,2) + 0.1 * (-0.04 + 0.9 * V(1,1)) # slip left: wall → stay + 0.1 * (-0.04 + 0.9 * (-1.0)) # slip right: PIT! = 0.8 * 0.662 + 0.1 * V_stay + 0.1 * (-0.94) # Action = Right → toward (3,1) which is the PIT Q(Right) = 0.8 * (-0.04 + 0.9 * (-1.0)) + ... # Very negative! Avoid this action. # V*(2,1) = max over all 4 actions # The best action is Up → V*(2,1) ≈ 0.53
Notice how the pit at (3,1) drags down the value of action Right. The optimal policy will steer away from the pit, even though going right is geometrically closer to the goal. This is the Bellman equation at work: it accounts for risk.
Click a cell to see how its value is computed from its neighbors. The Bellman backup "looks ahead" one step in each direction, weighted by transition probabilities.
We defined the return as Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... and the value function as Vπ(s) = Eπ[Gt | St = s]. An optimal agent picks the best action in every state.
Your task: Starting from the definition of V*(s) = maxπ Vπ(s), derive the recursive Bellman optimality equation: V*(s) = maxa ∑s' P(s'|s,a) [R(s,a,s') + γ V*(s')].
Full derivation:
1. Start with V*(s) = maxπ Eπ[Gt | St = s]. An optimal policy must pick the best action at each step, so:
V*(s) = maxa E[Gt | St = s, At = a]
2. Decompose the return: Gt = Rt+1 + γGt+1. Substituting:
V*(s) = maxa E[Rt+1 + γGt+1 | St = s, At = a]
3. Expand the expectation over the transition distribution P(s'|s,a):
V*(s) = maxa ∑s' P(s'|s,a) [R(s,a,s') + γ E[Gt+1 | St+1 = s']]
4. Under the optimal policy, E[Gt+1 | St+1 = s'] = V*(s'). Therefore:
V*(s) = maxa ∑s' P(s'|s,a) [R(s,a,s') + γ V*(s')]
The key insight: The derivation relies on the principle of optimality: an optimal policy from s must also be optimal from whatever s' it transitions to. This allows us to replace E[Gt+1] with V*(s'), making the equation self-referential (recursive). The "max" outside the sum means we pick the action BEFORE the stochastic transition resolves.
The recursion terminates because of two anchors: (1) Terminal states have fixed values (V*(goal) = +1, V*(pit) = -1) with no further transitions — they're the base cases. (2) The discount factor γ < 1 ensures that even in non-terminating tasks, the contribution of future states shrinks geometrically, so the infinite sum converges. Together, these mean the Bellman equation is a system of |S| simultaneous equations with |S| unknowns, and it has a unique solution (provable via the Banach fixed-point theorem). Without terminal states AND with γ = 1, the recursion genuinely doesn't resolve — values can be infinite or undefined.
We know the Bellman equation defines the optimal values, but how do we compute them? Value iteration is the workhorse algorithm. Start by initializing all values to zero (except terminals). Then, repeatedly apply the Bellman equation to every state: update each cell's value based on its neighbors' values. Repeat until the values converge.
Watch the magic below: values start at zero everywhere, then gradually propagate outward from the goal. Each iteration, the "wave" of positive values reaches one cell further. After convergence, every cell knows its optimal expected return.
Python def value_iteration(S, A, P, R, gamma, theta=1e-6): V = {s: 0 for s in S} # initialize all values to 0 while True: delta = 0 for s in S: if is_terminal(s): continue v = V[s] V[s] = max( # best action sum(P(s,a,s2) * (R(s,a,s2) + gamma * V[s2]) for s2 in S) for a in A ) delta = max(delta, abs(v - V[s])) if delta < theta: # converged! break return V
The inner loop can be expressed as a single matrix operation. For each action a, the expected value is R[:,a] + γ · T[s,a,:] @ V. The new V is the max across all actions. In NumPy:
Python import numpy as np # T shape: [|S|, |A|, |S|] — transition probabilities # R shape: [|S|, |A|] — expected reward for (state, action) # V shape: [|S|] — current value estimates # One Bellman backup in vectorized NumPy: Q = R + gamma * (T @ V) # shape [|S|, |A|] — Q-values V_new = np.max(Q, axis=1) # shape [|S|] — best action per state # For our 4x3 grid (11 states, 4 actions): # T @ V: [11, 4, 11] @ [11] → [11, 4] (expected next-state value) # R + gamma * (T @ V): [11, 4] (Q-values for each state-action) # np.max(..., axis=1): [11] (best Q per state = new V)
Value iteration applies the Bellman operator T repeatedly: Vk+1 = T[Vk], where T[V](s) = maxa ∑s' P(s'|s,a)[R + γV(s')]. We claim this converges to V* regardless of initialization.
Your task: Show that T is a γ-contraction in the max-norm: ||T[V] − T[U]||∞ ≤ γ ||V − U||∞. Then explain why this guarantees convergence.
Full derivation:
1. For any state s:
|T[V](s) − T[U](s)| = |maxa ∑s' P(s'|s,a)[R + γV(s')] − maxa ∑s' P(s'|s,a)[R + γU(s')]|
2. Apply |max f − max g| ≤ max |f − g|:
≤ maxa |∑s' P(s'|s,a)[R + γV(s') − R − γU(s')]|
= maxa γ |∑s' P(s'|s,a)[V(s') − U(s')]|
3. Triangle inequality + probabilities sum to 1:
≤ maxa γ ∑s' P(s'|s,a) |V(s') − U(s')|
≤ γ · ||V − U||∞ · ∑s' P(s'|s,a)
= γ · ||V − U||∞
4. Since this holds for all s: ||T[V] − T[U]||∞ ≤ γ ||V − U||∞
5. By the Banach fixed-point theorem, any γ-contraction on a complete metric space has a unique fixed point, and repeated application converges to it at rate O(γk).
The key insight: The discount factor γ < 1 is doing double duty: it makes returns finite AND it makes the Bellman operator a contraction. With γ = 1, the operator is merely non-expansive — convergence is not guaranteed (and indeed fails for cyclic MDPs with no terminal states).
The return is Gt = ∑k=0∞ γk Rt+k+1. If rewards are bounded (|R| ≤ Rmax), show that Gt is always finite when γ < 1, and compute its maximum possible value.
Your task: Prove |Gt| ≤ Rmax / (1 − γ) using the geometric series.
Full derivation:
|Gt| = |∑k=0∞ γk Rt+k+1| ≤ ∑k=0∞ γk |Rt+k+1| ≤ ∑k=0∞ γk Rmax = Rmax / (1 − γ)
For our grid world: Rmax = 1, γ = 0.9, so |Gt| ≤ 1/(1−0.9) = 10. No state value can exceed 10 or go below -10. With γ = 0.99: |Gt| ≤ 100.
With γ = 1: ∑k=0∞ 1k Rmax = ∞. The return is unbounded. For a cyclic MDP that never terminates and gives constant reward, G = ∞ or −∞. The value function is undefined.
The key insight: γ < 1 is not just a "preference for sooner rewards" — it's a mathematical necessity that guarantees the entire MDP framework is well-defined. Without it, V*(s) might not exist.
python def value_iteration(states, actions, transitions, rewards, gamma, theta=1e-6): V = {s: 0.0 for s in states} terminals = {s for s in states if not any((s, a, sp) in transitions for a in actions for sp in states)} while True: delta = 0.0 for s in states: if s in terminals: continue v_old = V[s] # Bellman backup: max over actions best = float('-inf') for a in actions: q = 0.0 for sp in states: p = transitions.get((s, a, sp), 0.0) if p > 0: r = rewards.get((s, a, sp), 0.0) q += p * (r + gamma * V[sp]) if q > best: best = q V[s] = best delta = max(delta, abs(v_old - best)) if delta < theta: break # Extract policy policy = {} for s in states: if s in terminals: continue best_a, best_val = None, float('-inf') for a in actions: q = sum(transitions.get((s,a,sp), 0) * (rewards.get((s,a,sp), 0) + gamma * V[sp]) for sp in states) if q > best_val: best_val, best_a = q, a policy[s] = best_a return V, policy
Value iteration gives us V*(s) for every state. But the agent needs to know what to do, not just how good each state is. The optimal policy π*(s) tells the agent which action to take in each state. Extracting it from V* is simple: for each state, pick the action that maximizes the expected value of the next state.
Below, watch the policy arrows emerge after value iteration converges. Each arrow points in the direction of the best action. Near the pit, arrows steer away. Near the goal, arrows point toward it. The policy is a complete strategy — it tells the agent what to do in every situation.
Click "Converge Values" to run value iteration to completion, then "Show Policy" to extract and display the optimal policy arrows. Adjust γ to see how it affects the policy.
State: Most successful approaches use a feature vector, NOT raw price (non-stationary!). Typical: [position_fraction, recent_returns (5-20 bars), volatility_estimate, spread, time_of_day]. Discretized into bins or fed to a neural network for function approximation. |S| is effectively infinite → model-free RL or fitted Q-iteration.
Actions: Discrete actions (e.g., target_position in {-1, -0.5, 0, 0.5, 1} of capital) dominate because continuous actions make exploration harder. The 0.1% transaction cost naturally penalizes excessive trading.
Reward: Raw P&L creates risk-seeking behavior (agent bets big hoping for rare wins). The field uses differential Sharpe ratio — a reward at each step that approximates the contribution to the overall Sharpe ratio. This balances return vs volatility incrementally. Alternatively: reward = return − λ · |drawdown| with λ tuned to risk appetite.
Gamma: Typically 0.95–0.99, but the real issue is non-stationarity. Markets shift regimes. A policy trained on bull market data fails in a crash. Practitioners retrain frequently or use meta-learning. Some use γ = 0 (pure bandit) for HFT since the "next state" is essentially unpredictable at millisecond scales.
Critical pitfall: The Markov property is violated! Today's price depends on yesterday's momentum, sentiment, order flow history. You must encode enough history in the state to make it approximately Markov (hence LSTM/attention in state encoders).
Both are recursive equations where the current answer depends on the next-step answer. The Bellman equation propagates value backward from the future. The Kalman filter propagates information forward from measurements. Both solve their respective problems by breaking an intractable global optimization into local one-step operations chained together.
Can you spot this pattern in other algorithms? (Hint: dynamic programming, backpropagation, message passing in graphical models all share this "local computation, global solution" structure.)
The grid world is a toy, but MDPs are everywhere. Any problem with sequential decisions under uncertainty can be framed as an MDP. The concepts you've learned — states, actions, transitions, rewards, values, and policies — are the language of decision-making in AI.
| Domain | States | Actions | Rewards |
|---|---|---|---|
| Robotics | Joint angles, positions | Motor torques | Task completion |
| Game AI | Board/screen config | Moves, buttons | Score, win/loss |
| LLM Tokens | Generated text so far | Next token choice | Human preference |
| Finance | Portfolio state | Buy/sell/hold | Returns |
| Dialogue | Conversation history | Response choices | User satisfaction |
MDPs assume the agent can fully observe its state. In the real world, you often can't. A robot might not know its exact position; a poker player can't see the opponent's cards. This leads to Partially Observable MDPs (POMDPs), which add a belief state over possible states.
MDPs also assume we know the transition function P and reward function R. When we don't, we enter the world of Reinforcement Learning (RL), where the agent must learn these through trial and error.
Q-learning replaces the known-model sum (∑s' P · [...]) with a sample from real experience. Instead of computing the expectation analytically, we observe one (s, a, r, s') tuple and nudge our Q-value toward it. The Bellman equation becomes a stochastic approximation. Same fixed point, different path to get there.
What about POMDPs? When the agent can't observe s directly, it maintains a belief distribution over states — the "belief MDP" where states are probability distributions. How does this connect to the Bayes filter? (Hint: the belief update IS a Bayes filter.)
You now understand how to frame and solve decision problems. Every game, every robot, every language model is making decisions — and now you know the math behind them.