The Complete Beginner's Path

Understand Markov
Decision Processes

The mathematical framework behind every decision-making AI — from game-playing agents to robotic navigation to how LLMs choose their next token.

Prerequisites: Basic probability + Grid intuition. That's it.
9
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: Why Decisions?

Imagine you're a robot in a grid world. Each cell has a reward: some cells are gold (+1), some are pits (-1), and every step you take costs a little energy (-0.04). Your goal: find the best path from any starting cell to the gold, while avoiding the pit.

This sounds simple, but it gets tricky fast. What if your wheels slip? What if the grid is enormous? What if rewards are far away and you need to plan many steps ahead? You need a framework for thinking about sequential decisions under uncertainty. That framework is the Markov Decision Process (MDP).

The core idea: An MDP captures everything about a decision problem: where you can be (states), what you can do (actions), what happens when you act (transitions), and what you get (rewards). Solve the MDP, and you know the optimal thing to do in every situation.
A Robot in a Grid

Click any empty cell to place the robot, then watch it navigate toward the goal while avoiding the pit. The robot follows the optimal policy.

Click a cell to start
Check: What makes a decision problem an MDP?

Chapter 1: States & Actions

An MDP starts with two ingredients. States are all the situations the agent can be in. In our grid world, each cell is a state. Actions are what the agent can do from each state. In the grid, the actions are: up, down, left, right.

The classic grid world is a 4×3 grid with 11 accessible cells (one cell is a wall). The agent starts somewhere and must reach the goal cell at (3,2) while avoiding the pit at (3,1). The wall at (1,1) is impassable.

States S
S = {(0,0), (1,0), (2,0), ... (3,2)} — 11 cells
Actions A
A = {Up, Down, Left, Right}
Terminal States
Goal (+1) and Pit (-1) end the episode
Markov property: The future depends only on the current state, not on how you got there. This is why it's called a Markov decision process. Your position in the grid tells you everything you need to decide your next move.
The Grid World

Hover over cells to see their coordinates. The dark cell is the wall. Green = goal. Red = pit.

Hover over a cell
SymbolMeaningGrid Example
SSet of all states11 accessible grid cells
ASet of all actions{Up, Down, Left, Right}
s0Start stateAny non-terminal cell
S+Terminal statesGoal (3,2) and Pit (3,1)
Check: What does the Markov property guarantee?

Chapter 2: Transitions — Stochastic Actions

Here's where MDPs get interesting. When you tell the robot "go up," it doesn't always go up. Real robots have noisy actuators: wheels slip, joints wobble, wind blows. We model this with a transition function P(s'|s,a): the probability of ending up in state s' given that you're in state s and take action a.

In our grid world, the standard model is: 80% of the time, the action works as intended. 10% of the time, you slip left (perpendicular). 10% of the time, you slip right. If you'd hit a wall, you stay in place. This stochasticity is what makes planning hard — and what makes MDPs powerful.

P(s' | s, a) — e.g., P(cell above | current cell, Up) = 0.8
Why stochastic? If actions were deterministic, planning would be trivial — just find the shortest path. Stochastic transitions force the agent to be robust: it must plan for the possibility that things go wrong. A good policy avoids danger even when it could slip toward it.
The transition tensor, concretely: The transition function T is a 3D array indexed by [current_state, action, next_state]. For our 4×3 grid with 11 states and 4 actions: T is [11, 4, 11] = 484 entries. Each slice T[s, a, :] is a probability distribution over next states (sums to 1.0). For a robot with 1000 discretized states and 8 actions: T is [1000, 8, 1000] = 8 million entries. And that's just a coarse discretization — a continuous-state robot needs function approximation because you can't store the full tensor.
Python
# Transition tensor for our 4x3 grid world
import numpy as np
T = np.zeros((11, 4, 11))   # [S, A, S'] = 484 entries

# Example: from state 0, action=Up (noise=0.1)
T[0, UP, 4] = 0.8    # intended: move to (0,1)
T[0, UP, 0] = 0.1    # slip left: hit west wall, stay
T[0, UP, 1] = 0.1    # slip right: move to (1,0)

# Every row sums to 1.0
assert np.allclose(T[0, UP, :].sum(), 1.0)
Transition Probabilities

Click a cell, then click an action. Watch the probability fan: 80% intended, 10% slip left, 10% slip right. Adjust the noise to see how stochasticity changes things.

Intended prob.0.80
Boundary rule: If an action would move the agent off the grid or into a wall, the agent stays in its current cell. The probability mass that would go off-grid gets "reflected" back to the current state. This is a common convention in grid-world MDPs.

Reading a transition row

Let's trace one concrete transition. The agent is at state (2,0), which is the bottom row, third cell from the left. It chooses action Up with noise = 0.1:

Transition trace
# Agent at (2,0), action = Up, noise = 10%

# Intended (80%): move up to (2,1)
T[(2,0), Up, (2,1)] = 0.80

# Slip left — perpendicular (10%): move left to (1,0)
# But (1,1) is a WALL! So we try (1,0) — that's valid.
T[(2,0), Up, (1,0)] = 0.10

# Slip right — perpendicular (10%): move right to (3,0)
T[(2,0), Up, (3,0)] = 0.10

# Sum: 0.80 + 0.10 + 0.10 = 1.00 ✓
# Every T[s, a, :] row must sum to exactly 1.0
Check: In the standard grid world, if you choose action "Up", what is the probability you actually move up?

Chapter 3: Rewards — What Drives Behavior

The reward function R(s) tells the agent what's good and what's bad. In our grid world: the goal cell gives +1, the pit gives -1, and every other step gives -0.04. That small negative reward for each step is crucial — it encourages the agent to reach the goal quickly rather than wandering forever.

The step penalty is a design choice with deep consequences. Make it too small, and the agent dawdles. Make it too large (close to -1), and the agent might prefer to jump into the pit immediately rather than risk a long journey. The reward function shapes behavior.

R(s) = +1 (goal) | -1 (pit) | -0.04 (all other cells)
Reward shaping: The step cost of -0.04 seems arbitrary, but it encodes a preference: "reach the goal soon." With R = 0 for non-terminal states, the agent wouldn't care about taking 5 steps vs 500. With R = -2, the agent might prefer the pit over a long path. The right reward function is a design art.
Reward Landscape

Adjust the step cost and watch how the reward landscape changes. The color encodes reward intensity: green = positive, red = negative.

Step cost-0.04
R(s) = -0.04: Mild urgency. Agent takes a safe but efficient route to the goal. The classic setting.
R(s) = -0.5: Extreme urgency. Agent rushes recklessly, risking the pit because lingering is too painful.
Check: Why do we add a small negative reward for each step?

Chapter 4: Returns & Discounting

An agent doesn't just care about the next reward — it cares about the total reward it will collect over its entire journey. We call this the return G. But should a reward 100 steps from now count the same as a reward right now?

Usually not. We introduce a discount factor γ (gamma), between 0 and 1, that shrinks future rewards. A reward k steps away is multiplied by γk. With γ = 0.9, a reward 10 steps away is worth only 0.910 ≈ 0.35 of its face value.

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ... = ∑k=0 γk Rt+k+1
γ controls patience. γ near 1 = far-sighted agent that plans far ahead. γ near 0 = myopic agent that grabs the nearest reward. γ = 1 can cause infinite returns in non-terminating tasks. γ = 0 means the agent only cares about the very next reward.
Discounting Explorer

A fixed reward of +1 at each future step. Watch how γ determines what the agent "sees" into the future. Bars show the discounted value at each step.

Discount γ0.90
Total return G = 10.00
Why discount? Three reasons: (1) Mathematical convenience — the sum converges. (2) Uncertainty — the farther we look, the less sure we are. (3) Preference — most agents (and humans!) prefer sooner rewards to later ones.
Discount = effective horizon. There's a handy rule of thumb: the effective planning horizon is roughly 1/(1−γ) steps. With γ = 0.9, the agent effectively plans about 10 steps ahead. With γ = 0.99, it plans 100 steps. With γ = 0.999, it plans 1000 steps. And γ = 1.0 means infinite horizon — which may not converge for tasks that never terminate. This is why choosing γ is really choosing how far ahead your agent thinks.
γEffective HorizonBehavior
0.5~2 stepsExtremely myopic, grabs nearest reward
0.9~10 stepsShort-range planner
0.99~100 stepsMedium-range, good for most grid worlds
0.999~1000 stepsLong-range, needed for complex tasks
1.0May diverge for continuing tasks
Check: With γ = 0.5, how much is a +1 reward worth if it's 3 steps away?

Chapter 5: Value Functions — How Good Is a State?

The value function V(s) answers a fundamental question: "if I'm in state s and I act optimally from here on, what total (discounted) return can I expect?" A high V(s) means the state is desirable. A low V(s) means it's dangerous or far from the goal.

The value function satisfies a beautiful recursive relationship called the Bellman equation. The value of a state equals the best action's immediate reward plus the discounted expected value of the next state. This is the foundation of all MDP solving algorithms.

V*(s) = maxas' P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
The Bellman equation is recursive: The value of a state depends on the values of its neighbors, which depend on the values of their neighbors, all the way to the terminal states. It's like asking "how far am I from the exit?" — the answer depends on how far your neighbors are.

A Worked Bellman Backup

Let's compute V* for cell (2,1) by hand. Assume γ = 0.9, step cost = −0.04, noise = 0.1, and that the neighboring values have already converged: V*(2,2) = 0.78, V*(3,1) = −1.0 (pit!), V*(1,0) = 0.40, V*(2,0) = 0.46.

Bellman backup for (2,1)
# Action = Up → to (2,2) with prob 0.8
Q(Up)  = 0.8 * (-0.04 + 0.9 * 0.78)   # intended: (2,2)
       + 0.1 * (-0.04 + 0.9 * V(1,1))  # slip left: wall → stay
       + 0.1 * (-0.04 + 0.9 * (-1.0))  # slip right: PIT!
       = 0.8 * 0.662 + 0.1 * V_stay + 0.1 * (-0.94)

# Action = Right → toward (3,1) which is the PIT
Q(Right) = 0.8 * (-0.04 + 0.9 * (-1.0)) + ...
         # Very negative! Avoid this action.

# V*(2,1) = max over all 4 actions
# The best action is Up → V*(2,1) ≈ 0.53

Notice how the pit at (3,1) drags down the value of action Right. The optimal policy will steer away from the pit, even though going right is geometrically closer to the goal. This is the Bellman equation at work: it accounts for risk.

Bellman Backup Visualizer

Click a cell to see how its value is computed from its neighbors. The Bellman backup "looks ahead" one step in each direction, weighted by transition probabilities.

Click any non-terminal cell
Goal Cell
V*(goal) = +1 (terminal, no future rewards)
↓ propagates outward
Nearby Cells
V*(near goal) ≈ γ × 1 − 0.04 ≈ 0.86
↓ propagates further
Far Cells
V* decreases with distance from goal
Check: What does the Bellman equation say about V*(s)?
🔨 Derivation Derive the Bellman Optimality Equation ✓ ATTEMPTED

We defined the return as Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... and the value function as Vπ(s) = Eπ[Gt | St = s]. An optimal agent picks the best action in every state.

Your task: Starting from the definition of V*(s) = maxπ Vπ(s), derive the recursive Bellman optimality equation: V*(s) = maxas' P(s'|s,a) [R(s,a,s') + γ V*(s')].

An optimal agent picks the action that maximizes expected return. So V*(s) = maxa E[Gt | St = s, At = a]. Now split Gt into the immediate reward plus future returns.
Gt = Rt+1 + γGt+1. So V*(s) = maxa E[Rt+1 + γGt+1 | St = s, At = a]. The expectation is over the stochastic transition P(s'|s,a).
If the agent acts optimally from s' onward, then E[Gt+1 | St+1 = s'] = V*(s'). This is the key recursive step — the future is itself an optimal value problem.

Full derivation:

1. Start with V*(s) = maxπ Eπ[Gt | St = s]. An optimal policy must pick the best action at each step, so:
V*(s) = maxa E[Gt | St = s, At = a]

2. Decompose the return: Gt = Rt+1 + γGt+1. Substituting:
V*(s) = maxa E[Rt+1 + γGt+1 | St = s, At = a]

3. Expand the expectation over the transition distribution P(s'|s,a):
V*(s) = maxas' P(s'|s,a) [R(s,a,s') + γ E[Gt+1 | St+1 = s']]

4. Under the optimal policy, E[Gt+1 | St+1 = s'] = V*(s'). Therefore:
V*(s) = maxas' P(s'|s,a) [R(s,a,s') + γ V*(s')]

The key insight: The derivation relies on the principle of optimality: an optimal policy from s must also be optimal from whatever s' it transitions to. This allows us to replace E[Gt+1] with V*(s'), making the equation self-referential (recursive). The "max" outside the sum means we pick the action BEFORE the stochastic transition resolves.

Checkpoint — Before you move on
In your own words: why is the Bellman equation recursive, and why does that recursion eventually "bottom out" (terminate)? What prevents infinite regress?
✓ Gate cleared
Model Answer

The recursion terminates because of two anchors: (1) Terminal states have fixed values (V*(goal) = +1, V*(pit) = -1) with no further transitions — they're the base cases. (2) The discount factor γ < 1 ensures that even in non-terminating tasks, the contribution of future states shrinks geometrically, so the infinite sum converges. Together, these mean the Bellman equation is a system of |S| simultaneous equations with |S| unknowns, and it has a unique solution (provable via the Banach fixed-point theorem). Without terminal states AND with γ = 1, the recursion genuinely doesn't resolve — values can be infinite or undefined.

Chapter 6: Value Iteration — The Algorithm

We know the Bellman equation defines the optimal values, but how do we compute them? Value iteration is the workhorse algorithm. Start by initializing all values to zero (except terminals). Then, repeatedly apply the Bellman equation to every state: update each cell's value based on its neighbors' values. Repeat until the values converge.

Watch the magic below: values start at zero everywhere, then gradually propagate outward from the goal. Each iteration, the "wave" of positive values reaches one cell further. After convergence, every cell knows its optimal expected return.

This is the showcase. Click "Step" to run one iteration at a time and watch values fill the grid. Or click "Run" to animate the full convergence. Try different γ values to see how the discount factor changes the value landscape.
Value Iteration — Interactive Grid World
Iteration: 0 | Max Δ: —
Discount γ0.90
Step cost-0.04
Noise0.10

The Algorithm

Python
def value_iteration(S, A, P, R, gamma, theta=1e-6):
    V = {s: 0 for s in S}      # initialize all values to 0
    while True:
        delta = 0
        for s in S:
            if is_terminal(s): continue
            v = V[s]
            V[s] = max(                # best action
                sum(P(s,a,s2) * (R(s,a,s2) + gamma * V[s2])
                     for s2 in S)
                for a in A
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:           # converged!
            break
    return V

Value Iteration as Matrix Operations

The inner loop can be expressed as a single matrix operation. For each action a, the expected value is R[:,a] + γ · T[s,a,:] @ V. The new V is the max across all actions. In NumPy:

Python
import numpy as np

# T shape: [|S|, |A|, |S|] — transition probabilities
# R shape: [|S|, |A|] — expected reward for (state, action)
# V shape: [|S|] — current value estimates

# One Bellman backup in vectorized NumPy:
Q = R + gamma * (T @ V)      # shape [|S|, |A|] — Q-values
V_new = np.max(Q, axis=1)   # shape [|S|] — best action per state

# For our 4x3 grid (11 states, 4 actions):
# T @ V: [11, 4, 11] @ [11] → [11, 4]  (expected next-state value)
# R + gamma * (T @ V): [11, 4]  (Q-values for each state-action)
# np.max(..., axis=1): [11]  (best Q per state = new V)
Policy iteration convergence: An alternative to value iteration is policy iteration, which alternates between evaluating a policy and improving it. It converges in at most |A||S| iterations (the total number of possible deterministic policies). For a 10-state, 4-action MDP: at most 410 = 1,048,576 iterations. In practice? Usually 3–5 iterations. Why so few? Each iteration strictly improves the policy (or proves it's optimal), and the jump to a better policy is usually large early on.
Check: What is the convergence criterion for value iteration?
🔨 Derivation Why Does Value Iteration Converge? (Contraction Mapping) ✓ ATTEMPTED

Value iteration applies the Bellman operator T repeatedly: Vk+1 = T[Vk], where T[V](s) = maxas' P(s'|s,a)[R + γV(s')]. We claim this converges to V* regardless of initialization.

Your task: Show that T is a γ-contraction in the max-norm: ||T[V] − T[U]|| ≤ γ ||V − U||. Then explain why this guarantees convergence.

||V − U|| = maxs |V(s) − U(s)|. It's the biggest disagreement between two value functions at any single state. You need to show that after one Bellman backup, the biggest disagreement shrinks by a factor of γ.
Key inequality: |maxa f(a) − maxa g(a)| ≤ maxa |f(a) − g(a)|. The maximizers might differ, but the difference in maxima is bounded by the worst-case difference. Apply this to T[V](s) − T[U](s).
For a fixed action a: |∑s' P(s'|s,a)[R + γV(s')] − ∑s' P(s'|s,a)[R + γU(s')]| = γ |∑s' P(s'|s,a)[V(s') − U(s')]|. The R terms cancel. The probability-weighted sum of differences is at most ||V − U|| (since probabilities sum to 1).

Full derivation:

1. For any state s:
|T[V](s) − T[U](s)| = |maxas' P(s'|s,a)[R + γV(s')] − maxas' P(s'|s,a)[R + γU(s')]|

2. Apply |max f − max g| ≤ max |f − g|:
≤ maxa |∑s' P(s'|s,a)[R + γV(s') − R − γU(s')]|
= maxa γ |∑s' P(s'|s,a)[V(s') − U(s')]|

3. Triangle inequality + probabilities sum to 1:
≤ maxa γ ∑s' P(s'|s,a) |V(s') − U(s')|
≤ γ · ||V − U|| · ∑s' P(s'|s,a)
= γ · ||V − U||

4. Since this holds for all s: ||T[V] − T[U]|| ≤ γ ||V − U||

5. By the Banach fixed-point theorem, any γ-contraction on a complete metric space has a unique fixed point, and repeated application converges to it at rate O(γk).

The key insight: The discount factor γ < 1 is doing double duty: it makes returns finite AND it makes the Bellman operator a contraction. With γ = 1, the operator is merely non-expansive — convergence is not guaranteed (and indeed fails for cyclic MDPs with no terminal states).

🔨 Derivation Why γ < 1 Makes Returns Finite ✓ ATTEMPTED

The return is Gt = ∑k=0 γk Rt+k+1. If rewards are bounded (|R| ≤ Rmax), show that Gt is always finite when γ < 1, and compute its maximum possible value.

Your task: Prove |Gt| ≤ Rmax / (1 − γ) using the geometric series.

Each term satisfies |γk Rt+k+1| ≤ γk Rmax. So |Gt| ≤ ∑k=0 γk Rmax = Rmaxk=0 γk.
k=0 γk = 1/(1 − γ) when |γ| < 1. This is the key: the infinite sum converges to a finite value.

Full derivation:

|Gt| = |∑k=0 γk Rt+k+1| ≤ ∑k=0 γk |Rt+k+1| ≤ ∑k=0 γk Rmax = Rmax / (1 − γ)

For our grid world: Rmax = 1, γ = 0.9, so |Gt| ≤ 1/(1−0.9) = 10. No state value can exceed 10 or go below -10. With γ = 0.99: |Gt| ≤ 100.

With γ = 1: ∑k=0 1k Rmax = ∞. The return is unbounded. For a cyclic MDP that never terminates and gives constant reward, G = ∞ or −∞. The value function is undefined.

The key insight: γ < 1 is not just a "preference for sooner rewards" — it's a mathematical necessity that guarantees the entire MDP framework is well-defined. Without it, V*(s) might not exist.

💥 Break-It Lab What Dies When You Remove MDP Components? ✓ ATTEMPTED
A working value iteration on the 4×3 grid world with γ=0.9, step cost=-0.04, noise=0.1. Toggle components off to see specific failure modes.
Set γ = 1.0 (no discounting) OFF
Failure mode: In this terminating MDP it still converges (terminal states anchor values), but remove terminals and you get infinite values. Watch how values become much larger — the agent "sees" infinitely far. In cyclic non-terminating MDPs, value iteration oscillates forever.
Remove reward signal (R=0 everywhere) OFF
Failure mode: All values collapse to zero. The agent has no gradient to follow — every action is equally "good." The policy becomes arbitrary. This demonstrates that the reward function IS the specification of the task.
Random policy (no optimization) OFF
Failure mode: Values drop significantly because the agent no longer avoids the pit or steers toward the goal. The "max" in the Bellman equation is replaced by an average over actions. States near the pit become strongly negative because the random policy stumbles into it frequently.
💻 Build It Implement Value Iteration from Scratch ✓ ATTEMPTED
You've seen value iteration in pseudocode. Now implement it for a generic MDP specified as dictionaries. The function should handle any MDP, not just grid worlds.
signature def value_iteration(states, actions, transitions, rewards, gamma, theta=1e-6): """ Args: states: list of state identifiers actions: list of action identifiers transitions: dict (s, a, s') -> probability rewards: dict (s, a, s') -> reward gamma: discount factor (0 < gamma < 1) theta: convergence threshold Returns: V: dict s -> optimal value policy: dict s -> optimal action """
Test case
A 3-state MDP: states=[A, B, C], C is terminal with reward +1.
transitions: (A, right, B)=1.0, (B, right, C)=1.0
rewards: (B, right, C)=1.0, all others=0
gamma=0.9
Expected: V[A]=0.81 (0.9²), V[B]=0.9, V[C]=0 (terminal). Policy: A→right, B→right.
Terminal states should not be updated — their value is fixed (usually 0, with the reward earned on transition TO them). Skip terminals in the main loop. For the Bellman backup: sum over all possible s' reachable from (s, a), using transitions[(s,a,s')] as the probability and rewards[(s,a,s')] as the immediate reward.
python
def value_iteration(states, actions, transitions, rewards, gamma, theta=1e-6):
    V = {s: 0.0 for s in states}
    terminals = {s for s in states
                 if not any((s, a, sp) in transitions for a in actions for sp in states)}

    while True:
        delta = 0.0
        for s in states:
            if s in terminals:
                continue
            v_old = V[s]
            # Bellman backup: max over actions
            best = float('-inf')
            for a in actions:
                q = 0.0
                for sp in states:
                    p = transitions.get((s, a, sp), 0.0)
                    if p > 0:
                        r = rewards.get((s, a, sp), 0.0)
                        q += p * (r + gamma * V[sp])
                if q > best:
                    best = q
            V[s] = best
            delta = max(delta, abs(v_old - best))
        if delta < theta:
            break

    # Extract policy
    policy = {}
    for s in states:
        if s in terminals:
            continue
        best_a, best_val = None, float('-inf')
        for a in actions:
            q = sum(transitions.get((s,a,sp), 0) * (rewards.get((s,a,sp), 0) + gamma * V[sp])
                    for sp in states)
            if q > best_val:
                best_val, best_a = q, a
        policy[s] = best_a
    return V, policy
Bonus challenge: Modify this to return the number of iterations until convergence. How does the iteration count change as γ approaches 1? (Hint: convergence rate is O(γk), so higher γ = more iterations.)

Chapter 7: Policy Extraction — From Values to Arrows

Value iteration gives us V*(s) for every state. But the agent needs to know what to do, not just how good each state is. The optimal policy π*(s) tells the agent which action to take in each state. Extracting it from V* is simple: for each state, pick the action that maximizes the expected value of the next state.

π*(s) = argmaxas' P(s'|s,a) [ R(s,a,s') + γ V*(s') ]

Below, watch the policy arrows emerge after value iteration converges. Each arrow points in the direction of the best action. Near the pit, arrows steer away. Near the goal, arrows point toward it. The policy is a complete strategy — it tells the agent what to do in every situation.

Policy iteration is an alternative algorithm: start with an arbitrary policy, compute its value, then improve the policy, and repeat. It often converges in fewer iterations than value iteration, but each iteration is more expensive. Both converge to the same optimal policy.
Policy Extraction

Click "Converge Values" to run value iteration to completion, then "Show Policy" to extract and display the optimal policy arrows. Adjust γ to see how it affects the policy.

Discount γ0.90
Step cost-0.04
Observe: With a step cost of -0.04, the policy near the pit at (3,1) is cautious — it steers wide. But crank the step cost to -0.4 and the agent becomes reckless: it takes the shortest path even if it risks the pit. The optimal policy changes with the reward function.
Check: How do you extract the optimal policy from V*?
⚔ Adversarial: Value iteration converges, but your policy performs terribly in the actual environment. Training and test use the same MDP. What's wrong?
You run value iteration to convergence (delta < 10-6). You extract the greedy policy. You deploy it. The agent consistently fails to reach the goal, often falling into the pit. You verify: same grid, same transitions, same rewards. VI definitely converged. What happened?
🏗 Design Challenge You're the Architect: Design an MDP for a Trading Bot ✓ ATTEMPTED
Your quant fund wants to deploy an MDP-based trading agent. It manages a single-stock portfolio. Every minute, it observes the market and decides what fraction of its capital to allocate. You must define the MDP: states, actions, transitions, rewards, and discount factor.
Observation rate
1-minute bars (OHLCV)
Action latency
~200ms execution
Capital
$100K, transaction cost 0.1%
Risk limit
Max 5% drawdown/day
1. State space: What should the state encode? Raw price? Returns? Technical indicators? Portfolio position? How do you keep |S| tractable?
2. Action space: Continuous (fraction of capital) or discrete (buy/hold/sell at fixed sizes)? What are the trade-offs?
3. Reward: Raw P&L vs Sharpe ratio vs risk-adjusted return? What does each reward function incentivize, and what pathologies does it create?
4. Gamma: Should the agent be myopic (γ~0.5, good for scalping) or far-sighted (γ~0.99, good for trend-following)? How does this interact with non-stationarity of markets?

State: Most successful approaches use a feature vector, NOT raw price (non-stationary!). Typical: [position_fraction, recent_returns (5-20 bars), volatility_estimate, spread, time_of_day]. Discretized into bins or fed to a neural network for function approximation. |S| is effectively infinite → model-free RL or fitted Q-iteration.

Actions: Discrete actions (e.g., target_position in {-1, -0.5, 0, 0.5, 1} of capital) dominate because continuous actions make exploration harder. The 0.1% transaction cost naturally penalizes excessive trading.

Reward: Raw P&L creates risk-seeking behavior (agent bets big hoping for rare wins). The field uses differential Sharpe ratio — a reward at each step that approximates the contribution to the overall Sharpe ratio. This balances return vs volatility incrementally. Alternatively: reward = return − λ · |drawdown| with λ tuned to risk appetite.

Gamma: Typically 0.95–0.99, but the real issue is non-stationarity. Markets shift regimes. A policy trained on bull market data fails in a crash. Practitioners retrain frequently or use meta-learning. Some use γ = 0 (pure bandit) for HFT since the "next state" is essentially unpredictable at millisecond scales.

Critical pitfall: The Markov property is violated! Today's price depends on yesterday's momentum, sentiment, order flow history. You must encode enough history in the state to make it approximately Markov (hence LSTM/attention in state encoders).

🔗 Pattern Recognition
Recursive Bellman = Recursive Bayes
This Lesson (MDP)
V*(s) = maxas' P(s'|s,a) [R + γ V*(s')]
"Value of here = best immediate + discounted future"
Kalman Filter
k = x̂k|k-1 + Kk(zk − H x̂k|k-1)
"Estimate now = prediction + correction"Kalman Filter lesson

Both are recursive equations where the current answer depends on the next-step answer. The Bellman equation propagates value backward from the future. The Kalman filter propagates information forward from measurements. Both solve their respective problems by breaking an intractable global optimization into local one-step operations chained together.

Can you spot this pattern in other algorithms? (Hint: dynamic programming, backpropagation, message passing in graphical models all share this "local computation, global solution" structure.)

Chapter 8: From MDP to the Real World

The grid world is a toy, but MDPs are everywhere. Any problem with sequential decisions under uncertainty can be framed as an MDP. The concepts you've learned — states, actions, transitions, rewards, values, and policies — are the language of decision-making in AI.

DomainStatesActionsRewards
RoboticsJoint angles, positionsMotor torquesTask completion
Game AIBoard/screen configMoves, buttonsScore, win/loss
LLM TokensGenerated text so farNext token choiceHuman preference
FinancePortfolio stateBuy/sell/holdReturns
DialogueConversation historyResponse choicesUser satisfaction
LLMs as MDPs: When a language model generates text, each token is an "action." The "state" is the text generated so far. The "reward" comes from human feedback (RLHF). The entire fine-tuning pipeline — PPO, DPO, GRPO — treats token generation as an MDP.

Limitations of MDPs

MDPs assume the agent can fully observe its state. In the real world, you often can't. A robot might not know its exact position; a poker player can't see the opponent's cards. This leads to Partially Observable MDPs (POMDPs), which add a belief state over possible states.

MDPs also assume we know the transition function P and reward function R. When we don't, we enter the world of Reinforcement Learning (RL), where the agent must learn these through trial and error.

MDP
Full observability, known model
↓ relax observability
POMDP
Partial observability, belief states
↓ relax known model
Reinforcement Learning
Learn by interaction, no model given
Where from here? MDPs are the theoretical backbone of reinforcement learning. If you understand value iteration and policy extraction, you have the foundation for Q-learning, policy gradient methods, actor-critic algorithms, and everything that powers modern AI decision-making.
🔗 Pattern Recognition
Known Model → Unknown Model: MDP to RL
MDP (this lesson)
V*(s) = maxas' P(s'|s,a) [R + γ V*(s')]
We know P. We can compute the sum exactly.
RL (model-free)
Q(s,a) ← Q(s,a) + α[R + γ maxa' Q(s',a') − Q(s,a)]
We DON'T know P. We sample transitions by acting.RL Algorithms lesson

Q-learning replaces the known-model sum (∑s' P · [...]) with a sample from real experience. Instead of computing the expectation analytically, we observe one (s, a, r, s') tuple and nudge our Q-value toward it. The Bellman equation becomes a stochastic approximation. Same fixed point, different path to get there.

What about POMDPs? When the agent can't observe s directly, it maintains a belief distribution over states — the "belief MDP" where states are probability distributions. How does this connect to the Bayes filter? (Hint: the belief update IS a Bayes filter.)

⚔ Adversarial: You model a robot navigation task as an MDP and solve it perfectly. The robot fails catastrophically in the real world. States, actions, and rewards match exactly. Transitions are deterministic. What went wrong?
The robot has a map. In simulation (your MDP), it always knows its exact (x,y) position. In reality, its GPS has 2m error. The "same MDP" gives the optimal policy. Yet the real robot drives off a cliff that's 1.5m from the correct path.
"The purpose of computing is insight, not numbers."
— Richard Hamming

You now understand how to frame and solve decision problems. Every game, every robot, every language model is making decisions — and now you know the math behind them.

Check: What's the key difference between an MDP and a POMDP?