The mathematical framework behind every decision-making AI — from game-playing agents to robotic navigation to how LLMs choose their next token.
Imagine you're a robot in a grid world. Each cell has a reward: some cells are gold (+1), some are pits (-1), and every step you take costs a little energy (-0.04). Your goal: find the best path from any starting cell to the gold, while avoiding the pit.
This sounds simple, but it gets tricky fast. What if your wheels slip? What if the grid is enormous? What if rewards are far away and you need to plan many steps ahead? You need a framework for thinking about sequential decisions under uncertainty. That framework is the Markov Decision Process (MDP).
Click any empty cell to place the robot, then watch it navigate toward the goal while avoiding the pit. The robot follows the optimal policy.
An MDP starts with two ingredients. States are all the situations the agent can be in. In our grid world, each cell is a state. Actions are what the agent can do from each state. In the grid, the actions are: up, down, left, right.
The classic grid world is a 4×3 grid with 11 accessible cells (one cell is a wall). The agent starts somewhere and must reach the goal cell at (3,2) while avoiding the pit at (3,1). The wall at (1,1) is impassable.
Hover over cells to see their coordinates. The dark cell is the wall. Green = goal. Red = pit.
| Symbol | Meaning | Grid Example |
|---|---|---|
| S | Set of all states | 11 accessible grid cells |
| A | Set of all actions | {Up, Down, Left, Right} |
| s0 | Start state | Any non-terminal cell |
| S+ | Terminal states | Goal (3,2) and Pit (3,1) |
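The table above can be written down directly as plain Python data. The coordinate encoding here ((column, row) tuples) is one hypothetical choice, not the only one:

```python
# Hypothetical encoding of the grid world's state and action sets.
# Cells are (col, row) tuples; the wall cell is excluded from S.
WALL, GOAL, PIT = (1, 1), (3, 2), (3, 1)
S = [(c, r) for c in range(4) for r in range(3) if (c, r) != WALL]
A = ['Up', 'Down', 'Left', 'Right']
```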
Here's where MDPs get interesting. When you tell the robot "go up," it doesn't always go up. Real robots have noisy actuators: wheels slip, joints wobble, wind blows. We model this with a transition function P(s'|s,a): the probability of ending up in state s' given that you're in state s and take action a.
In our grid world, the standard model is: 80% of the time, the action works as intended. 10% of the time, you slip left (perpendicular). 10% of the time, you slip right. If you'd hit a wall, you stay in place. This stochasticity is what makes planning hard — and what makes MDPs powerful.
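A sketch of this transition model in code. The helper names (`transition_dist`, the `move` callable it takes) are assumptions for illustration; `move(s, a)` is any deterministic function that returns the cell you'd land in, staying put on a wall bump:

```python
# Noisy transition model: 80% intended direction, 10% slip to each
# perpendicular direction. Slips that would hit a wall keep you in place
# (that behavior lives inside the `move` helper passed in).
PERP = {'Up': ('Left', 'Right'), 'Down': ('Right', 'Left'),
        'Left': ('Down', 'Up'), 'Right': ('Up', 'Down')}

def transition_dist(s, a, move):
    """Return a list of (probability, next_state) pairs for action a in state s."""
    left, right = PERP[a]
    return [(0.8, move(s, a)), (0.1, move(s, left)), (0.1, move(s, right))]
```

Whatever the geometry, the probabilities must sum to 1 for every state-action pair.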
Click a cell, then click an action. Watch the probability fan: 80% intended, 10% slip left, 10% slip right. Adjust the noise to see how stochasticity changes things.
The reward function R(s) tells the agent what's good and what's bad. In our grid world: the goal cell gives +1, the pit gives -1, and every other step gives -0.04. That small negative reward for each step is crucial — it encourages the agent to reach the goal quickly rather than wandering forever.
The step penalty is a design choice with deep consequences. Make it too small, and the agent dawdles. Make it too large (close to -1), and the agent might prefer to jump into the pit immediately rather than risk a long journey. The reward function shapes behavior.
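The reward function described above fits in a few lines. Making the step cost a parameter makes the design choice explicit (the goal, pit, and default values below follow the grid world in the text):

```python
# State-based reward: +1 at the goal, -1 at the pit, a small
# (configurable) penalty everywhere else.
def R(s, goal=(3, 2), pit=(3, 1), step_cost=-0.04):
    if s == goal:
        return 1.0
    if s == pit:
        return -1.0
    return step_cost
```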
Adjust the step cost and watch how the reward landscape changes. The color encodes reward intensity: green = positive, red = negative.
An agent doesn't just care about the next reward — it cares about the total reward it will collect over its entire journey. We call this the return G. But should a reward 100 steps from now count the same as a reward right now?
Usually not. We introduce a discount factor γ (gamma), between 0 and 1, that shrinks future rewards. A reward k steps away is multiplied by γᵏ. With γ = 0.9, a reward 10 steps away is worth only 0.9¹⁰ ≈ 0.35 of its face value.
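In code, the return G is just a discounted sum over the reward sequence (the function name here is a hypothetical helper, not standard API):

```python
# Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# `rewards` lists future rewards, nearest first.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**k * r for k, r in enumerate(rewards))
```

For example, three consecutive rewards of +1 with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.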
A fixed reward of +1 at each future step. Watch how γ determines what the agent "sees" into the future. Bars show the discounted value at each step.
The value function V(s) answers a fundamental question: "if I'm in state s and I act optimally from here on, what total (discounted) return can I expect?" A high V(s) means the state is desirable. A low V(s) means it's dangerous or far from the goal.
The value function satisfies a beautiful recursive relationship called the Bellman equation. The value of a state equals the best action's immediate reward plus the discounted expected value of the next state. This is the foundation of all MDP solving algorithms.
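Written out, the optimality version of that recursion is the following (using the general reward form R(s, a, s′); with a state-based reward like our grid world's, R(s, a, s′) is just the step reward plus the value of landing in s′):

```latex
V^*(s) = \max_{a \in A} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^*(s') \,\bigr]
```

Each state's optimal value is defined in terms of its successors' optimal values, which is exactly what makes iterative solution methods possible.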
Click a cell to see how its value is computed from its neighbors. The Bellman backup "looks ahead" one step in each direction, weighted by transition probabilities.
We know the Bellman equation defines the optimal values, but how do we compute them? Value iteration is the workhorse algorithm. Start by initializing all values to zero (except terminals). Then, repeatedly apply the Bellman equation to every state: update each cell's value based on its neighbors' values. Repeat until the values converge.
Watch the magic below: values start at zero everywhere, then gradually propagate outward from the goal. Each iteration, the "wave" of positive values reaches one cell further. After convergence, every cell knows its optimal expected return.
```python
def value_iteration(S, A, P, R, gamma, theta=1e-6):
    V = {s: 0 for s in S}  # initialize all values to 0
    while True:
        delta = 0
        for s in S:
            if is_terminal(s):
                continue
            v = V[s]
            V[s] = max(  # best action
                sum(P(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in S)
                for a in A
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # converged!
            break
    return V
```
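As a self-contained check, here is a sketch of value iteration run on the full 4×3 grid world. The coordinate scheme, γ = 0.99, and the convention of fixing terminal values at ±1 while charging the step cost on every move are assumptions for this sketch:

```python
# Value iteration on the 4x3 grid world: wall at (1,1), goal +1 at (3,2),
# pit -1 at (3,1), step cost -0.04, 80/10/10 slip model.
GOAL, PIT, WALL = (3, 2), (3, 1), (1, 1)
STATES = [(c, r) for c in range(4) for r in range(3) if (c, r) != WALL]
MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERP = {'Up': ('Left', 'Right'), 'Down': ('Right', 'Left'),
        'Left': ('Down', 'Up'), 'Right': ('Up', 'Down')}

def step(s, a):
    """Deterministic move; hitting the wall or the edge leaves you in place."""
    s2 = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return s2 if s2 in STATES else s

def transitions(s, a):
    """80% intended, 10% slip to each perpendicular direction."""
    left, right = PERP[a]
    return [(0.8, step(s, a)), (0.1, step(s, left)), (0.1, step(s, right))]

def solve(gamma=0.99, step_cost=-0.04, theta=1e-6):
    V = {s: 0.0 for s in STATES}
    V[GOAL], V[PIT] = 1.0, -1.0  # terminal values fixed at their rewards
    while True:
        delta = 0.0
        for s in STATES:
            if s in (GOAL, PIT):
                continue
            best = max(sum(p * (step_cost + gamma * V[s2])
                           for p, s2 in transitions(s, a)) for a in MOVES)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    return V

V = solve()
```

After convergence, cells adjacent to the goal carry high values, and values fall off smoothly with distance, the "wave" described above, frozen at its fixed point.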
Value iteration gives us V*(s) for every state. But the agent needs to know what to do, not just how good each state is. The optimal policy π*(s) tells the agent which action to take in each state. Extracting it from V* is simple: for each state, pick the action that maximizes the expected one-step return (immediate reward plus discounted value of the next state).
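That extraction step is a single greedy pass over the states. This sketch takes the pieces as callables; `trans(s, a)` returning (probability, next_state) pairs and a state-based reward `R(s)` are assumed interfaces, not a fixed API:

```python
# Greedy policy extraction from converged values V.
def extract_policy(states, actions, trans, R, V, gamma, terminals):
    pi = {}
    for s in states:
        if s in terminals:
            continue  # no action needed in terminal states
        pi[s] = max(actions,
                    key=lambda a: sum(p * (R(s) + gamma * V[s2])
                                      for p, s2 in trans(s, a)))
    return pi
```

No further iteration is required: once V* is known, the greedy one-step lookahead is already optimal.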
Below, watch the policy arrows emerge after value iteration converges. Each arrow points in the direction of the best action. Near the pit, arrows steer away. Near the goal, arrows point toward it. The policy is a complete strategy — it tells the agent what to do in every situation.
Click "Converge Values" to run value iteration to completion, then "Show Policy" to extract and display the optimal policy arrows. Adjust γ to see how it affects the policy.
The grid world is a toy, but MDPs are everywhere. Any problem with sequential decisions under uncertainty can be framed as an MDP. The concepts you've learned — states, actions, transitions, rewards, values, and policies — are the language of decision-making in AI.
| Domain | States | Actions | Rewards |
|---|---|---|---|
| Robotics | Joint angles, positions | Motor torques | Task completion |
| Game AI | Board/screen config | Moves, buttons | Score, win/loss |
| LLM Tokens | Generated text so far | Next token choice | Human preference |
| Finance | Portfolio state | Buy/sell/hold | Returns |
| Dialogue | Conversation history | Response choices | User satisfaction |
MDPs assume the agent can fully observe its state. In the real world, you often can't. A robot might not know its exact position; a poker player can't see the opponent's cards. This leads to Partially Observable MDPs (POMDPs), which add a belief state over possible states.
MDPs also assume we know the transition function P and reward function R. When we don't, we enter the world of Reinforcement Learning (RL), where the agent must learn these through trial and error.
You now understand how to frame and solve decision problems. Every game, every robot, every language model is making decisions — and now you know the math behind them.