The Complete Beginner's Path

Understand Markov Decision Processes

The mathematical framework behind every decision-making AI — from game-playing agents to robotic navigation to how LLMs choose their next token.

Prerequisites: Basic probability + Grid intuition. That's it.
9 Chapters
8+ Simulations
0 Assumed Knowledge

Chapter 0: Why Decisions?

Imagine you're a robot in a grid world. Each cell has a reward: some cells are gold (+1), some are pits (-1), and every step you take costs a little energy (-0.04). Your goal: find the best path from any starting cell to the gold, while avoiding the pit.

This sounds simple, but it gets tricky fast. What if your wheels slip? What if the grid is enormous? What if rewards are far away and you need to plan many steps ahead? You need a framework for thinking about sequential decisions under uncertainty. That framework is the Markov Decision Process (MDP).

The core idea: An MDP captures everything about a decision problem: where you can be (states), what you can do (actions), what happens when you act (transitions), and what you get (rewards). Solve the MDP, and you know the optimal thing to do in every situation.
A Robot in a Grid

Click any empty cell to place the robot, then watch it navigate toward the goal while avoiding the pit. The robot follows the optimal policy.

Check: What makes a decision problem an MDP?

Chapter 1: States & Actions

An MDP starts with two ingredients. States are all the situations the agent can be in. In our grid world, each cell is a state. Actions are what the agent can do from each state. In the grid, the actions are: up, down, left, right.

The classic grid world is a 4×3 grid with 11 accessible cells (one cell is a wall). The agent starts somewhere and must reach the goal cell at (3,2) while avoiding the pit at (3,1). The wall at (1,1) is impassable.

States S
S = {(0,0), (1,0), (2,0), ... (3,2)} — 11 cells
Actions A
A = {Up, Down, Left, Right}
Terminal States
Goal (+1) and Pit (-1) end the episode
Markov property: The future depends only on the current state, not on how you got there. This is why it's called a Markov decision process. Your position in the grid tells you everything you need to decide your next move.
The Grid World

Hover over cells to see their coordinates. The dark cell is the wall. Green = goal. Red = pit.

Symbol | Meaning | Grid Example
S | Set of all states | 11 accessible grid cells
A | Set of all actions | {Up, Down, Left, Right}
s₀ | Start state | Any non-terminal cell
S⁺ | Terminal states | Goal (3,2) and Pit (3,1)
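These sets are small enough to write out directly. Here's a minimal Python sketch of S, A, and the terminals; the (x, y) coordinate convention (x = column 0..3, y = row 0..2, origin at the bottom-left) is an assumption of this sketch, not taken from the simulation's code:

```python
# Grid-world state and action sets for the 4x3 world.
# Coordinate convention (assumed): (x, y), origin at the bottom-left.
WALL, GOAL, PIT = (1, 1), (3, 2), (3, 1)

# S: every cell except the wall -- 11 accessible states.
STATES = [(x, y) for x in range(4) for y in range(3) if (x, y) != WALL]

# A: the same four actions are available in every state.
ACTIONS = ["Up", "Down", "Left", "Right"]

# Terminal states end the episode.
TERMINALS = {GOAL, PIT}

print(len(STATES))  # 11
```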
Check: What does the Markov property guarantee?

Chapter 2: Transitions — Stochastic Actions

Here's where MDPs get interesting. When you tell the robot "go up," it doesn't always go up. Real robots have noisy actuators: wheels slip, joints wobble, wind blows. We model this with a transition function P(s'|s,a): the probability of ending up in state s' given that you're in state s and take action a.

In our grid world, the standard model is: 80% of the time, the action works as intended. 10% of the time, you slip left (perpendicular). 10% of the time, you slip right. If you'd hit a wall, you stay in place. This stochasticity is what makes planning hard — and what makes MDPs powerful.

P(s' | s, a) — e.g., P(cell above | current cell, Up) = 0.8
Why stochastic? If actions were deterministic, planning would be trivial — just find the shortest path. Stochastic transitions force the agent to be robust: it must plan for the possibility that things go wrong. A good policy avoids danger even when it could slip toward it.
Transition Probabilities

Click a cell, then click an action. Watch the probability fan: 80% intended, 10% slip left, 10% slip right. Adjust the noise to see how stochasticity changes things.

Intended prob.: 0.80
Boundary rule: If an action would move the agent off the grid or into a wall, the agent stays in its current cell. The probability mass that would go off-grid gets "reflected" back to the current state. This is a common convention in grid-world MDPs.
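The 80/10/10 slip model and the boundary rule together pin down P(s'|s,a). A hedged sketch — not the simulation's actual code; the helper names (`move`, `SLIPS`) and coordinate layout are illustrative:

```python
WALL = (1, 1)
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
# Perpendicular slip directions for each intended action.
SLIPS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def move(s, a):
    """Deterministic effect of action a; stay put if blocked."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    if nxt == WALL or not (0 <= nxt[0] < 4 and 0 <= nxt[1] < 3):
        return s  # off-grid or wall: probability mass reflects back
    return nxt

def P(s, a, noise=0.1):
    """Transition distribution P(s' | s, a) as a dict {s': prob}."""
    probs = {}
    for act, p in [(a, 1 - 2 * noise), (SLIPS[a][0], noise), (SLIPS[a][1], noise)]:
        s2 = move(s, act)
        probs[s2] = probs.get(s2, 0.0) + p  # merge mass landing on the same cell
    return probs

# From (0, 0), "Up": 80% to (0, 1); the left slip is blocked (stay); 10% right.
print(P((0, 0), "Up"))
```

Note how the blocked slip's probability mass is added back onto the current cell, exactly as the boundary rule describes.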
Check: In the standard grid world, if you choose action "Up", what is the probability you actually move up?

Chapter 3: Rewards — What Drives Behavior

The reward function R(s) tells the agent what's good and what's bad. In our grid world: the goal cell gives +1, the pit gives -1, and every other step gives -0.04. That small negative reward for each step is crucial — it encourages the agent to reach the goal quickly rather than wandering forever.

The step penalty is a design choice with deep consequences. Make it too small, and the agent dawdles. Make it too large (close to -1), and the agent might prefer to jump into the pit immediately rather than risk a long journey. The reward function shapes behavior.

R(s) = +1 (goal) | -1 (pit) | -0.04 (all other cells)
Reward shaping: The step cost of -0.04 seems arbitrary, but it encodes a preference: "reach the goal soon." With R = 0 for non-terminal states, the agent wouldn't care about taking 5 steps vs 500. With R = -2, the agent might prefer the pit over a long path. The right reward function is a design art.
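The reward model itself is tiny; exposing the step cost as a parameter makes the design choice explicit. A sketch (the function name and defaults are illustrative):

```python
GOAL, PIT = (3, 2), (3, 1)

def R(s, step_cost=-0.04):
    """Reward for state s: +1 at the goal, -1 at the pit, a small
    penalty everywhere else to encourage reaching the goal quickly."""
    if s == GOAL:
        return 1.0
    if s == PIT:
        return -1.0
    return step_cost

# A 6-step path ending at the goal, undiscounted: 5 * (-0.04) + 1 = 0.8
path_reward = 5 * R((0, 0)) + R(GOAL)
print(path_reward)
```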
Reward Landscape

Adjust the step cost and watch how the reward landscape changes. The color encodes reward intensity: green = positive, red = negative.

Step cost: -0.04
R(s) = -0.04: Mild urgency. Agent takes a safe but efficient route to the goal. The classic setting.
R(s) = -0.5: Extreme urgency. Agent rushes recklessly, risking the pit because lingering is too painful.
Check: Why do we add a small negative reward for each step?

Chapter 4: Returns & Discounting

An agent doesn't just care about the next reward — it cares about the total reward it will collect over its entire journey. We call this the return G. But should a reward 100 steps from now count the same as a reward right now?

Usually not. We introduce a discount factor γ (gamma), between 0 and 1, that shrinks future rewards. A reward k steps away is multiplied by γ^k. With γ = 0.9, a reward 10 steps away is worth only 0.9^10 ≈ 0.35 of its face value.

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = ∑_{k=0}^∞ γ^k R_{t+k+1}
γ controls patience. γ near 1 = far-sighted agent that plans far ahead. γ near 0 = myopic agent that grabs the nearest reward. γ = 1 can cause infinite returns in non-terminating tasks. γ = 0 means the agent only cares about the very next reward.
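You can verify these trade-offs directly. A small helper for finite reward sequences (illustrative — `discounted_return` is not a library function):

```python
def discounted_return(rewards, gamma):
    """G = sum_k gamma^k * rewards[k]: the reward k steps away is scaled by gamma^k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# A +1 reward three steps into the future is scaled by gamma^3:
print(discounted_return([0, 0, 0, 1], gamma=0.5))  # 0.5**3 = 0.125

# The Discounting Explorer's setup: +1 at every step with gamma = 0.9.
# The infinite sum converges to 1 / (1 - 0.9) = 10.
print(round(discounted_return([1] * 1000, gamma=0.9), 6))
```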
Discounting Explorer

A fixed reward of +1 at each future step. Watch how γ determines what the agent "sees" into the future. Bars show the discounted value at each step.

Discount γ: 0.90
Total return G = 10.00
Why discount? Three reasons: (1) Mathematical convenience — the sum converges. (2) Uncertainty — the farther we look, the less sure we are. (3) Preference — most agents (and humans!) prefer sooner rewards to later ones.
Check: With γ = 0.5, how much is a +1 reward worth if it's 3 steps away?

Chapter 5: Value Functions — How Good Is a State?

The optimal value function V*(s) answers a fundamental question: "if I'm in state s and I act optimally from here on, what total (discounted) return can I expect?" A high V*(s) means the state is desirable. A low V*(s) means it's dangerous or far from the goal.

The value function satisfies a beautiful recursive relationship called the Bellman equation. The value of a state equals the best action's immediate reward plus the discounted expected value of the next state. This is the foundation of all MDP solving algorithms.

V*(s) = max_a ∑_{s'} P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
The Bellman equation is recursive: The value of a state depends on the values of its neighbors, which depend on the values of their neighbors, all the way to the terminal states. It's like asking "how far am I from the exit?" — the answer depends on how far your neighbors are.
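Since Chapter 3 defined rewards on states, a single backup can be written in the equivalent form V(s) ← R(s) + γ · max_a ∑_{s'} P(s'|s,a) V(s'); the R(s,a,s') form is the same idea with rewards attached to transitions. A self-contained sketch, with the grid model redeclared inline (conventions assumed, not taken from the page's code):

```python
WALL, GOAL, PIT = (1, 1), (3, 2), (3, 1)
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SLIPS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def move(s, a):
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return s if nxt == WALL or not (0 <= nxt[0] < 4 and 0 <= nxt[1] < 3) else nxt

def P(s, a):
    probs = {}
    for act, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        s2 = move(s, act)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

def R(s):
    return 1.0 if s == GOAL else -1.0 if s == PIT else -0.04

def bellman_backup(s, V, gamma=0.9):
    """One-step lookahead: V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')."""
    return R(s) + gamma * max(
        sum(p * V[s2] for s2, p in P(s, a).items()) for a in MOVES)

# With only the terminal values known, back up the cell left of the goal:
V = {(x, y): 0.0 for x in range(4) for y in range(3) if (x, y) != WALL}
V[GOAL], V[PIT] = 1.0, -1.0
print(bellman_backup((2, 2), V))
# = -0.04 + 0.9 * 0.8 = 0.68 under the 80% slip model
# (it would be -0.04 + 0.9 * 1 = 0.86 if moves were deterministic)
```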
Bellman Backup Visualizer

Click a cell to see how its value is computed from its neighbors. The Bellman backup "looks ahead" one step in each direction, weighted by transition probabilities.

Goal Cell
V*(goal) = +1 (terminal, no future rewards)
↓ propagates outward
Nearby Cells
V*(near goal) ≈ γ × 1 − 0.04 ≈ 0.86
↓ propagates further
Far Cells
V* decreases with distance from goal
Check: What does the Bellman equation say about V*(s)?

Chapter 6: Value Iteration — The Algorithm

We know the Bellman equation defines the optimal values, but how do we compute them? Value iteration is the workhorse algorithm. Start by initializing all values to zero (except terminals). Then, repeatedly apply the Bellman equation to every state: update each cell's value based on its neighbors' values. Repeat until the values converge.

Watch the magic below: values start at zero everywhere, then gradually propagate outward from the goal. Each iteration, the "wave" of positive values reaches one cell further. After convergence, every cell knows its optimal expected return.

This is the showcase. Click "Step" to run one iteration at a time and watch values fill the grid. Or click "Run" to animate the full convergence. Try different γ values to see how the discount factor changes the value landscape.
Value Iteration — Interactive Grid World
Iteration: 0 | Max Δ: —
Discount γ: 0.90
Step cost: -0.04
Noise: 0.10

The Algorithm

Python
def value_iteration(S, A, P, R, gamma, theta=1e-6):
    V = {s: 0 for s in S}      # initialize all values to 0
    while True:
        delta = 0
        for s in S:
            if is_terminal(s): continue
            v = V[s]
            V[s] = max(                # best action
                sum(P(s,a,s2) * (R(s,a,s2) + gamma * V[s2])
                     for s2 in S)
                for a in A
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:           # converged!
            break
    return V
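The listing above assumes S, A, P, R, and is_terminal already exist. Here's one way to supply them for the 4×3 grid — a sketch that attaches rewards to the state being entered, so R(s, a, s') = R(s'); under that convention terminal values stay at 0, since the ±1 is collected on entry. The `value_iteration` function is repeated verbatim so the snippet runs on its own:

```python
WALL, GOAL, PIT = (1, 1), (3, 2), (3, 1)
S = [(x, y) for x in range(4) for y in range(3) if (x, y) != WALL]
A = ["Up", "Down", "Left", "Right"]
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SLIPS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def is_terminal(s):
    return s in (GOAL, PIT)

def move(s, a):
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return s if nxt == WALL or not (0 <= nxt[0] < 4 and 0 <= nxt[1] < 3) else nxt

def P(s, a, s2):
    """P(s' | s, a) under the 80/10/10 slip model with reflection."""
    p = 0.0
    for act, q in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        if move(s, act) == s2:
            p += q
    return p

def R(s, a, s2):
    """Reward attached to the state being entered (an assumed convention)."""
    return 1.0 if s2 == GOAL else -1.0 if s2 == PIT else -0.04

def value_iteration(S, A, P, R, gamma, theta=1e-6):
    V = {s: 0 for s in S}
    while True:
        delta = 0
        for s in S:
            if is_terminal(s):
                continue
            v = V[s]
            V[s] = max(sum(P(s, a, s2) * (R(s, a, s2) + gamma * V[s2])
                           for s2 in S)
                       for a in A)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    return V

V = value_iteration(S, A, P, R, gamma=0.9)
print(round(V[(2, 2)], 3))  # cell left of the goal: high value
print(round(V[(0, 0)], 3))  # far corner: lower, but still positive
```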
Check: What is the convergence criterion for value iteration?

Chapter 7: Policy Extraction — From Values to Arrows

Value iteration gives us V*(s) for every state. But the agent needs to know what to do, not just how good each state is. The optimal policy π*(s) tells the agent which action to take in each state. Extracting it from V* is simple: for each state, pick the action that maximizes the expected value of the next state.

π*(s) = argmax_a ∑_{s'} P(s'|s,a) [ R(s,a,s') + γ V*(s') ]

Below, watch the policy arrows emerge after value iteration converges. Each arrow points in the direction of the best action. Near the pit, arrows steer away. Near the goal, arrows point toward it. The policy is a complete strategy — it tells the agent what to do in every situation.

Policy iteration is an alternative algorithm: start with an arbitrary policy, compute its value, then improve the policy, and repeat. It often converges in fewer iterations than value iteration, but each iteration is more expensive. Both converge to the same optimal policy.
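Extraction itself is a single greedy sweep over the states. A sketch, with the grid model redeclared so the snippet stands alone; the entering-state reward convention, R(s,a,s') = R(s'), is an assumption:

```python
WALL, GOAL, PIT = (1, 1), (3, 2), (3, 1)
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SLIPS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def move(s, a):
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return s if nxt == WALL or not (0 <= nxt[0] < 4 and 0 <= nxt[1] < 3) else nxt

def P(s, a):
    probs = {}
    for act, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        s2 = move(s, act)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

def R(s):
    return 1.0 if s == GOAL else -1.0 if s == PIT else -0.04

def extract_policy(V, gamma=0.9):
    """Greedy one-step lookahead against a value table V."""
    pi = {}
    for s in V:
        if s in (GOAL, PIT):
            continue  # no action needed in terminal states
        pi[s] = max(MOVES, key=lambda a: sum(
            p * (R(s2) + gamma * V[s2]) for s2, p in P(s, a).items()))
    return pi

# Even against a rough value table (terminals only), the greedy policy
# already points toward the goal and away from the pit. Feed in the
# converged V* from value iteration to get the full optimal policy.
V = {(x, y): 0.0 for x in range(4) for y in range(3) if (x, y) != WALL}
V[GOAL], V[PIT] = 1.0, -1.0
pi = extract_policy(V)
print(pi[(2, 2)])  # Right -- toward the goal
print(pi[(2, 1)])  # Left -- away from the pit
```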
Policy Extraction

Click "Converge Values" to run value iteration to completion, then "Show Policy" to extract and display the optimal policy arrows. Adjust γ to see how it affects the policy.

Discount γ: 0.90
Step cost: -0.04
Observe: With a step cost of -0.04, the policy near the pit at (3,1) is cautious — it steers wide. But crank the step cost to -0.4 and the agent becomes reckless: it takes the shortest path even if it risks the pit. The optimal policy changes with the reward function.
Check: How do you extract the optimal policy from V*?

Chapter 8: From MDP to the Real World

The grid world is a toy, but MDPs are everywhere. Any problem with sequential decisions under uncertainty can be framed as an MDP. The concepts you've learned — states, actions, transitions, rewards, values, and policies — are the language of decision-making in AI.

Domain | States | Actions | Rewards
Robotics | Joint angles, positions | Motor torques | Task completion
Game AI | Board/screen config | Moves, buttons | Score, win/loss
LLM Tokens | Generated text so far | Next token choice | Human preference
Finance | Portfolio state | Buy/sell/hold | Returns
Dialogue | Conversation history | Response choices | User satisfaction
LLMs as MDPs: When a language model generates text, each token is an "action." The "state" is the text generated so far. The "reward" comes from human feedback (RLHF). The entire fine-tuning pipeline — PPO, DPO, GRPO — treats token generation as an MDP.

Limitations of MDPs

MDPs assume the agent can fully observe its state. In the real world, you often can't. A robot might not know its exact position; a poker player can't see the opponent's cards. This leads to Partially Observable MDPs (POMDPs), which add a belief state over possible states.

MDPs also assume we know the transition function P and reward function R. When we don't, we enter the world of Reinforcement Learning (RL), where the agent must learn these through trial and error.

MDP
Full observability, known model
↓ relax observability
POMDP
Partial observability, belief states
↓ relax known model
Reinforcement Learning
Learn by interaction, no model given
Where from here? MDPs are the theoretical backbone of reinforcement learning. If you understand value iteration and policy extraction, you have the foundation for Q-learning, policy gradient methods, actor-critic algorithms, and everything that powers modern AI decision-making.
"The purpose of computing is insight, not numbers."
— Richard Hamming

You now understand how to frame and solve decision problems. Every game, every robot, every language model is making decisions — and now you know the math behind them.

Check: What's the key difference between an MDP and a POMDP?