Kochenderfer, Wheeler & Wray, Chapter 25

Sequential Problems

Games that unfold over time: agents observe state, choose actions, receive rewards, and transition. Markov games are what you get when multiple agents share an MDP — and Nash equilibrium has to learn Q-values.

Prerequisites: MDPs (Ch 7) + Ch 24 game theory basics. Nash equilibrium must be familiar.
10
Chapters
5
Simulations
10
Quizzes

Chapter 0: From MDP to Game

You're a robot navigating a warehouse. You've modeled this as an MDP: state (location), action (move), reward (item collected), transition (where you end up). Solved. Optimal policy computed.

Now a second robot enters the warehouse. Same goal. Your paths cross. If you both try to take the same item, conflict. If you coordinate, you both do better. The MDP framework has no model of this — it assumes the environment is fixed, not that it contains another decision-maker.

A Markov game (also called a stochastic game) generalizes the MDP to multiple agents. The world has state, and agents observe and act in it — but now the transition and reward depend on the joint action of all agents.

The critical difference from a normal form game: In Ch 24's normal form games, there was one round. Pick actions, get rewards, done. In a Markov game, the interaction repeats across time. The state changes. Agents accumulate discounted rewards. This means the strategic calculation is not just "what should I do now" but "what policy should I commit to for the long run" — exactly the MDP question, but for multiple agents simultaneously.
MDP vs. Markov Game Structure

See how the formalism expands from single-agent to multi-agent. Every component that was about one agent now becomes a joint tuple.

Why sequential matters for equilibrium: In a one-shot game, Nash equilibrium is defined over action profiles. In a sequential game, the "action" at each state is a policy — and the Nash equilibrium is over policies. The value of being in a state depends on the Nash equilibrium strategies in all future states. This creates a recursive Bellman-like structure.
What does a Markov game add to a normal form game?

Chapter 1: Markov Games

A Markov game (Shapley, 1953) for n agents is a tuple (I, S, A1, …, An, T, R1, …, Rn, γ):

ComponentNotationMeaning
AgentsI = {1, …, n}The players
State spaceSObservable world state
Action setsAiActions available to each agent
TransitionT(s'|s, a)Next state distribution given joint action a
Reward functionsRi(s, a, s')Reward for agent i at each step
Discountγ ∈ [0, 1)Time-preference for rewards

At each step: agents observe state s, simultaneously choose actions ai ~ πi(s), world transitions to s' ~ T(·|s,a), each agent receives reward Ri(s,a,s'). Repeat. Each agent seeks to maximize their own expected discounted sum ∑t γt Ri,t.

Special cases recover familiar models: If n=1: reduces to a standard MDP. If all Ri = R (shared reward) and agents can communicate: reduces to a cooperative MDP. If R1 = −R2 (zero-sum): reduces to a two-player zero-sum stochastic game, which Shapley proved has a unique Nash equilibrium computable by minimax value iteration.

Policies in Markov Games

Agent i's policy πi : S → Δ(Ai) maps states to action distributions. A joint policy π = (π1, …, πn) determines all agents' behavior. The value of a joint policy from state s for agent i is:

Viπ(s) = Eπ[∑t=0 γt Ri,t | s0=s]

Stage Games

At each state s, the agents face a stage game — a normal form game with payoffs determined by V. This is the key decomposition: a Markov game is a collection of state-indexed stage games, connected by the transition dynamics.

Minimax theorem for zero-sum Markov games: For 2-player zero-sum Markov games, minimax value iteration converges to the unique Nash equilibrium value function. At each state, solve the stage game minimax problem; the solution gives the Nash values, which become the continuation values for Bellman backup. This extends zero-sum LP methods to sequential settings.
How does the transition function T(s'|s,a) in a Markov game differ from an MDP transition T(s'|s,a)?

Chapter 2: Response Models (Sequential)

In sequential settings, a response model is no longer just a distribution over actions — it's a model of the opponent's policy πj : S → Δ(Aj). If I know your policy, I can solve my own MDP optimally.

The simplest model: assume opponents play a fixed stationary policy (same distribution over actions at each state, regardless of history). With this assumption, the Markov game from agent i's perspective reduces to an MDP with modified transition:

T'(s'|s, ai) = ∑a−ij ≠ i πj(aj|s) · T(s'|s, (ai, a−i))

This averaged transition treats other agents as part of the stochastic environment. Agent i can then run any standard RL algorithm against this effective MDP.

The non-stationarity problem: If all agents simultaneously update their policies, the "effective MDP" seen by each agent changes as the others' policies change. This violates the stationarity assumption underlying standard RL convergence proofs. Multi-agent RL is fundamentally non-stationary from each agent's perspective — a core difficulty of the field.
Fixed-Policy Response in a 2-Agent Grid World

Agent 2 (orange) follows a fixed policy (moves right). Watch Agent 1 (teal) solve its induced MDP and find the best response. Drag the rationality slider to see softmax vs. greedy behavior.

λ (rationality) 2.0
If agent i assumes agent j plays a fixed stationary policy, what does the Markov game reduce to for agent i?

Chapter 3: Nash Equilibrium for Markov Games

A joint policy π* = (π1*, …, πn*) is a Nash equilibrium for a Markov game if, at every state s, no agent can improve their value by unilaterally changing their policy, given that all other agents play π−i*:

Viπ*(s) ≥ Vii, π−i*)(s)   ∀ i, s, πi

The Nash Q-function for agent i is the expected discounted reward from taking action ai in state s, given the joint Nash policy thereafter:

QiNash(s, ai) = ∑a−ij≠i πj*(aj|s) ċ [Ri(s,a) + γ ∑s' T(s'|s,a) ViNash(s')]

And the Nash value is: ViNash(s) = NashVali(Q1Nash(s,·), …, QnNash(s,·)) — the value agent i gets in the Nash equilibrium of the stage game at s.

The Bellman equation for Markov games: The Nash Q-function satisfies a recursive equation — a generalization of the Bellman optimality equation for MDPs. The key difference: the backup step at each state requires solving a stage game (finding the Nash equilibrium of the current payoff matrix) rather than taking a simple max. For 2-player zero-sum games, this is an LP at each state. For general-sum, it's harder.
Initialize Qi(s,a)
Start with zeros or random values for all agents i, states s, joint actions a.
Stage Game Nash
At each state s, compute Qi(s,·) for all agents. Solve the n-player stage game to get π*(s) and NashVali(s).
Bellman Backup
Qi(s,a) ← Ri(s,a) + γ ∑s' T(s'|s,a) NashVali(s').
↻ repeat until convergence
Existence of Nash equilibrium in Markov games: Every finite Markov game has at least one Nash equilibrium in stationary policies. This extends Nash's theorem from normal form to sequential settings. The proof follows from the existence of Nash equilibria in the induced normal form game (over policy space), using the same fixed-point argument.
In the Nash Bellman backup for a Markov game, what replaces the simple argmax from single-agent value iteration?

Chapter 4: Nash Q-Learning

Minimax/Nash value iteration requires knowing T and R. In practice, we often don't. Nash Q-learning (Hu & Wellman, 2003) extends Q-learning to Markov games by replacing the max operator with a Nash equilibrium solver.

Each agent i maintains a Q-table Qi(s, a) indexed by state and joint action. After experiencing a transition (s, a, ri, s'), agent i updates:

Qi(s, a) ← (1−α) Qi(s, a) + α [ri + γ NashVali(s')]

where NashVali(s') is computed by solving the Nash equilibrium of the stage game at s' with payoff matrices Qi(s', ·) for each agent i. The Nash solution gives both the equilibrium strategy π*(s') and the equilibrium values.

What the agents must observe: Nash Q-learning requires each agent to observe the joint action a taken by all agents, and each agent's reward ri. This is the "joint action learner" assumption — agents can see what others did (though not their policies). Without joint action observability, Nash Q-learning breaks down: agents can't index Q by a.
python
import numpy as np

class NashQLearner:
    def __init__(self, n_agents, n_states, n_actions, alpha=0.1, gamma=0.95):
        self.n = n_agents; self.alpha = alpha; self.gamma = gamma
        # Q[i][s, a1, a2, ...] — joint action Q-table for agent i
        shape = [n_states] + [n_actions]*n_agents
        self.Q = [np.zeros(shape) for _ in range(n_agents)]

    def update(self, s, joint_a, rewards, sp):
        nash_vals = self.nash_values(sp)  # solve stage game at sp
        for i in range(self.n):
            old = self.Q[i][s][tuple(joint_a)]
            target = rewards[i] + self.gamma * nash_vals[i]
            self.Q[i][s][tuple(joint_a)] = (1-self.alpha)*old + self.alpha*target

    def nash_values(self, s):
        # For 2-agent zero-sum: minimax LP. For general-sum: Nash solver.
        # Here: simplified for 2-agent 2-action zero-sum case
        U1 = self.Q[0][s]  # 2x2 payoff matrix for agent 1
        U2 = self.Q[1][s]
        pi1, pi2 = solve_nash_2x2(U1, U2)
        v1 = (pi1 @ U1 @ pi2)
        v2 = (pi1 @ U2 @ pi2)
        return [v1, v2]
Nash Q-learning convergence conditions: Hu & Wellman (2003) proved convergence under conditions: (1) Q-values are bounded, (2) α decreases appropriately, (3) every (s, a) is visited infinitely often, AND (4) the Nash equilibrium at every state is unique and a "global optimum" (stronger than plain Nash). Condition 4 fails in many games, limiting practical convergence guarantees.
In Nash Q-learning, what replaces the max Q(s', a') from standard Q-learning?

Chapter 5: Fictitious Play for Markov Games

Fictitious play for Markov games extends the normal form algorithm from Ch 24 to sequential settings. Each agent maintains empirical counts of the opponents' actions at each state, best-responds to the historical frequency, and updates its own Q-function accordingly.

The state-indexed version: agent i maintains counts Cj(s, aj) for each opponent j — how many times j played aj at state s. The empirical frequency is πjemp(aj|s) = Cj(s,aj) / ∑a Cj(s,a). Agent i best-responds to this empirical policy by solving an MDP with modified transitions.

πjemp(aj|s) = Cj(s, aj) / ∑a'j Cj(s, a'j)
The key property preserved from normal form: Even when the current strategy cycles, the empirical average strategy converges. In zero-sum Markov games, the pair of empirical average policies converges to a Nash equilibrium of the game. This is the same time-averaging guarantee from Ch 24, extended to sequential settings via Q-value estimates.

Fictitious Play Update

Observe
(s, a−i, ri, s'). Increment Cj(s, aj) for each j ≠ i.
Update
Compute πjemp from counts. Solve best-response MDP. Update Qi.

Versus Nash Q-Learning

Nash Q-learning tries to reach Nash from the start. Fictitious play learns an approximate best-response to observed opponent play — simpler to implement (no Nash solver needed), but requires opponents to settle to a stationary policy for convergence.

Fictitious Play Convergence on Simple Game

Watch empirical frequencies converge over 200 rounds. The current strategy (bright dot) oscillates; the time-average (trail) converges to Nash (gold star).

Step 0
In fictitious play, what strategy does each agent best-respond to?

Chapter 6: Gradient Ascent for Markov Games

Gradient ascent from Ch 24 extends directly to Markov games. Each agent maintains a parametric policy πi(a|s; θi) and updates the parameters by ascending the gradient of their expected value:

θi ← θi + α ∇θi Viπ(s0)

For tabular policies (πi(a|s) directly as parameters), the gradient at each state reduces to the policy gradient theorem: the gradient is proportional to the advantage Ai(s, ai) = Qi(s, ai, π−i) − Vi(s). This is the same as REINFORCE from Ch 11, but with multi-agent Q-values.

Gradient ascent and policy gradient RL are the same thing: Policy gradient methods from Ch 11–13 are exactly gradient ascent applied to a single agent's utility in an environment that happens to contain other agents. The convergence issues of multi-agent gradient ascent are exactly the non-stationarity issues that plague multi-agent RL: as each agent updates, the environment seen by others changes, potentially undoing progress.
python
import numpy as np

def multiagent_gradient_ascent(game, pi_init, alpha=0.01, T=1000):
    """Simultaneous gradient ascent for tabular Markov game.
    game: MarkovGame object with Q-evaluation method.
    pi_init: list of (nS x nA) strategy matrices."""
    policies = [p.copy() for p in pi_init]
    history = []

    for t in range(T):
        # Evaluate all agents' Q-functions under current joint policy
        Q_vals = game.evaluate(policies)  # list of (nS x nA_i) arrays

        # Compute advantage: A_i(s, a_i) = Q_i(s, a_i) - V_i(s)
        for i in range(game.n_agents):
            V_i = (policies[i] * Q_vals[i]).sum(axis=1, keepdims=True)
            A_i = Q_vals[i] - V_i  # advantage: (nS x nA_i)

            # Gradient ascent on log-policy parameters
            policies[i] = policies[i] + alpha * A_i
            policies[i] = np.maximum(policies[i], 1e-10)
            policies[i] /= policies[i].sum(axis=1, keepdims=True)  # renorm

        history.append([p.copy() for p in policies])
    return history
WoLF (Win or Learn Fast): A stabilizing heuristic for gradient ascent in games: use a large step size αfast when losing (current value < Nash value) and small step size αslow when winning. Empirically, this reduces oscillation. Bowlings & Veloso (2002) proved WoLF converges in stationary environments and two-player two-action games. Not a universal convergence guarantee, but a useful practical trick.
In multi-agent gradient ascent, what is the key convergence challenge that doesn't exist in single-agent RL?

Chapter 7: SHOWCASE — Multi-Agent Grid Game

A 4×4 grid world. Agent 1 (teal) starts top-left, Agent 2 (orange) starts top-right. There is one coin (yellow star) in the grid. The agent who reaches it first gets +10; the other gets 0. Colliding costs both −2.

This is a general-sum Markov game: It's not zero-sum (both can avoid collision), not fully cooperative (only one gets the coin). The Nash equilibrium depends on starting positions and the coin location. Watch how different algorithms lead to different equilibrium behaviors.
SHOWCASE: 2-Agent Grid Game

Choose an algorithm and watch the agents learn. After training, click "Run Episode" to see the learned policies in action. Agent 1 = teal, Agent 2 = orange, Coin = gold.

Episode 0 Agent 1 reward: 0 Agent 2 reward: 0

Independent Q-Learning

Each agent runs standard Q-learning, ignoring the other. Simple but non-stationary: each agent's updates shift the environment for the other. Often converges in practice despite no theoretical guarantee.

Nash Q-Learning

Each agent maintains joint-action Q-tables and solves the stage game at each state to get Nash values. Heavier computation but principled. Converges under conditions from Hu & Wellman (2003).

What to observe in the showcase: Independent Q-learning often learns coordination (they avoid each other). Nash Q-learning explicitly reasons about the other agent. Fictitious play updates slowly based on frequency counts. All three can reach similar final behaviors in simple games — the differences emerge in speed and robustness to opponent changes.

Chapter 8: Convergence Analysis

None of the three algorithms (Nash Q-learning, fictitious play, gradient ascent) have universal convergence guarantees for general Markov games. Understanding when and why they converge — and when they don't — is crucial for applying them.

AlgorithmZero-SumCooperativeGeneral-SumKey Requirement
Nash Q-LearningYes ✓Yes ✓ConditionsUnique global Nash at each state
Minimax Q (zero-sum only)Yes ✓N/ANo2-player, zero-sum, tabular
Fictitious PlayYes ✓ (avg)Yes ✓NoStationary opponents, finite states
Gradient AscentYes ✓ (avg)Yes ✓NoDiminishing step sizes, convexity
Independent Q-LearningNoYes (sometimes)NoStationary effective environment
The cycling problem in Matching Pennies: In the Matching Pennies Markov game (same state, same payoff at every step), IBR cycles: Heads → Tails → Heads → … Nash Q-learning oscillates. Gradient ascent spirals. Only the time-averaged strategies converge to Nash. This is not an artifact — it's a fundamental property of the game: the only Nash equilibrium is in the interior of the simplex, and gradient dynamics necessarily orbit it.
Convergence Behavior Comparison

Plot of agent 1's probability of playing Action 0 over training. Solid line = current strategy, dashed = time average. The time average converges to Nash even when current policy oscillates.

No-regret learning and Nash in Markov games: If all agents run no-regret algorithms (e.g., EXP3 for adversarial bandits at each state), the empirical joint strategy converges to a correlated equilibrium of the Markov game. In zero-sum games, this is Nash. This is the strongest general convergence result available — no stronger guarantee is known without additional assumptions.
Why does time-averaging the strategies help convergence even when the current strategy oscillates?

Chapter 9: Connections & Beyond

Markov games are the right model when agents can observe the full state. In practice, this assumption often fails: robots don't have omniscient sensors, financial agents don't see each other's order books, teammates can't observe each other's locations. Chapter 26 extends to partial observability.

SettingModelSolved by
Full obs., competitiveZero-sum Markov gameMinimax Q, minimax VI
Full obs., general-sumMarkov gameNash Q, fictitious play, GA
Partial obs., competitivePartially obs. Markov gameCh 26 methods
Partial obs., cooperativeDec-POMDPCh 27 methods
Unknown modelMARL (model-free)Nash Q, independent Q
Multi-agent RL in practice: Modern large-scale MARL systems (AlphaStar, OpenAI Five, AlphaGo) don't use Nash Q-learning — the joint action space is too large. Instead, they use: (1) independent learners with population training (self-play), (2) policy gradient methods with centralized critics (MADDPG, MAPPO), or (3) population-based training with diversity incentives. The theory from this chapter informs but doesn't directly scale to these systems.

Self-play and Nash:

In zero-sum games (chess, Go, poker), self-play (training against your own past policy) provably converges to Nash equilibrium under certain conditions. This is the basis for AlphaZero-style systems. The key insight: self-play is iterated best-response, and in zero-sum games with unique Nash, IBR converges.

Centralized training, decentralized execution:

A common paradigm in cooperative MARL: during training, allow agents to share observations and Q-functions (centralized critic). During execution, each agent acts on its own observations. This resolves non-stationarity during training while maintaining decentralized execution. QMIX, MADDPG, MAPPO all use this pattern.

Mean field games: When n → ∞, strategic interaction with the full population becomes intractable. Mean field game theory (Lasry & Lions 2007) approximates each agent's interaction as interaction with an aggregate distribution (the "mean field") of all others. This reduces n-player Nash to a single-agent control problem against a fixed distribution. Applicable to traffic, financial markets, large-scale robotics.
"Reinforcement learning in multi-agent environments is fundamentally a game. The environment is non-stationary not because the world changes, but because other agents learn and adapt."
— Paraphrase, Shoham, Powers & Grenager, "If multi-agent learning is the answer, what is the question?" (2007)
What is the "centralized training, decentralized execution" paradigm in cooperative MARL?
← Ch 24: Multiagent Reasoning Index Ch 26: State Uncertainty →