Ch 25: Sequential Problems — Algorithms for Decision Making

Chapter 0: From MDP to Game

You're a robot navigating a warehouse. You've modeled this as an MDP: state (location), action (move), reward (item collected), transition (where you end up). Solved. Optimal policy computed.

Now a second robot enters the warehouse. Same goal. Your paths cross. If you both try to take the same item, conflict. If you coordinate, you both do better. The MDP framework has no model of this — it assumes the environment is fixed, not that it contains another decision-maker.

A Markov game (also called a stochastic game) generalizes the MDP to multiple agents. The world has state, and agents observe and act in it — but now the transition and reward depend on the joint action of all agents.

The critical difference from a normal form game: In Ch 24's normal form games, there was one round. Pick actions, get rewards, done. In a Markov game, the interaction repeats across time. The state changes. Agents accumulate discounted rewards. This means the strategic calculation is not just "what should I do now" but "what policy should I commit to for the long run" — exactly the MDP question, but for multiple agents simultaneously.

MDP vs. Markov Game Structure

See how the formalism expands from single-agent to multi-agent. Every component that was about one agent now becomes a joint tuple.

Why sequential matters for equilibrium: In a one-shot game, Nash equilibrium is defined over action profiles. In a sequential game, the "action" at each state is a policy — and the Nash equilibrium is over policies. The value of being in a state depends on the Nash equilibrium strategies in all future states. This creates a recursive Bellman-like structure.

What does a Markov game add to a normal form game?

A state space and dynamics — the game repeats across time steps with transitions between states More players, allowing arbitrarily many agents Partial observability — agents can't see the full state

Chapter 1: Markov Games

A Markov game (Shapley, 1953) for n agents is a tuple (I, S, A₁, …, A_n, T, R₁, …, R_n, γ):

Component	Notation	Meaning
Agents	I = {1, …, n}	The players
State space	S	Observable world state
Action sets	A_i	Actions available to each agent
Transition	T(s'\|s, a)	Next state distribution given joint action a
Reward functions	R_i(s, a, s')	Reward for agent i at each step
Discount	γ ∈ [0, 1)	Time-preference for rewards

At each step: agents observe state s, simultaneously choose actions a_i ~ π_i(s), world transitions to s' ~ T(·|s,a), each agent receives reward R_i(s,a,s'). Repeat. Each agent seeks to maximize their own expected discounted sum ∑_t γ^t R_i,t.

Special cases recover familiar models: If n=1: reduces to a standard MDP. If all R_i = R (shared reward) and agents can communicate: reduces to a cooperative MDP. If R₁ = −R₂ (zero-sum): reduces to a two-player zero-sum stochastic game, which Shapley proved has a unique Nash equilibrium computable by minimax value iteration.

Policies in Markov Games

Agent i's policy π_i : S → Δ(A_i) maps states to action distributions. A joint policy π = (π₁, …, π_n) determines all agents' behavior. The value of a joint policy from state s for agent i is:

V_i^π(s) = E^π[∑_t=0^∞ γ^t R_i,t | s₀=s]

Stage Games

At each state s, the agents face a stage game — a normal form game with payoffs determined by V. This is the key decomposition: a Markov game is a collection of state-indexed stage games, connected by the transition dynamics.

Minimax theorem for zero-sum Markov games: For 2-player zero-sum Markov games, minimax value iteration converges to the unique Nash equilibrium value function. At each state, solve the stage game minimax problem; the solution gives the Nash values, which become the continuation values for Bellman backup. This extends zero-sum LP methods to sequential settings.

How does the transition function T(s'|s,a) in a Markov game differ from an MDP transition T(s'|s,a)?

In a Markov game, T conditions on the joint action a of all agents, not just one agent's action In a Markov game, T is non-stationary (changes over time) In a Markov game, T returns a deterministic next state

Chapter 2: Response Models (Sequential)

In sequential settings, a response model is no longer just a distribution over actions — it's a model of the opponent's policy π_j : S → Δ(A_j). If I know your policy, I can solve my own MDP optimally.

The simplest model: assume opponents play a fixed stationary policy (same distribution over actions at each state, regardless of history). With this assumption, the Markov game from agent i's perspective reduces to an MDP with modified transition:

T'(s'|s, a_i) = ∑_{a_−i} ∏_{j ≠ i} π_j(a_j|s) · T(s'|s, (a_i, a_−i))

This averaged transition treats other agents as part of the stochastic environment. Agent i can then run any standard RL algorithm against this effective MDP.

The non-stationarity problem: If all agents simultaneously update their policies, the "effective MDP" seen by each agent changes as the others' policies change. This violates the stationarity assumption underlying standard RL convergence proofs. Multi-agent RL is fundamentally non-stationary from each agent's perspective — a core difficulty of the field.

Fixed-Policy Response in a 2-Agent Grid World

Agent 2 (orange) follows a fixed policy (moves right). Watch Agent 1 (teal) solve its induced MDP and find the best response. Drag the rationality slider to see softmax vs. greedy behavior.

λ (rationality) 2.0

If agent i assumes agent j plays a fixed stationary policy, what does the Markov game reduce to for agent i?

A standard MDP with a modified transition that averages over j's policy A POMDP where agent i can't observe j's actions A normal form game (single-step interaction)

Chapter 3: Nash Equilibrium for Markov Games

A joint policy π* = (π₁*, …, π_n*) is a Nash equilibrium for a Markov game if, at every state s, no agent can improve their value by unilaterally changing their policy, given that all other agents play π_−i*:

V_i^π*(s) ≥ V_i^{(π_i, π_−i*)}(s) ∀ i, s, π_i

The Nash Q-function for agent i is the expected discounted reward from taking action a_i in state s, given the joint Nash policy thereafter:

Q_i^Nash(s, a_i) = ∑_{a_−i} ∏_j≠i π_j*(a_j|s) ċ [R_i(s,a) + γ ∑_s' T(s'|s,a) V_i^Nash(s')]

And the Nash value is: V_i^Nash(s) = NashVal_i(Q₁^Nash(s,·), …, Q_n^Nash(s,·)) — the value agent i gets in the Nash equilibrium of the stage game at s.

The Bellman equation for Markov games: The Nash Q-function satisfies a recursive equation — a generalization of the Bellman optimality equation for MDPs. The key difference: the backup step at each state requires solving a stage game (finding the Nash equilibrium of the current payoff matrix) rather than taking a simple max. For 2-player zero-sum games, this is an LP at each state. For general-sum, it's harder.

Initialize Q_i(s,a)

Start with zeros or random values for all agents i, states s, joint actions a.

↓

Stage Game Nash

At each state s, compute Q_i(s,·) for all agents. Solve the n-player stage game to get π*(s) and NashVal_i(s).

↓

Bellman Backup

Q_i(s,a) ← R_i(s,a) + γ ∑_s' T(s'|s,a) NashVal_i(s').

↻ repeat until convergence

Existence of Nash equilibrium in Markov games: Every finite Markov game has at least one Nash equilibrium in stationary policies. This extends Nash's theorem from normal form to sequential settings. The proof follows from the existence of Nash equilibria in the induced normal form game (over policy space), using the same fixed-point argument.

In the Nash Bellman backup for a Markov game, what replaces the simple argmax from single-agent value iteration?

Solving a stage game (finding the Nash equilibrium of the payoff matrix at each state) Taking the expectation over all actions with uniform probability Computing the maximum over the joint action space

Chapter 4: Nash Q-Learning

Minimax/Nash value iteration requires knowing T and R. In practice, we often don't. Nash Q-learning (Hu & Wellman, 2003) extends Q-learning to Markov games by replacing the max operator with a Nash equilibrium solver.

Each agent i maintains a Q-table Q_i(s, a) indexed by state and joint action. After experiencing a transition (s, a, r_i, s'), agent i updates:

Q_i(s, a) ← (1−α) Q_i(s, a) + α [r_i + γ NashVal_i(s')]

where NashVal_i(s') is computed by solving the Nash equilibrium of the stage game at s' with payoff matrices Q_i(s', ·) for each agent i. The Nash solution gives both the equilibrium strategy π*(s') and the equilibrium values.

What the agents must observe: Nash Q-learning requires each agent to observe the joint action a taken by all agents, and each agent's reward r_i. This is the "joint action learner" assumption — agents can see what others did (though not their policies). Without joint action observability, Nash Q-learning breaks down: agents can't index Q by a.

python
import numpy as np

class NashQLearner:
    def __init__(self, n_agents, n_states, n_actions, alpha=0.1, gamma=0.95):
        self.n = n_agents; self.alpha = alpha; self.gamma = gamma
        # Q[i][s, a1, a2, ...] — joint action Q-table for agent i
        shape = [n_states] + [n_actions]*n_agents
        self.Q = [np.zeros(shape) for _ in range(n_agents)]

    def update(self, s, joint_a, rewards, sp):
        nash_vals = self.nash_values(sp)  # solve stage game at sp
        for i in range(self.n):
            old = self.Q[i][s][tuple(joint_a)]
            target = rewards[i] + self.gamma * nash_vals[i]
            self.Q[i][s][tuple(joint_a)] = (1-self.alpha)*old + self.alpha*target

    def nash_values(self, s):
        # For 2-agent zero-sum: minimax LP. For general-sum: Nash solver.
        # Here: simplified for 2-agent 2-action zero-sum case
        U1 = self.Q[0][s]  # 2x2 payoff matrix for agent 1
        U2 = self.Q[1][s]
        pi1, pi2 = solve_nash_2x2(U1, U2)
        v1 = (pi1 @ U1 @ pi2)
        v2 = (pi1 @ U2 @ pi2)
        return [v1, v2]

Nash Q-learning convergence conditions: Hu & Wellman (2003) proved convergence under conditions: (1) Q-values are bounded, (2) α decreases appropriately, (3) every (s, a) is visited infinitely often, AND (4) the Nash equilibrium at every state is unique and a "global optimum" (stronger than plain Nash). Condition 4 fails in many games, limiting practical convergence guarantees.

In Nash Q-learning, what replaces the max Q(s', a') from standard Q-learning?

NashVal_i(s') — agent i's value in the Nash equilibrium of the stage game at s' The average Q-value over all actions at s' The sum of all agents' Q-values at s'

Chapter 5: Fictitious Play for Markov Games

Fictitious play for Markov games extends the normal form algorithm from Ch 24 to sequential settings. Each agent maintains empirical counts of the opponents' actions at each state, best-responds to the historical frequency, and updates its own Q-function accordingly.

The state-indexed version: agent i maintains counts C_j(s, a_j) for each opponent j — how many times j played a_j at state s. The empirical frequency is π_j^emp(a_j|s) = C_j(s,a_j) / ∑_a C_j(s,a). Agent i best-responds to this empirical policy by solving an MDP with modified transitions.

π_j^emp(a_j|s) = C_j(s, a_j) / ∑_{a'_j} C_j(s, a'_j)

The key property preserved from normal form: Even when the current strategy cycles, the empirical average strategy converges. In zero-sum Markov games, the pair of empirical average policies converges to a Nash equilibrium of the game. This is the same time-averaging guarantee from Ch 24, extended to sequential settings via Q-value estimates.

Fictitious Play Update

Observe

(s, a_−i, r_i, s'). Increment C_j(s, a_j) for each j ≠ i.

↓

Update

Compute π_j^emp from counts. Solve best-response MDP. Update Q_i.

Versus Nash Q-Learning

Nash Q-learning tries to reach Nash from the start. Fictitious play learns an approximate best-response to observed opponent play — simpler to implement (no Nash solver needed), but requires opponents to settle to a stationary policy for convergence.

Fictitious Play Convergence on Simple Game

Watch empirical frequencies converge over 200 rounds. The current strategy (bright dot) oscillates; the time-average (trail) converges to Nash (gold star).

Step 0

In fictitious play, what strategy does each agent best-respond to?

The empirical frequency of the opponent's past actions (historical average, not current policy) The Nash equilibrium strategy of the opponent The opponent's last observed action

Chapter 6: Gradient Ascent for Markov Games

Gradient ascent from Ch 24 extends directly to Markov games. Each agent maintains a parametric policy π_i(a|s; θ_i) and updates the parameters by ascending the gradient of their expected value:

θ_i ← θ_i + α ∇_{θ_i} V_i^π(s₀)

For tabular policies (π_i(a|s) directly as parameters), the gradient at each state reduces to the policy gradient theorem: the gradient is proportional to the advantage A_i(s, a_i) = Q_i(s, a_i, π_−i) − V_i(s). This is the same as REINFORCE from Ch 11, but with multi-agent Q-values.

Gradient ascent and policy gradient RL are the same thing: Policy gradient methods from Ch 11–13 are exactly gradient ascent applied to a single agent's utility in an environment that happens to contain other agents. The convergence issues of multi-agent gradient ascent are exactly the non-stationarity issues that plague multi-agent RL: as each agent updates, the environment seen by others changes, potentially undoing progress.

python
import numpy as np

def multiagent_gradient_ascent(game, pi_init, alpha=0.01, T=1000):
    """Simultaneous gradient ascent for tabular Markov game.
    game: MarkovGame object with Q-evaluation method.
    pi_init: list of (nS x nA) strategy matrices."""
    policies = [p.copy() for p in pi_init]
    history = []

    for t in range(T):
        # Evaluate all agents' Q-functions under current joint policy
        Q_vals = game.evaluate(policies)  # list of (nS x nA_i) arrays

        # Compute advantage: A_i(s, a_i) = Q_i(s, a_i) - V_i(s)
        for i in range(game.n_agents):
            V_i = (policies[i] * Q_vals[i]).sum(axis=1, keepdims=True)
            A_i = Q_vals[i] - V_i  # advantage: (nS x nA_i)

            # Gradient ascent on log-policy parameters
            policies[i] = policies[i] + alpha * A_i
            policies[i] = np.maximum(policies[i], 1e-10)
            policies[i] /= policies[i].sum(axis=1, keepdims=True)  # renorm

        history.append([p.copy() for p in policies])
    return history

WoLF (Win or Learn Fast): A stabilizing heuristic for gradient ascent in games: use a large step size α_fast when losing (current value < Nash value) and small step size α_slow when winning. Empirically, this reduces oscillation. Bowlings & Veloso (2002) proved WoLF converges in stationary environments and two-player two-action games. Not a universal convergence guarantee, but a useful practical trick.

In multi-agent gradient ascent, what is the key convergence challenge that doesn't exist in single-agent RL?

Non-stationarity: as each agent updates its policy, the effective environment seen by other agents changes, violating RL's stationarity assumption The state space grows exponentially with the number of agents Gradient computation requires knowing other agents' reward functions

Chapter 7: SHOWCASE — Multi-Agent Grid Game

A 4×4 grid world. Agent 1 (teal) starts top-left, Agent 2 (orange) starts top-right. There is one coin (yellow star) in the grid. The agent who reaches it first gets +10; the other gets 0. Colliding costs both −2.

This is a general-sum Markov game: It's not zero-sum (both can avoid collision), not fully cooperative (only one gets the coin). The Nash equilibrium depends on starting positions and the coin location. Watch how different algorithms lead to different equilibrium behaviors.

SHOWCASE: 2-Agent Grid Game

Choose an algorithm and watch the agents learn. After training, click "Run Episode" to see the learned policies in action. Agent 1 = teal, Agent 2 = orange, Coin = gold.

Episode 0 Agent 1 reward: 0 Agent 2 reward: 0

Independent Q-Learning

Each agent runs standard Q-learning, ignoring the other. Simple but non-stationary: each agent's updates shift the environment for the other. Often converges in practice despite no theoretical guarantee.

Nash Q-Learning

Each agent maintains joint-action Q-tables and solves the stage game at each state to get Nash values. Heavier computation but principled. Converges under conditions from Hu & Wellman (2003).

What to observe in the showcase: Independent Q-learning often learns coordination (they avoid each other). Nash Q-learning explicitly reasons about the other agent. Fictitious play updates slowly based on frequency counts. All three can reach similar final behaviors in simple games — the differences emerge in speed and robustness to opponent changes.

Chapter 8: Convergence Analysis

None of the three algorithms (Nash Q-learning, fictitious play, gradient ascent) have universal convergence guarantees for general Markov games. Understanding when and why they converge — and when they don't — is crucial for applying them.

Algorithm	Zero-Sum	Cooperative	General-Sum	Key Requirement
Nash Q-Learning	Yes ✓	Yes ✓	Conditions	Unique global Nash at each state
Minimax Q (zero-sum only)	Yes ✓	N/A	No	2-player, zero-sum, tabular
Fictitious Play	Yes ✓ (avg)	Yes ✓	No	Stationary opponents, finite states
Gradient Ascent	Yes ✓ (avg)	Yes ✓	No	Diminishing step sizes, convexity
Independent Q-Learning	No	Yes (sometimes)	No	Stationary effective environment

The cycling problem in Matching Pennies: In the Matching Pennies Markov game (same state, same payoff at every step), IBR cycles: Heads → Tails → Heads → … Nash Q-learning oscillates. Gradient ascent spirals. Only the time-averaged strategies converge to Nash. This is not an artifact — it's a fundamental property of the game: the only Nash equilibrium is in the interior of the simplex, and gradient dynamics necessarily orbit it.

Convergence Behavior Comparison

Plot of agent 1's probability of playing Action 0 over training. Solid line = current strategy, dashed = time average. The time average converges to Nash even when current policy oscillates.

No-regret learning and Nash in Markov games: If all agents run no-regret algorithms (e.g., EXP3 for adversarial bandits at each state), the empirical joint strategy converges to a correlated equilibrium of the Markov game. In zero-sum games, this is Nash. This is the strongest general convergence result available — no stronger guarantee is known without additional assumptions.

Why does time-averaging the strategies help convergence even when the current strategy oscillates?

The time average smooths out the oscillation; by the no-regret property, the average converges to Nash even if the instantaneous strategy cycles Averaging reduces the number of actions each agent considers Time averaging is equivalent to running a separate optimization for each time step

Chapter 9: Connections & Beyond

Markov games are the right model when agents can observe the full state. In practice, this assumption often fails: robots don't have omniscient sensors, financial agents don't see each other's order books, teammates can't observe each other's locations. Chapter 26 extends to partial observability.

Setting	Model	Solved by
Full obs., competitive	Zero-sum Markov game	Minimax Q, minimax VI
Full obs., general-sum	Markov game	Nash Q, fictitious play, GA
Partial obs., competitive	Partially obs. Markov game	Ch 26 methods
Partial obs., cooperative	Dec-POMDP	Ch 27 methods
Unknown model	MARL (model-free)	Nash Q, independent Q

Multi-agent RL in practice: Modern large-scale MARL systems (AlphaStar, OpenAI Five, AlphaGo) don't use Nash Q-learning — the joint action space is too large. Instead, they use: (1) independent learners with population training (self-play), (2) policy gradient methods with centralized critics (MADDPG, MAPPO), or (3) population-based training with diversity incentives. The theory from this chapter informs but doesn't directly scale to these systems.

Self-play and Nash:

In zero-sum games (chess, Go, poker), self-play (training against your own past policy) provably converges to Nash equilibrium under certain conditions. This is the basis for AlphaZero-style systems. The key insight: self-play is iterated best-response, and in zero-sum games with unique Nash, IBR converges.

Centralized training, decentralized execution:

A common paradigm in cooperative MARL: during training, allow agents to share observations and Q-functions (centralized critic). During execution, each agent acts on its own observations. This resolves non-stationarity during training while maintaining decentralized execution. QMIX, MADDPG, MAPPO all use this pattern.

Mean field games: When n → ∞, strategic interaction with the full population becomes intractable. Mean field game theory (Lasry & Lions 2007) approximates each agent's interaction as interaction with an aggregate distribution (the "mean field") of all others. This reduces n-player Nash to a single-agent control problem against a fixed distribution. Applicable to traffic, financial markets, large-scale robotics.

"Reinforcement learning in multi-agent environments is fundamentally a game. The environment is non-stationary not because the world changes, but because other agents learn and adapt."
— Paraphrase, Shoham, Powers & Grenager, "If multi-agent learning is the answer, what is the question?" (2007)

What is the "centralized training, decentralized execution" paradigm in cooperative MARL?

Agents share observations and Q-values during training (resolving non-stationarity), but act independently on their own observations at execution time A central agent makes all decisions during training; agents execute independently at test time Training uses Nash equilibrium; execution uses independent Q-learning

← Ch 24: Multiagent Reasoning Index Ch 26: State Uncertainty →