Games that unfold over time: agents observe state, choose actions, receive rewards, and transition. Markov games are what you get when multiple agents share an MDP — and Nash equilibrium has to learn Q-values.
You're a robot navigating a warehouse. You've modeled this as an MDP: state (location), action (move), reward (item collected), transition (where you end up). Solved. Optimal policy computed.
Now a second robot enters the warehouse. Same goal. Your paths cross. If you both try to take the same item, conflict. If you coordinate, you both do better. The MDP framework has no model of this — it assumes the environment is fixed, not that it contains another decision-maker.
A Markov game (also called a stochastic game) generalizes the MDP to multiple agents. The world has state, and agents observe and act in it — but now the transition and reward depend on the joint action of all agents.
See how the formalism expands from single-agent to multi-agent. Every component that was about one agent now becomes a joint tuple.
A Markov game (Shapley, 1953) for n agents is a tuple (I, S, A1, …, An, T, R1, …, Rn, γ):
| Component | Notation | Meaning |
|---|---|---|
| Agents | I = {1, …, n} | The players |
| State space | S | Observable world state |
| Action sets | Ai | Actions available to each agent |
| Transition | T(s'|s, a) | Next state distribution given joint action a |
| Reward functions | Ri(s, a, s') | Reward for agent i at each step |
| Discount | γ ∈ [0, 1) | Time-preference for rewards |
At each step: agents observe state s, simultaneously choose actions ai ~ πi(s), world transitions to s' ~ T(·|s,a), each agent receives reward Ri(s,a,s'). Repeat. Each agent seeks to maximize their own expected discounted sum ∑t γt Ri,t.
Agent i's policy πi : S → Δ(Ai) maps states to action distributions. A joint policy π = (π1, …, πn) determines all agents' behavior. The value of a joint policy from state s for agent i is:
At each state s, the agents face a stage game — a normal form game with payoffs determined by V. This is the key decomposition: a Markov game is a collection of state-indexed stage games, connected by the transition dynamics.
In sequential settings, a response model is no longer just a distribution over actions — it's a model of the opponent's policy πj : S → Δ(Aj). If I know your policy, I can solve my own MDP optimally.
The simplest model: assume opponents play a fixed stationary policy (same distribution over actions at each state, regardless of history). With this assumption, the Markov game from agent i's perspective reduces to an MDP with modified transition:
This averaged transition treats other agents as part of the stochastic environment. Agent i can then run any standard RL algorithm against this effective MDP.
Agent 2 (orange) follows a fixed policy (moves right). Watch Agent 1 (teal) solve its induced MDP and find the best response. Drag the rationality slider to see softmax vs. greedy behavior.
A joint policy π* = (π1*, …, πn*) is a Nash equilibrium for a Markov game if, at every state s, no agent can improve their value by unilaterally changing their policy, given that all other agents play π−i*:
The Nash Q-function for agent i is the expected discounted reward from taking action ai in state s, given the joint Nash policy thereafter:
And the Nash value is: ViNash(s) = NashVali(Q1Nash(s,·), …, QnNash(s,·)) — the value agent i gets in the Nash equilibrium of the stage game at s.
Minimax/Nash value iteration requires knowing T and R. In practice, we often don't. Nash Q-learning (Hu & Wellman, 2003) extends Q-learning to Markov games by replacing the max operator with a Nash equilibrium solver.
Each agent i maintains a Q-table Qi(s, a) indexed by state and joint action. After experiencing a transition (s, a, ri, s'), agent i updates:
where NashVali(s') is computed by solving the Nash equilibrium of the stage game at s' with payoff matrices Qi(s', ·) for each agent i. The Nash solution gives both the equilibrium strategy π*(s') and the equilibrium values.
python import numpy as np class NashQLearner: def __init__(self, n_agents, n_states, n_actions, alpha=0.1, gamma=0.95): self.n = n_agents; self.alpha = alpha; self.gamma = gamma # Q[i][s, a1, a2, ...] — joint action Q-table for agent i shape = [n_states] + [n_actions]*n_agents self.Q = [np.zeros(shape) for _ in range(n_agents)] def update(self, s, joint_a, rewards, sp): nash_vals = self.nash_values(sp) # solve stage game at sp for i in range(self.n): old = self.Q[i][s][tuple(joint_a)] target = rewards[i] + self.gamma * nash_vals[i] self.Q[i][s][tuple(joint_a)] = (1-self.alpha)*old + self.alpha*target def nash_values(self, s): # For 2-agent zero-sum: minimax LP. For general-sum: Nash solver. # Here: simplified for 2-agent 2-action zero-sum case U1 = self.Q[0][s] # 2x2 payoff matrix for agent 1 U2 = self.Q[1][s] pi1, pi2 = solve_nash_2x2(U1, U2) v1 = (pi1 @ U1 @ pi2) v2 = (pi1 @ U2 @ pi2) return [v1, v2]
Fictitious play for Markov games extends the normal form algorithm from Ch 24 to sequential settings. Each agent maintains empirical counts of the opponents' actions at each state, best-responds to the historical frequency, and updates its own Q-function accordingly.
The state-indexed version: agent i maintains counts Cj(s, aj) for each opponent j — how many times j played aj at state s. The empirical frequency is πjemp(aj|s) = Cj(s,aj) / ∑a Cj(s,a). Agent i best-responds to this empirical policy by solving an MDP with modified transitions.
Nash Q-learning tries to reach Nash from the start. Fictitious play learns an approximate best-response to observed opponent play — simpler to implement (no Nash solver needed), but requires opponents to settle to a stationary policy for convergence.
Watch empirical frequencies converge over 200 rounds. The current strategy (bright dot) oscillates; the time-average (trail) converges to Nash (gold star).
Gradient ascent from Ch 24 extends directly to Markov games. Each agent maintains a parametric policy πi(a|s; θi) and updates the parameters by ascending the gradient of their expected value:
For tabular policies (πi(a|s) directly as parameters), the gradient at each state reduces to the policy gradient theorem: the gradient is proportional to the advantage Ai(s, ai) = Qi(s, ai, π−i) − Vi(s). This is the same as REINFORCE from Ch 11, but with multi-agent Q-values.
python import numpy as np def multiagent_gradient_ascent(game, pi_init, alpha=0.01, T=1000): """Simultaneous gradient ascent for tabular Markov game. game: MarkovGame object with Q-evaluation method. pi_init: list of (nS x nA) strategy matrices.""" policies = [p.copy() for p in pi_init] history = [] for t in range(T): # Evaluate all agents' Q-functions under current joint policy Q_vals = game.evaluate(policies) # list of (nS x nA_i) arrays # Compute advantage: A_i(s, a_i) = Q_i(s, a_i) - V_i(s) for i in range(game.n_agents): V_i = (policies[i] * Q_vals[i]).sum(axis=1, keepdims=True) A_i = Q_vals[i] - V_i # advantage: (nS x nA_i) # Gradient ascent on log-policy parameters policies[i] = policies[i] + alpha * A_i policies[i] = np.maximum(policies[i], 1e-10) policies[i] /= policies[i].sum(axis=1, keepdims=True) # renorm history.append([p.copy() for p in policies]) return history
A 4×4 grid world. Agent 1 (teal) starts top-left, Agent 2 (orange) starts top-right. There is one coin (yellow star) in the grid. The agent who reaches it first gets +10; the other gets 0. Colliding costs both −2.
Choose an algorithm and watch the agents learn. After training, click "Run Episode" to see the learned policies in action. Agent 1 = teal, Agent 2 = orange, Coin = gold.
Each agent runs standard Q-learning, ignoring the other. Simple but non-stationary: each agent's updates shift the environment for the other. Often converges in practice despite no theoretical guarantee.
Each agent maintains joint-action Q-tables and solves the stage game at each state to get Nash values. Heavier computation but principled. Converges under conditions from Hu & Wellman (2003).
None of the three algorithms (Nash Q-learning, fictitious play, gradient ascent) have universal convergence guarantees for general Markov games. Understanding when and why they converge — and when they don't — is crucial for applying them.
| Algorithm | Zero-Sum | Cooperative | General-Sum | Key Requirement |
|---|---|---|---|---|
| Nash Q-Learning | Yes ✓ | Yes ✓ | Conditions | Unique global Nash at each state |
| Minimax Q (zero-sum only) | Yes ✓ | N/A | No | 2-player, zero-sum, tabular |
| Fictitious Play | Yes ✓ (avg) | Yes ✓ | No | Stationary opponents, finite states |
| Gradient Ascent | Yes ✓ (avg) | Yes ✓ | No | Diminishing step sizes, convexity |
| Independent Q-Learning | No | Yes (sometimes) | No | Stationary effective environment |
Plot of agent 1's probability of playing Action 0 over training. Solid line = current strategy, dashed = time average. The time average converges to Nash even when current policy oscillates.
Markov games are the right model when agents can observe the full state. In practice, this assumption often fails: robots don't have omniscient sensors, financial agents don't see each other's order books, teammates can't observe each other's locations. Chapter 26 extends to partial observability.
| Setting | Model | Solved by |
|---|---|---|
| Full obs., competitive | Zero-sum Markov game | Minimax Q, minimax VI |
| Full obs., general-sum | Markov game | Nash Q, fictitious play, GA |
| Partial obs., competitive | Partially obs. Markov game | Ch 26 methods |
| Partial obs., cooperative | Dec-POMDP | Ch 27 methods |
| Unknown model | MARL (model-free) | Nash Q, independent Q |
Self-play and Nash:
In zero-sum games (chess, Go, poker), self-play (training against your own past policy) provably converges to Nash equilibrium under certain conditions. This is the basis for AlphaZero-style systems. The key insight: self-play is iterated best-response, and in zero-sum games with unique Nash, IBR converges.
Centralized training, decentralized execution:
A common paradigm in cooperative MARL: during training, allow agents to share observations and Q-functions (centralized critic). During execution, each agent acts on its own observations. This resolves non-stationarity during training while maintaining decentralized execution. QMIX, MADDPG, MAPPO all use this pattern.
"Reinforcement learning in multi-agent environments is fundamentally a game. The environment is non-stationary not because the world changes, but because other agents learn and adapt."
— Paraphrase, Shoham, Powers & Grenager, "If multi-agent learning is the answer, what is the question?" (2007)