When two agents share a world, your best action depends on what the other agent does — and they're thinking the same thing about you. Game theory cuts through this infinite regress.
You're in an intersection. Another driver is about to cross too. If you both go, you crash. If you both wait, you both sit forever. The right action for you depends entirely on what the other driver does — and the other driver is thinking the exact same thing about you.
This is the core challenge of multiagent reasoning: when multiple decision-makers share an environment, each agent's optimal action depends on what the others do. Standard MDPs have no concept of this — the transition model T(s'|s,a) treats the world as a fixed backdrop. Add a second agent with its own policy and that backdrop becomes adversarial, competitive, or collaborative.
Game theory is the mathematical study of strategic interaction. Developed by von Neumann and Nash in the 1940s–50s, it asks: what does "rational" behavior look like when multiple self-interested agents interact? The answer is subtle and often counterintuitive.
Driver 1 (you, teal) and Driver 2 (orange) each choose Go or Wait. Click a cell to see what happens at that joint outcome.
The simplest multiagent setting is a normal form game (also called a matrix game). Everything happens in a single step: agents simultaneously choose actions, then each receives a reward that depends on the joint action.
Formally, a normal form game for i = 1, …, n agents has three components:
| Component | Notation | Meaning |
|---|---|---|
| Agent set | i ∈ {1, …, n} | The players |
| Action sets | Ai | The choices available to each agent |
| Utility functions | Ui(a1, …, an) | Agent i's reward for the joint action |
We write the joint action as a = (a1, …, an) ∈ A1 × ··· × An. Agent i's goal is to maximize Ui(a), but a depends on all agents' choices simultaneously.
Drag to change the temptation payoff T (reward for defecting while other cooperates). Watch the Nash equilibrium cell and payoffs update.
To choose optimally, agent i needs a model of what agent j will do. A response model is a distribution πj over j's actions. Given πj, agent i can compute its expected utility for each action ai:
The simplest model: agent j plays uniformly at random. More sophisticated: level-k reasoning. A level-0 agent plays uniformly. A level-1 agent best-responds to level-0 beliefs. A level-k agent best-responds to a level-(k−1) opponent.
Given π-i, agent i picks the single action with highest expected utility. Deterministic, but the parameter space is discrete — hard to optimize with gradients.
Agent i plays proportional to exponentiated expected utility. Smooth, differentiable, controllable by λ.
Two actions with Q-values [3, −1]. As λ increases, the response concentrates on the better action. λ=0 is uniform random; λ→∞ is hard best-response.
Sometimes one action is simply better than all others, regardless of what opponents do. This is called a dominant strategy, and it gives the most decisive solution concept.
Action ai strictly dominates action a'i if, for every possible joint action of the opponents:
A rational agent never plays a strictly dominated action. This gives us a pruning rule: repeatedly eliminate dominated actions from all agents — iterated elimination of dominated strategies (IEDS). Sometimes this fully solves the game in a unique action profile.
A 3×3 game. Teal numbers = Agent 1's payoffs, orange = Agent 2's. Click "Eliminate" to step through removing dominated strategies. Watch the matrix shrink.
Most games have no dominant strategy. We need a solution concept for the general case. The most important concept in game theory is the Nash equilibrium, introduced by John Nash in 1950.
A joint strategy profile (π1*, …, πn*) is a Nash equilibrium if no agent can improve their expected utility by unilaterally changing their strategy, given that all other agents play their equilibrium strategy:
In words: at a Nash equilibrium, every agent is already playing a best response to what everyone else is doing. There's no incentive for any single agent to deviate. It's a fixed point of mutual best-response.
Random 3×3 game. Teal = Agent 1 payoff, orange = Agent 2 payoff. A cell is a pure Nash if it's simultaneously the best in its column (Agent 1) and best in its row (Agent 2).
Many games have no pure Nash equilibrium. Matching Pennies is the classic: Agent 1 wins if coins match, Agent 2 wins if they differ. Check any pure profile — one agent always wants to deviate. No pure Nash exists.
The resolution: allow randomization. A mixed strategy πi ∈ Δ(Ai) is a probability distribution over actions. Nash's theorem guarantees at least one Nash equilibrium in mixed strategies for every finite game.
The key insight for finding mixed Nash: if agent i mixes between two actions, those actions must give equal expected utility. Otherwise, i would deviate to the better one. So agent j's equilibrium mixing is chosen to make agent i indifferent:
Agent 2 plays Tails with probability p. Drag to see Agent 1's expected utilities for each pure action. The Nash equilibrium is where the lines cross (agent 1 is indifferent between Heads and Tails).
python import numpy as np from itertools import combinations def find_mixed_nash_2x2(U1, U2): """Find all Nash equilibria for 2x2 game. U1, U2: 2x2 numpy arrays. Returns list of (pi1, pi2) tuples.""" m, n = 2, 2 equilibria = [] # Check pure Nash (support size 1x1) for r in range(m): for c in range(n): br1 = U1[r,c] == U1[:,c].max() # agent 1 BR at col c br2 = U2[r,c] == U2[r,:].max() # agent 2 BR at row r if br1 and br2: pi1 = np.eye(m)[r]; pi2 = np.eye(n)[c] equilibria.append((pi1, pi2)) # Check fully mixed Nash (support size 2x2) # Agent 2 mixes so agent 1 is indifferent: U1[0,:] @ pi2 = U1[1,:] @ pi2 # => (U1[0,0]-U1[1,0])*p + (U1[0,1]-U1[1,1])*(1-p) = 0 d1 = (U1[0,0] - U1[1,0]) - (U1[0,1] - U1[1,1]) if abs(d1) > 1e-10: p2 = (U1[1,1] - U1[0,1]) / d1 # pi2(action 0) d2 = (U2[0,0] - U2[0,1]) - (U2[1,0] - U2[1,1]) if abs(d2) > 1e-10: p1 = (U2[1,1] - U2[1,0]) / d2 # pi1(action 0) if 0 <= p1 <= 1 and 0 <= p2 <= 1: equilibria.append((np.array([p1, 1-p1]), np.array([p2, 1-p2]))) return equilibria
Nash equilibrium assumes agents choose strategies independently. But what if a trusted mediator could send private recommendations to each agent? This leads to a broader concept: the correlated equilibrium, introduced by Robert Aumann in 1974.
A correlated equilibrium is a joint distribution σ(a) over action profiles such that, for each agent i and each action ai in the support, following the recommendation is a best response given that all other agents also follow their (private, unseen) recommendations:
Note: agent i sees only their own recommendation, not the others'. Given this, following it must be rational. This is the obedience constraint.
Agent 1 prefers (Opera, Opera)=2, Agent 2 prefers (Football, Football)=2. Compare expected utilities under each equilibrium concept. The correlated equilibrium achieves 1.5 for each — beating the mixed Nash (0.67).
How do agents reach a Nash equilibrium without a central coordinator? Three algorithms dominate the literature: iterated best response (IBR), fictitious play (FP), and gradient ascent (GA). Each takes a different approach to the same goal: finding the fixed point of mutual best-response.
Pick a game and algorithm. Watch the joint strategy evolve in the [0,1]×[0,1] strategy space. X-axis = P(Agent 1 plays Action 0). Y-axis = P(Agent 2 plays Action 0). Gold star = Nash equilibrium. Trail shows history.
Each agent maintains counts of opponent past actions. At each step, best-respond to the empirical frequency. The current strategy may cycle, but the time-averaged strategy converges to Nash in zero-sum and potential games.
Each agent updates their mixed strategy in the direction of the gradient of expected utility. Project back onto the simplex after each step. Time-average converges in zero-sum games. In general-sum games, can spiral or diverge.
Gradient ascent treats each agent's strategy as a continuous parameter and updates it in the direction of increasing expected utility. It's the game-theoretic analog of gradient descent in optimization — except there's no single objective.
Agent i with mixed strategy πi ∈ Δ(Ai) updates via:
The gradient of Ui with respect to πi(ai) is simply the expected utility of playing ai against the current π−i — i.e., Qi(ai, π−i). So the update concentrates probability mass on high-utility actions.
The projΔ operation projects back onto the probability simplex: clamp negatives to 0, then renormalize. This is the only constraint — the parameter is a valid probability distribution.
python import numpy as np def gradient_ascent(U1, U2, alpha=0.02, T=500): """Gradient ascent for 2-player general-sum game. U1, U2: (m x n) payoff matrices. Returns list of (pi1, pi2) pairs (the trajectory).""" m, n = U1.shape pi1 = np.ones(m) / m # start uniform pi2 = np.ones(n) / n history = [(pi1.copy(), pi2.copy())] for _ in range(T): # Gradient: Q_i(a_i, pi_{-i}) for each action a_i g1 = U1 @ pi2 # shape (m,): EU of each of agent 1's actions g2 = U2.T @ pi1 # shape (n,): EU of each of agent 2's actions pi1 = pi1 + alpha * g1 pi2 = pi2 + alpha * g2 # Project onto simplex: clip, renormalize pi1 = np.maximum(pi1, 1e-10); pi1 /= pi1.sum() pi2 = np.maximum(pi2, 1e-10); pi2 /= pi2.sum() history.append((pi1.copy(), pi2.copy())) return history def fictitious_play(U1, U2, T=500): """Fictitious play: best-respond to empirical opponent frequency.""" m, n = U1.shape counts1 = np.ones(m) # Laplace smoothing counts2 = np.ones(n) history = [] for t in range(T): freq1 = counts1 / counts1.sum() freq2 = counts2 / counts2.sum() # Best responses a1 = np.argmax(U1 @ freq2) a2 = np.argmax(U2.T @ freq1) counts1[a1] += 1 counts2[a2] += 1 history.append((freq1.copy(), freq2.copy())) return history
Normal form games are the foundation, but real-world multiagent problems are sequential, partially observable, and involve many agents. The concepts from this chapter extend in several important directions covered next.
| Concept (Ch 24) | Limitation | Extension |
|---|---|---|
| Normal form games | Single-shot, simultaneous, full info | Markov games (Ch 25), extensive form |
| Pure/mixed Nash | Multiple equilibria, hard to compute (PPAD) | Correlated eq. (LP), approximate Nash |
| Fictitious play / GA | Convergence only in special classes | No-regret learning, CFR |
| 2-player games | Doesn't scale to many agents | Mean field games, population games |
| Independent utilities | No mechanism to coordinate | Mechanism design, auction theory |
Chapter 25: Sequential Problems
When the game repeats across time with state transitions, we get Markov games. Nash equilibrium generalizes to Nash Q-functions. Fictitious play and gradient ascent carry over using Q-value estimates instead of utility lookups.
Chapter 27: Collaborative Agents
When all agents share a reward function (Dec-POMDPs), the multiagent problem reduces to coordination under partial observability. The "game" perspective dissolves — there's no conflict, only the challenge of acting well without full communication.
"Nash equilibrium is not a prescription for how to play — it is a description of what rational play must look like if it is to be stable against unilateral deviation."
— Paraphrase, Osborne & Rubinstein, A Course in Game Theory (1994)