Kochenderfer, Wheeler & Wray, Chapter 24

Multiagent Reasoning

When two agents share a world, your best action depends on what the other agent does — and they're thinking the same thing about you. Game theory cuts through this infinite regress.

Prerequisites: MDPs (Ch 7) + basic probability. No prior game theory needed.
10
Chapters
5
Simulations
10
Quizzes

Chapter 0: Why Games?

You're in an intersection. Another driver is about to cross too. If you both go, you crash. If you both wait, you both sit forever. The right action for you depends entirely on what the other driver does — and the other driver is thinking the exact same thing about you.

This is the core challenge of multiagent reasoning: when multiple decision-makers share an environment, each agent's optimal action depends on what the others do. Standard MDPs have no concept of this — the transition model T(s'|s,a) treats the world as a fixed backdrop. Add a second agent with its own policy and that backdrop becomes adversarial, competitive, or collaborative.

Game theory is the mathematical study of strategic interaction. Developed by von Neumann and Nash in the 1940s–50s, it asks: what does "rational" behavior look like when multiple self-interested agents interact? The answer is subtle and often counterintuitive.

The fundamental tension: In a single-agent MDP, "optimal" is well-defined — maximize expected discounted reward. In a multiagent game, there's no single objective to maximize. What's optimal for you depends on what I do, and vice versa. Game theory replaces "optimal policy" with equilibrium — a joint strategy where no agent can unilaterally improve their payoff.
The Intersection Game — Payoff Matrix

Driver 1 (you, teal) and Driver 2 (orange) each choose Go or Wait. Click a cell to see what happens at that joint outcome.

Why MDPs aren't enough: An MDP models a single agent in a stochastic environment. The randomness (T, O) is assumed fixed and independent of the agent's policy. When another agent is present, the "randomness" is actually another policy — one that may adapt and respond to what you do. The environment is no longer stationary from your perspective.
Why can't we solve a 2-agent interaction simply by running two separate MDPs?

Chapter 1: Normal Form Games

The simplest multiagent setting is a normal form game (also called a matrix game). Everything happens in a single step: agents simultaneously choose actions, then each receives a reward that depends on the joint action.

Formally, a normal form game for i = 1, …, n agents has three components:

ComponentNotationMeaning
Agent seti ∈ {1, …, n}The players
Action setsAiThe choices available to each agent
Utility functionsUi(a1, …, an)Agent i's reward for the joint action

We write the joint action as a = (a1, …, an) ∈ A1 × ··· × An. Agent i's goal is to maximize Ui(a), but a depends on all agents' choices simultaneously.

The Prisoner's Dilemma — the canonical example: Two suspects interrogated separately. Each can Cooperate (stay silent) or Defect (betray). (C,C): −1 each. (D,D): −5 each. (D,C): defector gets 0, cooperator gets −10. The tragedy: Defect dominates Cooperate for both players individually, yet both cooperating gives a better outcome for everyone. Individual rationality leads to collective irrationality.
Prisoner's Dilemma — Temptation Slider

Drag to change the temptation payoff T (reward for defecting while other cooperates). Watch the Nash equilibrium cell and payoffs update.

Temptation T 10.0
Zero-sum vs. general-sum: A game is zero-sum if ∑i Ui(a) = 0 for all a — one agent's gain is exactly another's loss (poker, chess). A game is general-sum if utilities can be anything — cooperation and conflict can coexist. Most real-world multiagent problems are general-sum.
In the Prisoner's Dilemma, what makes "Both Defect" the Nash equilibrium despite both players preferring "Both Cooperate"?

Chapter 2: Response Models

To choose optimally, agent i needs a model of what agent j will do. A response model is a distribution πj over j's actions. Given πj, agent i can compute its expected utility for each action ai:

Qi(ai, π-i) = ∑a-i π-i(a-i) · Ui(ai, a-i)

The simplest model: agent j plays uniformly at random. More sophisticated: level-k reasoning. A level-0 agent plays uniformly. A level-1 agent best-responds to level-0 beliefs. A level-k agent best-responds to a level-(k−1) opponent.

The type hierarchy problem: Level-k reasoning requires choosing k. If you model the opponent as level-1 and they're actually level-2, your prediction is wrong. Game theory resolves this with equilibrium — a fixed point where every agent's model of the opponent is correct because everyone is best-responding to the actual equilibrium strategy.

Hard Best Response

Given π-i, agent i picks the single action with highest expected utility. Deterministic, but the parameter space is discrete — hard to optimize with gradients.

ai* = argmaxai Qi(ai, π-i)

Softmax (Quantal) Response

Agent i plays proportional to exponentiated expected utility. Smooth, differentiable, controllable by λ.

πi(ai) ∝ exp(λ · Qi(ai, π-i))
Softmax Response — Rationality Slider

Two actions with Q-values [3, −1]. As λ increases, the response concentrates on the better action. λ=0 is uniform random; λ→∞ is hard best-response.

λ (rationality) 1.0
Hierarchical softmax (quantal cognitive hierarchy): Real humans use a mix of levels. The cognitive hierarchy model assumes level k is Poisson(τ)-distributed in the population, each level using softmax response. This model fits human experimental data far better than assuming everyone is rational or everyone is level-1.
What does the precision parameter λ control in the softmax response model?

Chapter 3: Dominant Strategy Equilibrium

Sometimes one action is simply better than all others, regardless of what opponents do. This is called a dominant strategy, and it gives the most decisive solution concept.

Action ai strictly dominates action a'i if, for every possible joint action of the opponents:

Ui(ai, a−i) > Ui(a'i, a−i)   ∀ a−i

A rational agent never plays a strictly dominated action. This gives us a pruning rule: repeatedly eliminate dominated actions from all agents — iterated elimination of dominated strategies (IEDS). Sometimes this fully solves the game in a unique action profile.

The Prisoner's Dilemma resolved by dominance: "Defect" strictly dominates "Cooperate" for both agents. If the other cooperates: Defect gives 0 vs −1. If the other defects: Defect gives −5 vs −10. Defect wins both comparisons for every opponent action. IEDS immediately gives (Defect, Defect) as the unique rational outcome.
Iterated Elimination of Dominated Strategies

A 3×3 game. Teal numbers = Agent 1's payoffs, orange = Agent 2's. Click "Eliminate" to step through removing dominated strategies. Watch the matrix shrink.

3 actions each. Find the dominated strategy.
Weak vs. strict dominance: Action ai weakly dominates a'i if it's at least as good in every case and strictly better in at least one. Iterated weak dominance is order-dependent — eliminating in different orders can leave different games. Strict dominance is order-independent: the final result is always the same regardless of which dominated action you eliminate first.
Action A strictly dominates action B for agent i if:

Chapter 4: Nash Equilibrium

Most games have no dominant strategy. We need a solution concept for the general case. The most important concept in game theory is the Nash equilibrium, introduced by John Nash in 1950.

A joint strategy profile (π1*, …, πn*) is a Nash equilibrium if no agent can improve their expected utility by unilaterally changing their strategy, given that all other agents play their equilibrium strategy:

Uii*, π−i*) ≥ Uii, π−i*)   ∀ i,   ∀ πi

In words: at a Nash equilibrium, every agent is already playing a best response to what everyone else is doing. There's no incentive for any single agent to deviate. It's a fixed point of mutual best-response.

Nash's existence theorem (1950): Every finite game (finite players, finite action sets) has at least one Nash equilibrium, possibly in mixed strategies. The proof uses Brouwer's fixed-point theorem: the best-response correspondence maps the compact convex strategy simplex to itself and is upper hemicontinuous, so it has a fixed point. Nash was 21 when he proved this.
Find Pure Nash Equilibria

Random 3×3 game. Teal = Agent 1 payoff, orange = Agent 2 payoff. A cell is a pure Nash if it's simultaneously the best in its column (Agent 1) and best in its row (Agent 2).

Click Highlight to find Nash equilibria.
Nash equilibria are not necessarily efficient: The joint action maximizing total utility (social optimum) need not be a Nash equilibrium. The gap is measured by the price of anarchy — ratio of worst Nash utility to social optimum. In the Prisoner's Dilemma, the social optimum (Both Cooperate) is not a Nash equilibrium; the Nash equilibrium (Both Defect) gives each agent −5 vs −1.
At a Nash equilibrium, what is true about each agent's strategy?

Chapter 5: Mixed Strategy Nash

Many games have no pure Nash equilibrium. Matching Pennies is the classic: Agent 1 wins if coins match, Agent 2 wins if they differ. Check any pure profile — one agent always wants to deviate. No pure Nash exists.

The resolution: allow randomization. A mixed strategy πi ∈ Δ(Ai) is a probability distribution over actions. Nash's theorem guarantees at least one Nash equilibrium in mixed strategies for every finite game.

The key insight for finding mixed Nash: if agent i mixes between two actions, those actions must give equal expected utility. Otherwise, i would deviate to the better one. So agent j's equilibrium mixing is chosen to make agent i indifferent:

U1(a11, π2*) = U1(a12, π2*)    (agent 2 mixes to make agent 1 indifferent)
You mix to prevent exploitation, not to randomize for fun: In Matching Pennies, if agent 2 plays Heads deterministically, agent 1 exploits this by always playing Heads. Agent 2 must mix 50/50 to deny agent 1 any advantage. The equilibrium mixing is chosen to prevent the opponent from exploiting a predictable pattern — not for any intrinsic value of randomization.
Mixed Nash in Matching Pennies

Agent 2 plays Tails with probability p. Drag to see Agent 1's expected utilities for each pure action. The Nash equilibrium is where the lines cross (agent 1 is indifferent between Heads and Tails).

P(Agent 2 plays Tails) 0.50
python
import numpy as np
from itertools import combinations

def find_mixed_nash_2x2(U1, U2):
    """Find all Nash equilibria for 2x2 game.
    U1, U2: 2x2 numpy arrays. Returns list of (pi1, pi2) tuples."""
    m, n = 2, 2
    equilibria = []

    # Check pure Nash (support size 1x1)
    for r in range(m):
        for c in range(n):
            br1 = U1[r,c] == U1[:,c].max()  # agent 1 BR at col c
            br2 = U2[r,c] == U2[r,:].max()  # agent 2 BR at row r
            if br1 and br2:
                pi1 = np.eye(m)[r]; pi2 = np.eye(n)[c]
                equilibria.append((pi1, pi2))

    # Check fully mixed Nash (support size 2x2)
    # Agent 2 mixes so agent 1 is indifferent: U1[0,:] @ pi2 = U1[1,:] @ pi2
    # => (U1[0,0]-U1[1,0])*p + (U1[0,1]-U1[1,1])*(1-p) = 0
    d1 = (U1[0,0] - U1[1,0]) - (U1[0,1] - U1[1,1])
    if abs(d1) > 1e-10:
        p2 = (U1[1,1] - U1[0,1]) / d1  # pi2(action 0)
        d2 = (U2[0,0] - U2[0,1]) - (U2[1,0] - U2[1,1])
        if abs(d2) > 1e-10:
            p1 = (U2[1,1] - U2[1,0]) / d2  # pi1(action 0)
            if 0 <= p1 <= 1 and 0 <= p2 <= 1:
                equilibria.append((np.array([p1, 1-p1]), np.array([p2, 1-p2])))
    return equilibria
In a mixed Nash equilibrium, why does agent j randomize between actions?

Chapter 6: Correlated Equilibrium

Nash equilibrium assumes agents choose strategies independently. But what if a trusted mediator could send private recommendations to each agent? This leads to a broader concept: the correlated equilibrium, introduced by Robert Aumann in 1974.

A correlated equilibrium is a joint distribution σ(a) over action profiles such that, for each agent i and each action ai in the support, following the recommendation is a best response given that all other agents also follow their (private, unseen) recommendations:

a−i σ(ai, a−i) [Ui(ai, a−i) − Ui(a'i, a−i)] ≥ 0   ∀ i, ai, a'i

Note: agent i sees only their own recommendation, not the others'. Given this, following it must be rational. This is the obedience constraint.

Traffic lights are correlated equilibria: Two drivers at an intersection. Pure Nash: one goes, one waits (but which one?). Mixed Nash: both randomize, risking crashes. A traffic light is a mediator that coordinates: "you go, you wait." Each driver obeys because, given the other obeys, it's their best response. The light achieves coordination that Nash equilibrium can't guarantee.
Battle of the Sexes — Equilibrium Comparison

Agent 1 prefers (Opera, Opera)=2, Agent 2 prefers (Football, Football)=2. Compare expected utilities under each equilibrium concept. The correlated equilibrium achieves 1.5 for each — beating the mixed Nash (0.67).

Click an equilibrium to see the joint distribution and expected utilities.
CE is easier to compute than Nash: Every Nash equilibrium is a correlated equilibrium (take the product distribution). But CE is a larger, convex set — a polytope defined by linear inequalities (the obedience constraints). Maximizing any linear objective over CE is a linear program, solvable in polynomial time. Finding Nash requires solving polynomial equations — PPAD-hard in general.
What distinguishes a correlated equilibrium from a Nash equilibrium?

Chapter 7: SHOWCASE — Learning to Equilibrium

How do agents reach a Nash equilibrium without a central coordinator? Three algorithms dominate the literature: iterated best response (IBR), fictitious play (FP), and gradient ascent (GA). Each takes a different approach to the same goal: finding the fixed point of mutual best-response.

IBR doesn't always converge: In Matching Pennies, IBR cycles forever — Heads provokes Tails, which provokes Heads, … Convergence is guaranteed only in specific game classes: potential games, supermodular games, dominance-solvable games. Fictitious play and gradient ascent have weaker but broader convergence guarantees.
SHOWCASE: IBR vs. Fictitious Play vs. Gradient Ascent

Pick a game and algorithm. Watch the joint strategy evolve in the [0,1]×[0,1] strategy space. X-axis = P(Agent 1 plays Action 0). Y-axis = P(Agent 2 plays Action 0). Gold star = Nash equilibrium. Trail shows history.

Step 0 P1: [0.50, 0.50] P2: [0.50, 0.50]

Fictitious Play

Each agent maintains counts of opponent past actions. At each step, best-respond to the empirical frequency. The current strategy may cycle, but the time-averaged strategy converges to Nash in zero-sum and potential games.

Gradient Ascent

Each agent updates their mixed strategy in the direction of the gradient of expected utility. Project back onto the simplex after each step. Time-average converges in zero-sum games. In general-sum games, can spiral or diverge.

Convergence summary: IBR converges in potential games and dominance-solvable games. Fictitious play time-average converges in zero-sum, certain potential games, and 2×2 general-sum games. Gradient ascent time-average converges in zero-sum games (multiplicative weights = no-regret). None of the three converge reliably in general-sum games.

Chapter 8: Gradient Ascent in Games

Gradient ascent treats each agent's strategy as a continuous parameter and updates it in the direction of increasing expected utility. It's the game-theoretic analog of gradient descent in optimization — except there's no single objective.

Agent i with mixed strategy πi ∈ Δ(Ai) updates via:

πi ← projΔi + α ∇πi Uii, π−i))

The gradient of Ui with respect to πi(ai) is simply the expected utility of playing ai against the current π−i — i.e., Qi(ai, π−i). So the update concentrates probability mass on high-utility actions.

The projΔ operation projects back onto the probability simplex: clamp negatives to 0, then renormalize. This is the only constraint — the parameter is a valid probability distribution.

Gradient ascent is simultaneous, not sequential: Both agents update at the same time. This is fundamentally different from optimization: there's no single loss function being minimized. The gradient of one agent's utility depends on the other's current strategy, which is also changing. This creates oscillation, cycles, and spirals that single-agent optimization never exhibits.
python
import numpy as np

def gradient_ascent(U1, U2, alpha=0.02, T=500):
    """Gradient ascent for 2-player general-sum game.
    U1, U2: (m x n) payoff matrices.
    Returns list of (pi1, pi2) pairs (the trajectory)."""
    m, n = U1.shape
    pi1 = np.ones(m) / m   # start uniform
    pi2 = np.ones(n) / n

    history = [(pi1.copy(), pi2.copy())]
    for _ in range(T):
        # Gradient: Q_i(a_i, pi_{-i}) for each action a_i
        g1 = U1 @ pi2   # shape (m,): EU of each of agent 1's actions
        g2 = U2.T @ pi1  # shape (n,): EU of each of agent 2's actions

        pi1 = pi1 + alpha * g1
        pi2 = pi2 + alpha * g2

        # Project onto simplex: clip, renormalize
        pi1 = np.maximum(pi1, 1e-10); pi1 /= pi1.sum()
        pi2 = np.maximum(pi2, 1e-10); pi2 /= pi2.sum()
        history.append((pi1.copy(), pi2.copy()))

    return history

def fictitious_play(U1, U2, T=500):
    """Fictitious play: best-respond to empirical opponent frequency."""
    m, n = U1.shape
    counts1 = np.ones(m)   # Laplace smoothing
    counts2 = np.ones(n)
    history = []
    for t in range(T):
        freq1 = counts1 / counts1.sum()
        freq2 = counts2 / counts2.sum()
        # Best responses
        a1 = np.argmax(U1 @ freq2)
        a2 = np.argmax(U2.T @ freq1)
        counts1[a1] += 1
        counts2[a2] += 1
        history.append((freq1.copy(), freq2.copy()))
    return history
Multiplicative weights = no-regret = time-averaged Nash in zero-sum: Gradient ascent in zero-sum games is equivalent to the multiplicative weights algorithm, which has the no-regret property: the time-averaged strategy achieves expected utility ≥ best fixed strategy minus O(1/√T). By a minimax argument, the pair of time-averaged strategies forms an ε-Nash equilibrium with ε = O(1/√T).
In gradient ascent for games, what is ∇πi(ai) Ui equal to?

Chapter 9: Connections & Beyond

Normal form games are the foundation, but real-world multiagent problems are sequential, partially observable, and involve many agents. The concepts from this chapter extend in several important directions covered next.

Concept (Ch 24)LimitationExtension
Normal form gamesSingle-shot, simultaneous, full infoMarkov games (Ch 25), extensive form
Pure/mixed NashMultiple equilibria, hard to compute (PPAD)Correlated eq. (LP), approximate Nash
Fictitious play / GAConvergence only in special classesNo-regret learning, CFR
2-player gamesDoesn't scale to many agentsMean field games, population games
Independent utilitiesNo mechanism to coordinateMechanism design, auction theory
PPAD-hardness of Nash (Daskalakis, Goldberg, Papadimitriou 2006): Finding a Nash equilibrium in a general 2-player game is PPAD-complete — as hard as finding a Brouwer fixed point. This means no polynomial-time algorithm is known (unless PPAD = P, which is highly unlikely). This computational hardness motivates approximate methods: gradient ascent, fictitious play, and correlated equilibrium (which is LP-solvable).

Chapter 25: Sequential Problems

When the game repeats across time with state transitions, we get Markov games. Nash equilibrium generalizes to Nash Q-functions. Fictitious play and gradient ascent carry over using Q-value estimates instead of utility lookups.

Chapter 27: Collaborative Agents

When all agents share a reward function (Dec-POMDPs), the multiagent problem reduces to coordination under partial observability. The "game" perspective dissolves — there's no conflict, only the challenge of acting well without full communication.

No-regret learning and Nash: If all agents in a game run no-regret algorithms (e.g., multiplicative weights), the time-averaged joint strategy converges to the set of correlated equilibria. In zero-sum games, this is equivalent to Nash. This is a striking result: decentralized, individual rationality leads to a globally stable outcome.
"Nash equilibrium is not a prescription for how to play — it is a description of what rational play must look like if it is to be stable against unilateral deviation."
— Paraphrase, Osborne & Rubinstein, A Course in Game Theory (1994)
Why does the computational hardness of Nash equilibrium motivate correlated equilibrium as a practical alternative?
← Ch 23: Controller Abstractions Index Ch 25: Sequential Problems →