Ch 24: Multiagent Reasoning — Algorithms for Decision Making

Chapter 0: Why Games?

You're in an intersection. Another driver is about to cross too. If you both go, you crash. If you both wait, you both sit forever. The right action for you depends entirely on what the other driver does — and the other driver is thinking the exact same thing about you.

This is the core challenge of multiagent reasoning: when multiple decision-makers share an environment, each agent's optimal action depends on what the others do. Standard MDPs have no concept of this — the transition model T(s'|s,a) treats the world as a fixed backdrop. Add a second agent with its own policy and that backdrop becomes adversarial, competitive, or collaborative.

Game theory is the mathematical study of strategic interaction. Developed by von Neumann and Nash in the 1940s–50s, it asks: what does "rational" behavior look like when multiple self-interested agents interact? The answer is subtle and often counterintuitive.

The fundamental tension: In a single-agent MDP, "optimal" is well-defined — maximize expected discounted reward. In a multiagent game, there's no single objective to maximize. What's optimal for you depends on what I do, and vice versa. Game theory replaces "optimal policy" with equilibrium — a joint strategy where no agent can unilaterally improve their payoff.

The Intersection Game — Payoff Matrix

Driver 1 (you, teal) and Driver 2 (orange) each choose Go or Wait. Click a cell to see what happens at that joint outcome.

Why MDPs aren't enough: An MDP models a single agent in a stochastic environment. The randomness (T, O) is assumed fixed and independent of the agent's policy. When another agent is present, the "randomness" is actually another policy — one that may adapt and respond to what you do. The environment is no longer stationary from your perspective.

Why can't we solve a 2-agent interaction simply by running two separate MDPs?

Each agent's transition model depends on the other agent's policy, which isn't fixed MDPs can't handle two reward functions simultaneously The state space doubles, making computation intractable

Chapter 1: Normal Form Games

The simplest multiagent setting is a normal form game (also called a matrix game). Everything happens in a single step: agents simultaneously choose actions, then each receives a reward that depends on the joint action.

Formally, a normal form game for i = 1, …, n agents has three components:

Component	Notation	Meaning
Agent set	i ∈ {1, …, n}	The players
Action sets	A_i	The choices available to each agent
Utility functions	U_i(a₁, …, a_n)	Agent i's reward for the joint action

We write the joint action as a = (a₁, …, a_n) ∈ A₁ × ··· × A_n. Agent i's goal is to maximize U_i(a), but a depends on all agents' choices simultaneously.

The Prisoner's Dilemma — the canonical example: Two suspects interrogated separately. Each can Cooperate (stay silent) or Defect (betray). (C,C): −1 each. (D,D): −5 each. (D,C): defector gets 0, cooperator gets −10. The tragedy: Defect dominates Cooperate for both players individually, yet both cooperating gives a better outcome for everyone. Individual rationality leads to collective irrationality.

Prisoner's Dilemma — Temptation Slider

Drag to change the temptation payoff T (reward for defecting while other cooperates). Watch the Nash equilibrium cell and payoffs update.

Temptation T 10.0

Zero-sum vs. general-sum: A game is zero-sum if ∑_i U_i(a) = 0 for all a — one agent's gain is exactly another's loss (poker, chess). A game is general-sum if utilities can be anything — cooperation and conflict can coexist. Most real-world multiagent problems are general-sum.

In the Prisoner's Dilemma, what makes "Both Defect" the Nash equilibrium despite both players preferring "Both Cooperate"?

Defecting is illegal and agents prefer safe actions Agents need to communicate to reach cooperation Defect is each agent's best response regardless of what the other does, so neither can unilaterally improve by switching to Cooperate

Chapter 2: Response Models

To choose optimally, agent i needs a model of what agent j will do. A response model is a distribution π_j over j's actions. Given π_j, agent i can compute its expected utility for each action a_i:

Q_i(a_i, π_-i) = ∑_{a_-i} π_-i(a_-i) · U_i(a_i, a_-i)

The simplest model: agent j plays uniformly at random. More sophisticated: level-k reasoning. A level-0 agent plays uniformly. A level-1 agent best-responds to level-0 beliefs. A level-k agent best-responds to a level-(k−1) opponent.

The type hierarchy problem: Level-k reasoning requires choosing k. If you model the opponent as level-1 and they're actually level-2, your prediction is wrong. Game theory resolves this with equilibrium — a fixed point where every agent's model of the opponent is correct because everyone is best-responding to the actual equilibrium strategy.

Hard Best Response

Given π_-i, agent i picks the single action with highest expected utility. Deterministic, but the parameter space is discrete — hard to optimize with gradients.

a_i* = argmax_{a_i} Q_i(a_i, π_-i)

Softmax (Quantal) Response

Agent i plays proportional to exponentiated expected utility. Smooth, differentiable, controllable by λ.

π_i(a_i) ∝ exp(λ · Q_i(a_i, π_-i))

Softmax Response — Rationality Slider

Two actions with Q-values [3, −1]. As λ increases, the response concentrates on the better action. λ=0 is uniform random; λ→∞ is hard best-response.

λ (rationality) 1.0

Hierarchical softmax (quantal cognitive hierarchy): Real humans use a mix of levels. The cognitive hierarchy model assumes level k is Poisson(τ)-distributed in the population, each level using softmax response. This model fits human experimental data far better than assuming everyone is rational or everyone is level-1.

What does the precision parameter λ control in the softmax response model?

How concentrated the response is on the best action (λ=0 is uniform, large λ is near-greedy) The number of levels of recursive reasoning The discount factor for future rewards

Chapter 3: Dominant Strategy Equilibrium

Sometimes one action is simply better than all others, regardless of what opponents do. This is called a dominant strategy, and it gives the most decisive solution concept.

Action a_i strictly dominates action a'_i if, for every possible joint action of the opponents:

U_i(a_i, a_−i) > U_i(a'_i, a_−i) ∀ a_−i

A rational agent never plays a strictly dominated action. This gives us a pruning rule: repeatedly eliminate dominated actions from all agents — iterated elimination of dominated strategies (IEDS). Sometimes this fully solves the game in a unique action profile.

The Prisoner's Dilemma resolved by dominance: "Defect" strictly dominates "Cooperate" for both agents. If the other cooperates: Defect gives 0 vs −1. If the other defects: Defect gives −5 vs −10. Defect wins both comparisons for every opponent action. IEDS immediately gives (Defect, Defect) as the unique rational outcome.

Iterated Elimination of Dominated Strategies

A 3×3 game. Teal numbers = Agent 1's payoffs, orange = Agent 2's. Click "Eliminate" to step through removing dominated strategies. Watch the matrix shrink.

3 actions each. Find the dominated strategy.

Weak vs. strict dominance: Action a_i weakly dominates a'_i if it's at least as good in every case and strictly better in at least one. Iterated weak dominance is order-dependent — eliminating in different orders can leave different games. Strict dominance is order-independent: the final result is always the same regardless of which dominated action you eliminate first.

Action A strictly dominates action B for agent i if:

A gives strictly higher utility than B for every possible joint action of the other agents A gives higher average utility when opponents play uniformly at random A is preferred by a majority of players in the game

Chapter 4: Nash Equilibrium

Most games have no dominant strategy. We need a solution concept for the general case. The most important concept in game theory is the Nash equilibrium, introduced by John Nash in 1950.

A joint strategy profile (π₁*, …, π_n*) is a Nash equilibrium if no agent can improve their expected utility by unilaterally changing their strategy, given that all other agents play their equilibrium strategy:

U_i(π_i*, π_−i*) ≥ U_i(π_i, π_−i*) ∀ i, ∀ π_i

In words: at a Nash equilibrium, every agent is already playing a best response to what everyone else is doing. There's no incentive for any single agent to deviate. It's a fixed point of mutual best-response.

Nash's existence theorem (1950): Every finite game (finite players, finite action sets) has at least one Nash equilibrium, possibly in mixed strategies. The proof uses Brouwer's fixed-point theorem: the best-response correspondence maps the compact convex strategy simplex to itself and is upper hemicontinuous, so it has a fixed point. Nash was 21 when he proved this.

Find Pure Nash Equilibria

Random 3×3 game. Teal = Agent 1 payoff, orange = Agent 2 payoff. A cell is a pure Nash if it's simultaneously the best in its column (Agent 1) and best in its row (Agent 2).

Click Highlight to find Nash equilibria.

Nash equilibria are not necessarily efficient: The joint action maximizing total utility (social optimum) need not be a Nash equilibrium. The gap is measured by the price of anarchy — ratio of worst Nash utility to social optimum. In the Prisoner's Dilemma, the social optimum (Both Cooperate) is not a Nash equilibrium; the Nash equilibrium (Both Defect) gives each agent −5 vs −1.

At a Nash equilibrium, what is true about each agent's strategy?

Each agent is playing a best response to the other agents' Nash strategies — no unilateral deviation can improve their payoff Each agent maximizes the sum of all agents' utilities The joint action maximizes total social welfare

Chapter 5: Mixed Strategy Nash

Many games have no pure Nash equilibrium. Matching Pennies is the classic: Agent 1 wins if coins match, Agent 2 wins if they differ. Check any pure profile — one agent always wants to deviate. No pure Nash exists.

The resolution: allow randomization. A mixed strategy π_i ∈ Δ(A_i) is a probability distribution over actions. Nash's theorem guarantees at least one Nash equilibrium in mixed strategies for every finite game.

The key insight for finding mixed Nash: if agent i mixes between two actions, those actions must give equal expected utility. Otherwise, i would deviate to the better one. So agent j's equilibrium mixing is chosen to make agent i indifferent:

U₁(a₁¹, π₂*) = U₁(a₁², π₂*) (agent 2 mixes to make agent 1 indifferent)

You mix to prevent exploitation, not to randomize for fun: In Matching Pennies, if agent 2 plays Heads deterministically, agent 1 exploits this by always playing Heads. Agent 2 must mix 50/50 to deny agent 1 any advantage. The equilibrium mixing is chosen to prevent the opponent from exploiting a predictable pattern — not for any intrinsic value of randomization.

Mixed Nash in Matching Pennies

Agent 2 plays Tails with probability p. Drag to see Agent 1's expected utilities for each pure action. The Nash equilibrium is where the lines cross (agent 1 is indifferent between Heads and Tails).

P(Agent 2 plays Tails) 0.50

python
import numpy as np
from itertools import combinations

def find_mixed_nash_2x2(U1, U2):
    """Find all Nash equilibria for 2x2 game.
    U1, U2: 2x2 numpy arrays. Returns list of (pi1, pi2) tuples."""
    m, n = 2, 2
    equilibria = []

    # Check pure Nash (support size 1x1)
    for r in range(m):
        for c in range(n):
            br1 = U1[r,c] == U1[:,c].max()  # agent 1 BR at col c
            br2 = U2[r,c] == U2[r,:].max()  # agent 2 BR at row r
            if br1 and br2:
                pi1 = np.eye(m)[r]; pi2 = np.eye(n)[c]
                equilibria.append((pi1, pi2))

    # Check fully mixed Nash (support size 2x2)
    # Agent 2 mixes so agent 1 is indifferent: U1[0,:] @ pi2 = U1[1,:] @ pi2
    # => (U1[0,0]-U1[1,0])*p + (U1[0,1]-U1[1,1])*(1-p) = 0
    d1 = (U1[0,0] - U1[1,0]) - (U1[0,1] - U1[1,1])
    if abs(d1) > 1e-10:
        p2 = (U1[1,1] - U1[0,1]) / d1  # pi2(action 0)
        d2 = (U2[0,0] - U2[0,1]) - (U2[1,0] - U2[1,1])
        if abs(d2) > 1e-10:
            p1 = (U2[1,1] - U2[1,0]) / d2  # pi1(action 0)
            if 0 <= p1 <= 1 and 0 <= p2 <= 1:
                equilibria.append((np.array([p1, 1-p1]), np.array([p2, 1-p2])))
    return equilibria

In a mixed Nash equilibrium, why does agent j randomize between actions?

To make agent i indifferent between their own actions, preventing exploitation of a predictable pattern Because agent j is uncertain about the game payoffs To maximize the sum of both agents' expected utilities

Chapter 6: Correlated Equilibrium

Nash equilibrium assumes agents choose strategies independently. But what if a trusted mediator could send private recommendations to each agent? This leads to a broader concept: the correlated equilibrium, introduced by Robert Aumann in 1974.

A correlated equilibrium is a joint distribution σ(a) over action profiles such that, for each agent i and each action a_i in the support, following the recommendation is a best response given that all other agents also follow their (private, unseen) recommendations:

∑_{a_−i} σ(a_i, a_−i) [U_i(a_i, a_−i) − U_i(a'_i, a_−i)] ≥ 0 ∀ i, a_i, a'_i

Note: agent i sees only their own recommendation, not the others'. Given this, following it must be rational. This is the obedience constraint.

Traffic lights are correlated equilibria: Two drivers at an intersection. Pure Nash: one goes, one waits (but which one?). Mixed Nash: both randomize, risking crashes. A traffic light is a mediator that coordinates: "you go, you wait." Each driver obeys because, given the other obeys, it's their best response. The light achieves coordination that Nash equilibrium can't guarantee.

Battle of the Sexes — Equilibrium Comparison

Agent 1 prefers (Opera, Opera)=2, Agent 2 prefers (Football, Football)=2. Compare expected utilities under each equilibrium concept. The correlated equilibrium achieves 1.5 for each — beating the mixed Nash (0.67).

Click an equilibrium to see the joint distribution and expected utilities.

CE is easier to compute than Nash: Every Nash equilibrium is a correlated equilibrium (take the product distribution). But CE is a larger, convex set — a polytope defined by linear inequalities (the obedience constraints). Maximizing any linear objective over CE is a linear program, solvable in polynomial time. Finding Nash requires solving polynomial equations — PPAD-hard in general.

What distinguishes a correlated equilibrium from a Nash equilibrium?

A correlated equilibrium allows a joint distribution over action profiles (coordination via private recommendations), not just a product of independent strategies A correlated equilibrium requires all agents to play the same action Correlated equilibria only exist in zero-sum games

Chapter 7: SHOWCASE — Learning to Equilibrium

How do agents reach a Nash equilibrium without a central coordinator? Three algorithms dominate the literature: iterated best response (IBR), fictitious play (FP), and gradient ascent (GA). Each takes a different approach to the same goal: finding the fixed point of mutual best-response.

IBR doesn't always converge: In Matching Pennies, IBR cycles forever — Heads provokes Tails, which provokes Heads, … Convergence is guaranteed only in specific game classes: potential games, supermodular games, dominance-solvable games. Fictitious play and gradient ascent have weaker but broader convergence guarantees.

SHOWCASE: IBR vs. Fictitious Play vs. Gradient Ascent

Pick a game and algorithm. Watch the joint strategy evolve in the [0,1]×[0,1] strategy space. X-axis = P(Agent 1 plays Action 0). Y-axis = P(Agent 2 plays Action 0). Gold star = Nash equilibrium. Trail shows history.

Step 0 P1: [0.50, 0.50] P2: [0.50, 0.50]

Fictitious Play

Each agent maintains counts of opponent past actions. At each step, best-respond to the empirical frequency. The current strategy may cycle, but the time-averaged strategy converges to Nash in zero-sum and potential games.

Gradient Ascent

Each agent updates their mixed strategy in the direction of the gradient of expected utility. Project back onto the simplex after each step. Time-average converges in zero-sum games. In general-sum games, can spiral or diverge.

Convergence summary: IBR converges in potential games and dominance-solvable games. Fictitious play time-average converges in zero-sum, certain potential games, and 2×2 general-sum games. Gradient ascent time-average converges in zero-sum games (multiplicative weights = no-regret). None of the three converge reliably in general-sum games.

Chapter 8: Gradient Ascent in Games

Gradient ascent treats each agent's strategy as a continuous parameter and updates it in the direction of increasing expected utility. It's the game-theoretic analog of gradient descent in optimization — except there's no single objective.

Agent i with mixed strategy π_i ∈ Δ(A_i) updates via:

π_i ← proj_Δ(π_i + α ∇_{π_i} U_i(π_i, π_−i))

The gradient of U_i with respect to π_i(a_i) is simply the expected utility of playing a_i against the current π_−i — i.e., Q_i(a_i, π_−i). So the update concentrates probability mass on high-utility actions.

The proj_Δ operation projects back onto the probability simplex: clamp negatives to 0, then renormalize. This is the only constraint — the parameter is a valid probability distribution.

Gradient ascent is simultaneous, not sequential: Both agents update at the same time. This is fundamentally different from optimization: there's no single loss function being minimized. The gradient of one agent's utility depends on the other's current strategy, which is also changing. This creates oscillation, cycles, and spirals that single-agent optimization never exhibits.

python
import numpy as np

def gradient_ascent(U1, U2, alpha=0.02, T=500):
    """Gradient ascent for 2-player general-sum game.
    U1, U2: (m x n) payoff matrices.
    Returns list of (pi1, pi2) pairs (the trajectory)."""
    m, n = U1.shape
    pi1 = np.ones(m) / m   # start uniform
    pi2 = np.ones(n) / n

    history = [(pi1.copy(), pi2.copy())]
    for _ in range(T):
        # Gradient: Q_i(a_i, pi_{-i}) for each action a_i
        g1 = U1 @ pi2   # shape (m,): EU of each of agent 1's actions
        g2 = U2.T @ pi1  # shape (n,): EU of each of agent 2's actions

        pi1 = pi1 + alpha * g1
        pi2 = pi2 + alpha * g2

        # Project onto simplex: clip, renormalize
        pi1 = np.maximum(pi1, 1e-10); pi1 /= pi1.sum()
        pi2 = np.maximum(pi2, 1e-10); pi2 /= pi2.sum()
        history.append((pi1.copy(), pi2.copy()))

    return history

def fictitious_play(U1, U2, T=500):
    """Fictitious play: best-respond to empirical opponent frequency."""
    m, n = U1.shape
    counts1 = np.ones(m)   # Laplace smoothing
    counts2 = np.ones(n)
    history = []
    for t in range(T):
        freq1 = counts1 / counts1.sum()
        freq2 = counts2 / counts2.sum()
        # Best responses
        a1 = np.argmax(U1 @ freq2)
        a2 = np.argmax(U2.T @ freq1)
        counts1[a1] += 1
        counts2[a2] += 1
        history.append((freq1.copy(), freq2.copy()))
    return history

Multiplicative weights = no-regret = time-averaged Nash in zero-sum: Gradient ascent in zero-sum games is equivalent to the multiplicative weights algorithm, which has the no-regret property: the time-averaged strategy achieves expected utility ≥ best fixed strategy minus O(1/√T). By a minimax argument, the pair of time-averaged strategies forms an ε-Nash equilibrium with ε = O(1/√T).

In gradient ascent for games, what is ∇_{π_i(a_i)} U_i equal to?

The expected utility of playing action a_i against the current opponent strategy π_−i The Nash equilibrium probability of action a_i The marginal increase in utility from playing a_i one more time

Chapter 9: Connections & Beyond

Normal form games are the foundation, but real-world multiagent problems are sequential, partially observable, and involve many agents. The concepts from this chapter extend in several important directions covered next.

Concept (Ch 24)	Limitation	Extension
Normal form games	Single-shot, simultaneous, full info	Markov games (Ch 25), extensive form
Pure/mixed Nash	Multiple equilibria, hard to compute (PPAD)	Correlated eq. (LP), approximate Nash
Fictitious play / GA	Convergence only in special classes	No-regret learning, CFR
2-player games	Doesn't scale to many agents	Mean field games, population games
Independent utilities	No mechanism to coordinate	Mechanism design, auction theory

PPAD-hardness of Nash (Daskalakis, Goldberg, Papadimitriou 2006): Finding a Nash equilibrium in a general 2-player game is PPAD-complete — as hard as finding a Brouwer fixed point. This means no polynomial-time algorithm is known (unless PPAD = P, which is highly unlikely). This computational hardness motivates approximate methods: gradient ascent, fictitious play, and correlated equilibrium (which is LP-solvable).

Chapter 25: Sequential Problems

When the game repeats across time with state transitions, we get Markov games. Nash equilibrium generalizes to Nash Q-functions. Fictitious play and gradient ascent carry over using Q-value estimates instead of utility lookups.

Chapter 27: Collaborative Agents

When all agents share a reward function (Dec-POMDPs), the multiagent problem reduces to coordination under partial observability. The "game" perspective dissolves — there's no conflict, only the challenge of acting well without full communication.

No-regret learning and Nash: If all agents in a game run no-regret algorithms (e.g., multiplicative weights), the time-averaged joint strategy converges to the set of correlated equilibria. In zero-sum games, this is equivalent to Nash. This is a striking result: decentralized, individual rationality leads to a globally stable outcome.

"Nash equilibrium is not a prescription for how to play — it is a description of what rational play must look like if it is to be stable against unilateral deviation."
— Paraphrase, Osborne & Rubinstein, A Course in Game Theory (1994)

Why does the computational hardness of Nash equilibrium motivate correlated equilibrium as a practical alternative?

Finding a correlated equilibrium is a linear program (polynomial time), while Nash is PPAD-hard; and every Nash equilibrium is a correlated equilibrium Correlated equilibrium gives higher utility than Nash in all games Correlated equilibrium is unique while Nash may not exist

← Ch 23: Controller Abstractions Index Ch 25: Sequential Problems →