Kochenderfer, Wheeler & Wray, Chapter 26

State Uncertainty

When multiple agents can't see the full state, the game becomes a partially observable Markov game — combining the complexity of POMDPs with the strategic interaction of multiagent systems.

Prerequisites: POMDPs (Ch 19–20) + Markov games (Ch 25). Beliefs and Nash must be familiar.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Blind Players

Two spies are communicating through coded messages. Neither can see the other's location. Neither knows if their messages are getting through. Each must decide when to act, when to wait — without ever observing the full situation.

This is what happens when you combine partial observability with multiagent interaction. In a single-agent POMDP (Ch 19–22), you maintain a belief over the world state and plan accordingly. When a second agent enters, they also have incomplete information — and your uncertainty is compounded by theirs.

A partially observable Markov game (POMG) is what you get when multiple agents with private observations share a Markov game. Each agent sees only their own observation, not the full state. The strategic calculation must account for both physical uncertainty (what's the state?) and social uncertainty (what does the other agent believe about the state?).

Two kinds of uncertainty stack: In a POMDP, uncertainty is about the world. In a POMG, uncertainty is about the world AND about what other agents know about the world AND about what other agents know you know. This nesting can be infinite. Managing it tractably is the central challenge of this chapter.
Uncertainty Stacking Visualization

Compare the belief hierarchies across settings. Click each to see how the epistemic structure grows.

Click a model to see its epistemic structure.
Why POMGs are strictly harder than either POMDPs or Markov games: A POMDP has a belief state b ∈ Δ(S). You can plan over beliefs. A Markov game requires Nash equilibrium over policies. A POMG requires Nash equilibrium over policies over beliefs. The belief state itself depends on the agents' observation histories, which in turn depend on the other agents' actions. Everything is coupled.
What makes a partially observable Markov game strictly harder than a Markov game?

Chapter 1: POMG Formalism

A partially observable Markov game (POMG) for n agents is (I, S, A1, …, An, T, R1, …, Rn, Ω1, …, Ωn, O, γ):

ComponentNotationMeaning
Agents, states, actionsI, S, AiSame as Markov game
TransitionT(s'|s, a)Next state given joint action
RewardsRi(s, a)Agent i's reward
Observation spacesΩiPrivate observations for each agent i
Observation modelO(o1, …, on|s', a)Joint observation probability
DiscountγTime preference

After each step, the world moves to s' ~ T(·|s,a). Each agent i receives a private observation oi ~ Oi(·|s',a). Agent i's policy maps their observation history to an action: πi : (oi,1, ai,1, …, oi,t) → Δ(Ai). Note: agent i does not see the observations of other agents.

Observations can be conditionally independent: Often O(o1, …, on|s', a) = ∏i Oi(oi|s', a). Each agent's noisy sensor independently reads the world state. This factored observation model is almost universal in practice — it avoids modeling correlations between agents' sensors. The joint observation model is needed in theory to define belief updates; conditional independence simplifies computation.

Special Case: Dec-POMDP

When all agents share the same reward function (R1 = … = Rn = R), the POMG becomes a Dec-POMDP (Ch 27). No game theory needed — it's purely a planning problem. But it's NEXP-complete.

Special Case: Zero-Sum POMG

When R1 = −R2, the POMG is a two-player zero-sum game with partial observability. Minimax is the right solution concept. This includes most adversarial games (e.g., poker, hide-and-seek).

In a POMG, what does agent i's policy map from and to?

Chapter 2: Private Beliefs

In a POMDP, there's one belief state b(s) = P(s | o1, a1, …, ot). In a POMG, each agent maintains their own private belief bi(s) = P(s | oi,1, ai,1, …, oi,t). Agent i conditions only on their own history, not the others'.

Agent i's belief update after taking action ai and receiving observation oi is a Bayesian update:

bi'(s') ∝ Oi(oi|s', a) ∑s T(s'|s, a) bi(s)

But wait — this update requires conditioning on the joint action a. Agent i knows their own action ai, but must model the other agents' actions a−i. Without knowing the others' policies, agent i must marginalize over their opponents' strategies to compute the update:

bi'(s') ∝ ∑a−i π−i(a−i|·) ċ Oi(oi|s', (ai, a−i)) ∑s T(s'|s, (ai, a−i)) bi(s)
The belief depends on the opponent's policy: Agent i's belief update requires knowing π−i — the other agents' strategies. But π−i is what we're trying to find (it's part of the equilibrium). This circularity is the core difficulty: the belief and the equilibrium strategy are mutually dependent. Solution: assume a model of opponents' policies and solve for self-consistency.
Private Belief Update Comparison

Two agents, each with private observations. Agent 1's belief (teal) and Agent 2's belief (orange) diverge over time even from the same true state. Click "Step" to advance.

t=0. True state: S1.
Common knowledge assumption: Many POMG algorithms assume agents share a common prior b0 and common knowledge of the game (T, R, O). With a common prior and known policies, each agent can compute what the other's belief must be — even without directly observing their observations. This is the basis for interactive belief modeling (I-POMDPs).
Why does agent i's private belief update depend on the other agents' policies?

Chapter 3: Policy Evaluation

Given a joint policy (π1, …, πn), how do we compute each agent's expected value? In a fully observable Markov game, this was a system of linear equations. In a POMG, policies depend on observation histories, which grow exponentially — so we need a different approach.

For finite-horizon POMGs (T steps), we can evaluate a joint policy by unrolling the observation history tree. The joint observation history at time t is ht = (a1, o1, …, at−1, ot) where each o is a joint observation. The number of nodes in this tree is |A|t|Ω|t — exponential in t.

Viπ(b0) = Eπ[∑t=0T γt Ri(st, at) | b0]

The expectation is over the joint stochastic process defined by T, O, and the policy π. We compute it by summing over all possible observation histories, weighting by their probability under the joint policy.

The tree is exponentially large: At each step, the observation history branches by |Ω1| × |Ω2| × … × |Ωn|. After T steps, the number of leaf nodes is (|A|·|Ω|)T. For even modest problems (|A|=4, |Ω|=4, T=10), this is 410 = 106 nodes — and this is just for policy evaluation, not optimization. This is why POMG is strictly harder than POMDP (already PSPACE-complete) and harder than Markov games.
python
def evaluate_joint_policy(pomg, pi, b0, horizon):
    """Monte Carlo evaluation of joint policy pi in POMG.
    pomg: POMGame object (step, reward methods).
    pi: list of n policy functions pi[i](obs_history, action_history).
    b0: initial belief (distribution over states).
    Returns estimated value for each agent."""
    n = pomg.n_agents
    total_values = [0.0] * n
    n_rollouts = 200

    for _ in range(n_rollouts):
        s = pomg.sample_state(b0)  # sample initial state
        obs_hist = [[] for _ in range(n)]
        act_hist = [[] for _ in range(n)]
        discount = 1.0

        for t in range(horizon):
            # Each agent chooses action from their private history
            actions = [pi[i](obs_hist[i], act_hist[i]) for i in range(n)]
            # Step the environment
            sp, obs, rewards = pomg.step(s, actions)
            for i in range(n):
                total_values[i] += discount * rewards[i]
                obs_hist[i].append(obs[i])
                act_hist[i].append(actions[i])
            s = sp; discount *= pomg.gamma

    return [v / n_rollouts for v in total_values]
Why does finite-horizon POMG policy evaluation have complexity exponential in the horizon T?

Chapter 4: Nash Equilibrium for POMGs

A joint policy (π1*, …, πn*) is a Nash equilibrium for a POMG if no agent can improve their expected value by unilaterally changing their behavioral policy, given the other agents' equilibrium policies:

Viπ*(b0) ≥ Vii, π−i*)(b0)   ∀ i, πi

Note: πi here is a behavioral policy — it maps each possible private history to an action distribution. This is a much richer strategy space than normal form strategies (single probability vector) or Markov policies (state-indexed).

Behavioral vs. mixed strategies in extensive form games: In game theory, a POMG can be written as an extensive form game. Kuhn's theorem says that in games with perfect recall, behavioral strategies (randomize at each information set) are equivalent to mixed strategies (randomize over deterministic plans). This means finding a behavioral Nash is equivalent to finding a mixed Nash — but the strategy space is exponentially larger.
Finite games have Nash equilibria: Since POMGs have finite agents, finite actions, and observation histories form a finite tree (for finite horizon), the induced extensive form game is finite. Nash's theorem guarantees at least one Nash equilibrium in behavioral strategies. For infinite horizon, the strategy space is infinite-dimensional and existence requires more care (functional analysis).
Poker: A Zero-Sum POMG

Simplified 2-card poker: each player is dealt one card (High/Low) privately. Player 1 can Bet or Check. Drag to see the Nash equilibrium bluffing strategy — the optimal mix of betting with low cards to keep the opponent from exploiting pure strategies.

P(Player 1 bluffs with Low card) 0.33
What does "behavioral policy" mean in the context of POMG Nash equilibrium?

Chapter 5: Dynamic Programming

For finite-horizon POMGs, dynamic programming provides an exact backward induction algorithm. Starting from the last time step and working backwards, we compute the optimal behavioral strategies at each decision point.

Define the reachability of an observation history h under joint policy π and initial belief b0. The backward induction computes, at each observation node, the Nash equilibrium of the stage game with continuation values as payoffs.

The backward induction tree: At the last step T, each agent faces a one-shot normal form game (the stage game payoff is just Ri, no continuation). We solve for Nash. At step T−1, the continuation value from step T becomes part of the payoff. We fold these values back and solve the stage game at T−1. Repeat. This is exactly like backward induction in extensive form game theory.
Base case (t=T)
At each leaf of the observation tree, solve the stage game with payoffs Ri(s,a) weighted by belief. Get Nash strategies (π*T) and Nash values VTi.
↑ backward induction
Step (t<T)
Payoff = Ri + γ ∑ T O Vt+1i. Solve stage game. Get (π*t, Vti).
↑ backward induction
Base case (t=0)
V0i(b0) is the POMG value. π* is the Nash equilibrium policy for each agent.
Why the tree is exponentially large: At each time step, observations branch. After t steps, the number of distinct observation histories is |Ω1|t × |Ω2|t. The backward induction tree has exponentially many nodes. For t=5, |Ω|=4, 2 agents: 410 ≈ 106 nodes. For horizon 20: intractable without approximation. This is why approximations (alpha vectors, MCTS, PBVI) from Chs 20–22 are needed here too.
python
def backward_induction_pomg(pomg, b0, horizon):
    """Exact backward induction for finite-horizon 2-player POMG.
    Returns Nash equilibrium strategies and value."""
    def solve_node(belief, t, hist1, hist2):
        # Stage game payoffs: integrate over states weighted by belief
        U1 = np.zeros((pomg.nA1, pomg.nA2))
        U2 = np.zeros((pomg.nA1, pomg.nA2))
        for a1 in range(pomg.nA1):
            for a2 in range(pomg.nA2):
                for s, bs in enumerate(belief):
                    r1 = pomg.R1[s, a1, a2]; r2 = pomg.R2[s, a1, a2]
                    if t < horizon:
                        # Continuation: sum over next states and obs
                        for sp, o1, o2, prob in pomg.transitions(s, a1, a2):
                            b_next = pomg.belief_update(belief, a1, a2, o1, o2)
                            v1, v2, _, _ = solve_node(b_next, t+1, hist1+[o1], hist2+[o2])
                            r1 += pomg.gamma * prob * v1
                            r2 += pomg.gamma * prob * v2
                    U1[a1, a2] += bs * r1
                    U2[a1, a2] += bs * r2
        pi1, pi2 = solve_nash_2x2(U1, U2)
        v1 = pi1 @ U1 @ pi2; v2 = pi1 @ U2 @ pi2
        return v1, v2, pi1, pi2
    return solve_node(b0, 0, [], [])
In backward induction for a finite-horizon POMG, what serves as the "payoff" for the stage game at step t<T?

Chapter 6: I-POMDPs

Instead of assuming agents reason to Nash equilibrium globally, Interactive POMDPs (I-POMDPs) (Gmytrasiewicz & Doshi, 2005) model agent i's planning as a POMDP where the state space includes the beliefs and intentions of other agents.

In an I-POMDP for agent i, the "augmented state" is (s, θ−i) where s is the physical state and θ−i is a model of each opponent j — specifically, j's type, including j's belief about the world and j's strategy.

bi(s, θ−i) = P(\text{world is s, opponent has type } θ−i)

Agent i maintains a belief over both the physical state and the opponent's type. When i observes (ai, oi), they update this joint belief using Bayes' rule. This is like a POMDP with an augmented state space S × Θ−i.

The level-k hierarchy in I-POMDPs: The type space Θj contains "models of j." But a model of j includes j's belief about i, which in turn contains i's model of j, … This nesting is infinite. I-POMDPs truncate at some level k: level-0 models have fixed strategies; level-k models are I-POMDPs where opponents are modeled as level-(k−1). Computationally, solving a level-k I-POMDP requires solving a level-(k−1) I-POMDP at each step.
Relationship to Nash: A Nash equilibrium of a POMG corresponds to a fixed point of the I-POMDP belief hierarchy: agent i's model of j is exactly j's actual I-POMDP policy, and vice versa. In practice, I-POMDPs don't require computing this fixed point; they allow agent i to model opponents with bounded rationality (finite level-k) and plan accordingly. This is often more realistic than assuming full Nash equilibrium rationality.
What does the "augmented state" in an I-POMDP include beyond the physical state s?

Chapter 7: SHOWCASE — Pursuit Game

A pursuer (orange) tries to catch an evader (teal) in a grid. Both have noisy sensors — they observe approximate positions, not exact ones. This is a zero-sum POMG: the pursuer wants to minimize time to catch; the evader wants to maximize it.

Partial observability changes the game: With full observability, the evader simply runs to the corner furthest from the pursuer (computable by minimax Markov game). With partial observability, the evader must also hide its location information and the pursuer must estimate where the evader is. The optimal strategies involve deception: the evader takes seemingly random actions to inject noise into the pursuer's belief.
SHOWCASE: Partially Observable Pursuit

Pursuer (orange) sees noisy position of evader. Evader (teal) sees noisy position of pursuer. Both use greedy best-response to their current belief. Drag noise slider and watch how tracking degrades.

Observation noise 0.30
Step 0 Distance: ?
What to observe: At low noise, the pursuer tracks accurately and converges. At high noise, the pursuer's belief spreads and they may go to the wrong location. The evader (randomly moving in this demo) benefits from the pursuer's uncertainty even without explicitly trying to mislead. In a true Nash equilibrium, the evader would exploit this by actively moving unpredictably.

Chapter 8: Approximations

Exact POMG solving is intractable for all but toy problems. The same approximation techniques from single-agent POMDPs (Ch 20–22) extend to the multiagent setting, often with reduced guarantees.

TechniqueSingle-agent (POMDP)Multi-agent (POMG)
Alpha vectorsExact (Ch 20)Per-agent alpha vectors; no convergence guarantee
Point-based (PBVI)Lower bound (Ch 21)Can be extended; Nash stage game at each belief point
Online MCTS (POMCP)Asymptotically optimal (Ch 22)MCTS with Nash rollouts; practical for zero-sum
Policy gradientConverges (Ch 11)Non-stationary; convergence requires conditions
Controllers (FSC)Finite memory (Ch 23)Per-agent FSC; optimize jointly or via gradient
The counterfactual regret (CFR) approach for zero-sum POMGs: For 2-player zero-sum POMGs (e.g., poker), CFR is the dominant practical method. CFR maintains regret for each decision node in the observation tree, updates strategies to minimize regret, and converges (time-average) to Nash equilibrium. DeepCFR and ReBeL extend this to large games using neural network approximations. This is state-of-the-art for poker AI.
Mean field approximation for large n: When the number of agents is large (n → ∞), the influence of any individual agent on the population becomes negligible. Mean field game theory approximates the n-agent POMG as a single-agent POMDP against a mean field (aggregate distribution) of opponents. This reduces exponential complexity to linear while maintaining good approximation guarantees for symmetric, large-population games.
Which approximation method dominates practical large-scale zero-sum POMGs like poker?

Chapter 9: Connections & Beyond

POMGs sit at the intersection of game theory and partially observable planning. The two key extensions from this chapter are Dec-POMDPs (Ch 27) for fully cooperative settings, and I-POMDPs for bounded-rationality opponent modeling.

ModelObservabilityUtilitiesComplexity
MDPFullSingle agentP
POMDPPartialSingle agentPSPACE
Markov gameFullMulti-agentPPAD (Nash)
POMGPartialMulti-agentPSPACE + PPAD
Dec-POMDPPartialShared rewardNEXP-complete
Why Dec-POMDP is harder than POMG (general-sum): Counterintuitively, the fully cooperative Dec-POMDP is NEXP-complete while the general-sum POMG has no known tighter bound than "PSPACE+PPAD." The reason: in a general-sum POMG, agents have independent objectives and can reason independently (each agent's problem reduces to a POMDP against a fixed opponent policy). In a Dec-POMDP, agents share a reward but can't communicate — they must implicitly coordinate, creating a combinatorial explosion in the joint policy space.
"In the long run, it may turn out that partial observability is the most important aspect of multiagent systems — more important than the number of agents, more important than the reward structure."
— Paraphrase, Oliehoek & Amato, A Concise Introduction to Decentralized POMDPs (2016)
Why is the cooperative Dec-POMDP (NEXP-complete) harder than the general-sum POMG?
← Ch 25: Sequential Problems Index Ch 27: Collaborative Agents →