Ch 26: State Uncertainty — Algorithms for Decision Making

Chapter 0: Blind Players

Two spies are communicating through coded messages. Neither can see the other's location. Neither knows if their messages are getting through. Each must decide when to act, when to wait — without ever observing the full situation.

This is what happens when you combine partial observability with multiagent interaction. In a single-agent POMDP (Ch 19–22), you maintain a belief over the world state and plan accordingly. When a second agent enters, they also have incomplete information — and your uncertainty is compounded by theirs.

A partially observable Markov game (POMG) is what you get when multiple agents with private observations share a Markov game. Each agent sees only their own observation, not the full state. The strategic calculation must account for both physical uncertainty (what's the state?) and social uncertainty (what does the other agent believe about the state?).

Two kinds of uncertainty stack: In a POMDP, uncertainty is about the world. In a POMG, uncertainty is about the world AND about what other agents know about the world AND about what other agents know you know. This nesting can be infinite. Managing it tractably is the central challenge of this chapter.

Uncertainty Stacking Visualization

Compare the belief hierarchies across settings. Click each to see how the epistemic structure grows.

Click a model to see its epistemic structure.

Why POMGs are strictly harder than either POMDPs or Markov games: A POMDP has a belief state b ∈ Δ(S). You can plan over beliefs. A Markov game requires Nash equilibrium over policies. A POMG requires Nash equilibrium over policies over beliefs. The belief state itself depends on the agents' observation histories, which in turn depend on the other agents' actions. Everything is coupled.

What makes a partially observable Markov game strictly harder than a Markov game?

Each agent must reason about the world state and about what other agents know — beliefs about beliefs compound the uncertainty POMGs have larger action spaces POMGs require continuous state spaces

Chapter 1: POMG Formalism

A partially observable Markov game (POMG) for n agents is (I, S, A₁, …, A_n, T, R₁, …, R_n, Ω₁, …, Ω_n, O, γ):

Component	Notation	Meaning
Agents, states, actions	I, S, A_i	Same as Markov game
Transition	T(s'\|s, a)	Next state given joint action
Rewards	R_i(s, a)	Agent i's reward
Observation spaces	Ω_i	Private observations for each agent i
Observation model	O(o₁, …, o_n\|s', a)	Joint observation probability
Discount	γ	Time preference

After each step, the world moves to s' ~ T(·|s,a). Each agent i receives a private observation o_i ~ O_i(·|s',a). Agent i's policy maps their observation history to an action: π_i : (o_i,1, a_i,1, …, o_i,t) → Δ(A_i). Note: agent i does not see the observations of other agents.

Observations can be conditionally independent: Often O(o₁, …, o_n|s', a) = ∏_i O_i(o_i|s', a). Each agent's noisy sensor independently reads the world state. This factored observation model is almost universal in practice — it avoids modeling correlations between agents' sensors. The joint observation model is needed in theory to define belief updates; conditional independence simplifies computation.

Special Case: Dec-POMDP

When all agents share the same reward function (R₁ = … = R_n = R), the POMG becomes a Dec-POMDP (Ch 27). No game theory needed — it's purely a planning problem. But it's NEXP-complete.

Special Case: Zero-Sum POMG

When R₁ = −R₂, the POMG is a two-player zero-sum game with partial observability. Minimax is the right solution concept. This includes most adversarial games (e.g., poker, hide-and-seek).

In a POMG, what does agent i's policy map from and to?

From the private observation history (o_i,1, a_i,1, …, o_i,t) to a distribution over actions From the full joint observation (o₁, …, o_n) to an action From the world state s to an action distribution

Chapter 2: Private Beliefs

In a POMDP, there's one belief state b(s) = P(s | o₁, a₁, …, o_t). In a POMG, each agent maintains their own private belief b_i(s) = P(s | o_i,1, a_i,1, …, o_i,t). Agent i conditions only on their own history, not the others'.

Agent i's belief update after taking action a_i and receiving observation o_i is a Bayesian update:

b_i'(s') ∝ O_i(o_i|s', a) ∑_s T(s'|s, a) b_i(s)

But wait — this update requires conditioning on the joint action a. Agent i knows their own action a_i, but must model the other agents' actions a_−i. Without knowing the others' policies, agent i must marginalize over their opponents' strategies to compute the update:

b_i'(s') ∝ ∑_{a_−i} π_−i(a_−i|·) ċ O_i(o_i|s', (a_i, a_−i)) ∑_s T(s'|s, (a_i, a_−i)) b_i(s)

The belief depends on the opponent's policy: Agent i's belief update requires knowing π_−i — the other agents' strategies. But π_−i is what we're trying to find (it's part of the equilibrium). This circularity is the core difficulty: the belief and the equilibrium strategy are mutually dependent. Solution: assume a model of opponents' policies and solve for self-consistency.

Private Belief Update Comparison

Two agents, each with private observations. Agent 1's belief (teal) and Agent 2's belief (orange) diverge over time even from the same true state. Click "Step" to advance.

t=0. True state: S1.

Common knowledge assumption: Many POMG algorithms assume agents share a common prior b₀ and common knowledge of the game (T, R, O). With a common prior and known policies, each agent can compute what the other's belief must be — even without directly observing their observations. This is the basis for interactive belief modeling (I-POMDPs).

Why does agent i's private belief update depend on the other agents' policies?

Because the transition and observation probabilities condition on the joint action, so agent i must marginalize over the other agents' action distributions Because agent i can observe the other agents' observations Because the state space includes the other agents' beliefs

Chapter 3: Policy Evaluation

Given a joint policy (π₁, …, π_n), how do we compute each agent's expected value? In a fully observable Markov game, this was a system of linear equations. In a POMG, policies depend on observation histories, which grow exponentially — so we need a different approach.

For finite-horizon POMGs (T steps), we can evaluate a joint policy by unrolling the observation history tree. The joint observation history at time t is h_t = (a₁, o₁, …, a_t−1, o_t) where each o is a joint observation. The number of nodes in this tree is |A|^t|Ω|^t — exponential in t.

V_i^π(b₀) = E_π[∑_t=0^T γ^t R_i(s_t, a_t) | b₀]

The expectation is over the joint stochastic process defined by T, O, and the policy π. We compute it by summing over all possible observation histories, weighting by their probability under the joint policy.

The tree is exponentially large: At each step, the observation history branches by |Ω₁| × |Ω₂| × … × |Ω_n|. After T steps, the number of leaf nodes is (|A|·|Ω|)^T. For even modest problems (|A|=4, |Ω|=4, T=10), this is 4¹⁰ = 10⁶ nodes — and this is just for policy evaluation, not optimization. This is why POMG is strictly harder than POMDP (already PSPACE-complete) and harder than Markov games.

python
def evaluate_joint_policy(pomg, pi, b0, horizon):
    """Monte Carlo evaluation of joint policy pi in POMG.
    pomg: POMGame object (step, reward methods).
    pi: list of n policy functions pi[i](obs_history, action_history).
    b0: initial belief (distribution over states).
    Returns estimated value for each agent."""
    n = pomg.n_agents
    total_values = [0.0] * n
    n_rollouts = 200

    for _ in range(n_rollouts):
        s = pomg.sample_state(b0)  # sample initial state
        obs_hist = [[] for _ in range(n)]
        act_hist = [[] for _ in range(n)]
        discount = 1.0

        for t in range(horizon):
            # Each agent chooses action from their private history
            actions = [pi[i](obs_hist[i], act_hist[i]) for i in range(n)]
            # Step the environment
            sp, obs, rewards = pomg.step(s, actions)
            for i in range(n):
                total_values[i] += discount * rewards[i]
                obs_hist[i].append(obs[i])
                act_hist[i].append(actions[i])
            s = sp; discount *= pomg.gamma

    return [v / n_rollouts for v in total_values]

Why does finite-horizon POMG policy evaluation have complexity exponential in the horizon T?

The number of possible joint observation histories grows as (|A|·|Ω|)^T — the history tree branches at each step by all possible action-observation pairs The state space grows exponentially with the number of agents Beliefs become continuous distributions that can't be represented finitely

Chapter 4: Nash Equilibrium for POMGs

A joint policy (π₁*, …, π_n*) is a Nash equilibrium for a POMG if no agent can improve their expected value by unilaterally changing their behavioral policy, given the other agents' equilibrium policies:

V_i^π*(b₀) ≥ V_i^{(π_i, π_−i*)}(b₀) ∀ i, π_i

Note: π_i here is a behavioral policy — it maps each possible private history to an action distribution. This is a much richer strategy space than normal form strategies (single probability vector) or Markov policies (state-indexed).

Behavioral vs. mixed strategies in extensive form games: In game theory, a POMG can be written as an extensive form game. Kuhn's theorem says that in games with perfect recall, behavioral strategies (randomize at each information set) are equivalent to mixed strategies (randomize over deterministic plans). This means finding a behavioral Nash is equivalent to finding a mixed Nash — but the strategy space is exponentially larger.

Finite games have Nash equilibria: Since POMGs have finite agents, finite actions, and observation histories form a finite tree (for finite horizon), the induced extensive form game is finite. Nash's theorem guarantees at least one Nash equilibrium in behavioral strategies. For infinite horizon, the strategy space is infinite-dimensional and existence requires more care (functional analysis).

Poker: A Zero-Sum POMG

Simplified 2-card poker: each player is dealt one card (High/Low) privately. Player 1 can Bet or Check. Drag to see the Nash equilibrium bluffing strategy — the optimal mix of betting with low cards to keep the opponent from exploiting pure strategies.

P(Player 1 bluffs with Low card) 0.33

What does "behavioral policy" mean in the context of POMG Nash equilibrium?

A policy that maps each private observation history to a distribution over actions (randomizes at each decision point based on observed history) A policy that adapts to the opponents' behavior by observing their actions A policy that mixes randomly over all deterministic policies ahead of time

Chapter 5: Dynamic Programming

For finite-horizon POMGs, dynamic programming provides an exact backward induction algorithm. Starting from the last time step and working backwards, we compute the optimal behavioral strategies at each decision point.

Define the reachability of an observation history h under joint policy π and initial belief b₀. The backward induction computes, at each observation node, the Nash equilibrium of the stage game with continuation values as payoffs.

The backward induction tree: At the last step T, each agent faces a one-shot normal form game (the stage game payoff is just R_i, no continuation). We solve for Nash. At step T−1, the continuation value from step T becomes part of the payoff. We fold these values back and solve the stage game at T−1. Repeat. This is exactly like backward induction in extensive form game theory.

Base case (t=T)

At each leaf of the observation tree, solve the stage game with payoffs R_i(s,a) weighted by belief. Get Nash strategies (π*_T) and Nash values V^T_i.

↑ backward induction

Step (t<T)

Payoff = R_i + γ ∑ T O V^t+1_i. Solve stage game. Get (π*_t, V^t_i).

↑ backward induction

Base case (t=0)

V⁰_i(b₀) is the POMG value. π* is the Nash equilibrium policy for each agent.

Why the tree is exponentially large: At each time step, observations branch. After t steps, the number of distinct observation histories is |Ω₁|^t × |Ω₂|^t. The backward induction tree has exponentially many nodes. For t=5, |Ω|=4, 2 agents: 4¹⁰ ≈ 10⁶ nodes. For horizon 20: intractable without approximation. This is why approximations (alpha vectors, MCTS, PBVI) from Chs 20–22 are needed here too.

python
def backward_induction_pomg(pomg, b0, horizon):
    """Exact backward induction for finite-horizon 2-player POMG.
    Returns Nash equilibrium strategies and value."""
    def solve_node(belief, t, hist1, hist2):
        # Stage game payoffs: integrate over states weighted by belief
        U1 = np.zeros((pomg.nA1, pomg.nA2))
        U2 = np.zeros((pomg.nA1, pomg.nA2))
        for a1 in range(pomg.nA1):
            for a2 in range(pomg.nA2):
                for s, bs in enumerate(belief):
                    r1 = pomg.R1[s, a1, a2]; r2 = pomg.R2[s, a1, a2]
                    if t < horizon:
                        # Continuation: sum over next states and obs
                        for sp, o1, o2, prob in pomg.transitions(s, a1, a2):
                            b_next = pomg.belief_update(belief, a1, a2, o1, o2)
                            v1, v2, _, _ = solve_node(b_next, t+1, hist1+[o1], hist2+[o2])
                            r1 += pomg.gamma * prob * v1
                            r2 += pomg.gamma * prob * v2
                    U1[a1, a2] += bs * r1
                    U2[a1, a2] += bs * r2
        pi1, pi2 = solve_nash_2x2(U1, U2)
        v1 = pi1 @ U1 @ pi2; v2 = pi1 @ U2 @ pi2
        return v1, v2, pi1, pi2
    return solve_node(b0, 0, [], [])

In backward induction for a finite-horizon POMG, what serves as the "payoff" for the stage game at step t<T?

The immediate reward R_i plus discounted expected future value from the Nash equilibrium of subsequent steps Just the immediate reward R_i(s, a) The Nash equilibrium value of the normal form game induced at the final step T

Chapter 6: I-POMDPs

Instead of assuming agents reason to Nash equilibrium globally, Interactive POMDPs (I-POMDPs) (Gmytrasiewicz & Doshi, 2005) model agent i's planning as a POMDP where the state space includes the beliefs and intentions of other agents.

In an I-POMDP for agent i, the "augmented state" is (s, θ_−i) where s is the physical state and θ_−i is a model of each opponent j — specifically, j's type, including j's belief about the world and j's strategy.

b_i(s, θ_−i) = P(\text{world is s, opponent has type } θ_−i)

Agent i maintains a belief over both the physical state and the opponent's type. When i observes (a_i, o_i), they update this joint belief using Bayes' rule. This is like a POMDP with an augmented state space S × Θ_−i.

The level-k hierarchy in I-POMDPs: The type space Θ_j contains "models of j." But a model of j includes j's belief about i, which in turn contains i's model of j, … This nesting is infinite. I-POMDPs truncate at some level k: level-0 models have fixed strategies; level-k models are I-POMDPs where opponents are modeled as level-(k−1). Computationally, solving a level-k I-POMDP requires solving a level-(k−1) I-POMDP at each step.

Relationship to Nash: A Nash equilibrium of a POMG corresponds to a fixed point of the I-POMDP belief hierarchy: agent i's model of j is exactly j's actual I-POMDP policy, and vice versa. In practice, I-POMDPs don't require computing this fixed point; they allow agent i to model opponents with bounded rationality (finite level-k) and plan accordingly. This is often more realistic than assuming full Nash equilibrium rationality.

What does the "augmented state" in an I-POMDP include beyond the physical state s?

A model of each opponent's type (their beliefs, strategy, and level of reasoning) The joint observation history of all agents The Nash equilibrium strategy profile for the current state

Chapter 7: SHOWCASE — Pursuit Game

A pursuer (orange) tries to catch an evader (teal) in a grid. Both have noisy sensors — they observe approximate positions, not exact ones. This is a zero-sum POMG: the pursuer wants to minimize time to catch; the evader wants to maximize it.

Partial observability changes the game: With full observability, the evader simply runs to the corner furthest from the pursuer (computable by minimax Markov game). With partial observability, the evader must also hide its location information and the pursuer must estimate where the evader is. The optimal strategies involve deception: the evader takes seemingly random actions to inject noise into the pursuer's belief.

SHOWCASE: Partially Observable Pursuit

Pursuer (orange) sees noisy position of evader. Evader (teal) sees noisy position of pursuer. Both use greedy best-response to their current belief. Drag noise slider and watch how tracking degrades.

Observation noise 0.30

Step 0 Distance: ?

What to observe: At low noise, the pursuer tracks accurately and converges. At high noise, the pursuer's belief spreads and they may go to the wrong location. The evader (randomly moving in this demo) benefits from the pursuer's uncertainty even without explicitly trying to mislead. In a true Nash equilibrium, the evader would exploit this by actively moving unpredictably.

Chapter 8: Approximations

Exact POMG solving is intractable for all but toy problems. The same approximation techniques from single-agent POMDPs (Ch 20–22) extend to the multiagent setting, often with reduced guarantees.

Technique	Single-agent (POMDP)	Multi-agent (POMG)
Alpha vectors	Exact (Ch 20)	Per-agent alpha vectors; no convergence guarantee
Point-based (PBVI)	Lower bound (Ch 21)	Can be extended; Nash stage game at each belief point
Online MCTS (POMCP)	Asymptotically optimal (Ch 22)	MCTS with Nash rollouts; practical for zero-sum
Policy gradient	Converges (Ch 11)	Non-stationary; convergence requires conditions
Controllers (FSC)	Finite memory (Ch 23)	Per-agent FSC; optimize jointly or via gradient

The counterfactual regret (CFR) approach for zero-sum POMGs: For 2-player zero-sum POMGs (e.g., poker), CFR is the dominant practical method. CFR maintains regret for each decision node in the observation tree, updates strategies to minimize regret, and converges (time-average) to Nash equilibrium. DeepCFR and ReBeL extend this to large games using neural network approximations. This is state-of-the-art for poker AI.

Mean field approximation for large n: When the number of agents is large (n → ∞), the influence of any individual agent on the population becomes negligible. Mean field game theory approximates the n-agent POMG as a single-agent POMDP against a mean field (aggregate distribution) of opponents. This reduces exponential complexity to linear while maintaining good approximation guarantees for symmetric, large-population games.

Which approximation method dominates practical large-scale zero-sum POMGs like poker?

Counterfactual Regret (CFR) — minimizes regret at each decision node in the observation tree, converging (time-average) to Nash PBVI — point-based value iteration over the belief simplex Nash Q-learning from Ch 25

Chapter 9: Connections & Beyond

POMGs sit at the intersection of game theory and partially observable planning. The two key extensions from this chapter are Dec-POMDPs (Ch 27) for fully cooperative settings, and I-POMDPs for bounded-rationality opponent modeling.

Model	Observability	Utilities	Complexity
MDP	Full	Single agent	P
POMDP	Partial	Single agent	PSPACE
Markov game	Full	Multi-agent	PPAD (Nash)
POMG	Partial	Multi-agent	PSPACE + PPAD
Dec-POMDP	Partial	Shared reward	NEXP-complete

Why Dec-POMDP is harder than POMG (general-sum): Counterintuitively, the fully cooperative Dec-POMDP is NEXP-complete while the general-sum POMG has no known tighter bound than "PSPACE+PPAD." The reason: in a general-sum POMG, agents have independent objectives and can reason independently (each agent's problem reduces to a POMDP against a fixed opponent policy). In a Dec-POMDP, agents share a reward but can't communicate — they must implicitly coordinate, creating a combinatorial explosion in the joint policy space.

"In the long run, it may turn out that partial observability is the most important aspect of multiagent systems — more important than the number of agents, more important than the reward structure."
— Paraphrase, Oliehoek & Amato, A Concise Introduction to Decentralized POMDPs (2016)

Why is the cooperative Dec-POMDP (NEXP-complete) harder than the general-sum POMG?

In a Dec-POMDP, agents must implicitly coordinate to maximize shared reward without communication, creating a combinatorial joint-policy explosion; in a general-sum POMG, each agent has independent objectives and can reason separately Dec-POMDPs have larger state spaces General-sum POMGs allow agents to communicate

← Ch 25: Sequential Problems Index Ch 27: Collaborative Agents →