When multiple agents can't see the full state, the game becomes a partially observable Markov game — combining the complexity of POMDPs with the strategic interaction of multiagent systems.
Two spies are communicating through coded messages. Neither can see the other's location. Neither knows if their messages are getting through. Each must decide when to act, when to wait — without ever observing the full situation.
This is what happens when you combine partial observability with multiagent interaction. In a single-agent POMDP (Ch 19–22), you maintain a belief over the world state and plan accordingly. When a second agent enters, they also have incomplete information — and your uncertainty is compounded by theirs.
A partially observable Markov game (POMG) is what you get when multiple agents with private observations share a Markov game. Each agent sees only their own observation, not the full state. The strategic calculation must account for both physical uncertainty (what's the state?) and social uncertainty (what does the other agent believe about the state?).
Compare the belief hierarchies across settings. Click each to see how the epistemic structure grows.
A partially observable Markov game (POMG) for n agents is (I, S, A1, …, An, T, R1, …, Rn, Ω1, …, Ωn, O, γ):
| Component | Notation | Meaning |
|---|---|---|
| Agents, states, actions | I, S, Ai | Same as Markov game |
| Transition | T(s'|s, a) | Next state given joint action |
| Rewards | Ri(s, a) | Agent i's reward |
| Observation spaces | Ωi | Private observations for each agent i |
| Observation model | O(o1, …, on|s', a) | Joint observation probability |
| Discount | γ | Time preference |
After each step, the world moves to s' ~ T(·|s,a). Each agent i receives a private observation oi ~ Oi(·|s',a). Agent i's policy maps their observation history to an action: πi : (oi,1, ai,1, …, oi,t) → Δ(Ai). Note: agent i does not see the observations of other agents.
When all agents share the same reward function (R1 = … = Rn = R), the POMG becomes a Dec-POMDP (Ch 27). No game theory needed — it's purely a planning problem. But it's NEXP-complete.
When R1 = −R2, the POMG is a two-player zero-sum game with partial observability. Minimax is the right solution concept. This includes most adversarial games (e.g., poker, hide-and-seek).
In a POMDP, there's one belief state b(s) = P(s | o1, a1, …, ot). In a POMG, each agent maintains their own private belief bi(s) = P(s | oi,1, ai,1, …, oi,t). Agent i conditions only on their own history, not the others'.
Agent i's belief update after taking action ai and receiving observation oi is a Bayesian update:
But wait — this update requires conditioning on the joint action a. Agent i knows their own action ai, but must model the other agents' actions a−i. Without knowing the others' policies, agent i must marginalize over their opponents' strategies to compute the update:
Two agents, each with private observations. Agent 1's belief (teal) and Agent 2's belief (orange) diverge over time even from the same true state. Click "Step" to advance.
Given a joint policy (π1, …, πn), how do we compute each agent's expected value? In a fully observable Markov game, this was a system of linear equations. In a POMG, policies depend on observation histories, which grow exponentially — so we need a different approach.
For finite-horizon POMGs (T steps), we can evaluate a joint policy by unrolling the observation history tree. The joint observation history at time t is ht = (a1, o1, …, at−1, ot) where each o is a joint observation. The number of nodes in this tree is |A|t|Ω|t — exponential in t.
The expectation is over the joint stochastic process defined by T, O, and the policy π. We compute it by summing over all possible observation histories, weighting by their probability under the joint policy.
python def evaluate_joint_policy(pomg, pi, b0, horizon): """Monte Carlo evaluation of joint policy pi in POMG. pomg: POMGame object (step, reward methods). pi: list of n policy functions pi[i](obs_history, action_history). b0: initial belief (distribution over states). Returns estimated value for each agent.""" n = pomg.n_agents total_values = [0.0] * n n_rollouts = 200 for _ in range(n_rollouts): s = pomg.sample_state(b0) # sample initial state obs_hist = [[] for _ in range(n)] act_hist = [[] for _ in range(n)] discount = 1.0 for t in range(horizon): # Each agent chooses action from their private history actions = [pi[i](obs_hist[i], act_hist[i]) for i in range(n)] # Step the environment sp, obs, rewards = pomg.step(s, actions) for i in range(n): total_values[i] += discount * rewards[i] obs_hist[i].append(obs[i]) act_hist[i].append(actions[i]) s = sp; discount *= pomg.gamma return [v / n_rollouts for v in total_values]
A joint policy (π1*, …, πn*) is a Nash equilibrium for a POMG if no agent can improve their expected value by unilaterally changing their behavioral policy, given the other agents' equilibrium policies:
Note: πi here is a behavioral policy — it maps each possible private history to an action distribution. This is a much richer strategy space than normal form strategies (single probability vector) or Markov policies (state-indexed).
Simplified 2-card poker: each player is dealt one card (High/Low) privately. Player 1 can Bet or Check. Drag to see the Nash equilibrium bluffing strategy — the optimal mix of betting with low cards to keep the opponent from exploiting pure strategies.
For finite-horizon POMGs, dynamic programming provides an exact backward induction algorithm. Starting from the last time step and working backwards, we compute the optimal behavioral strategies at each decision point.
Define the reachability of an observation history h under joint policy π and initial belief b0. The backward induction computes, at each observation node, the Nash equilibrium of the stage game with continuation values as payoffs.
python def backward_induction_pomg(pomg, b0, horizon): """Exact backward induction for finite-horizon 2-player POMG. Returns Nash equilibrium strategies and value.""" def solve_node(belief, t, hist1, hist2): # Stage game payoffs: integrate over states weighted by belief U1 = np.zeros((pomg.nA1, pomg.nA2)) U2 = np.zeros((pomg.nA1, pomg.nA2)) for a1 in range(pomg.nA1): for a2 in range(pomg.nA2): for s, bs in enumerate(belief): r1 = pomg.R1[s, a1, a2]; r2 = pomg.R2[s, a1, a2] if t < horizon: # Continuation: sum over next states and obs for sp, o1, o2, prob in pomg.transitions(s, a1, a2): b_next = pomg.belief_update(belief, a1, a2, o1, o2) v1, v2, _, _ = solve_node(b_next, t+1, hist1+[o1], hist2+[o2]) r1 += pomg.gamma * prob * v1 r2 += pomg.gamma * prob * v2 U1[a1, a2] += bs * r1 U2[a1, a2] += bs * r2 pi1, pi2 = solve_nash_2x2(U1, U2) v1 = pi1 @ U1 @ pi2; v2 = pi1 @ U2 @ pi2 return v1, v2, pi1, pi2 return solve_node(b0, 0, [], [])
Instead of assuming agents reason to Nash equilibrium globally, Interactive POMDPs (I-POMDPs) (Gmytrasiewicz & Doshi, 2005) model agent i's planning as a POMDP where the state space includes the beliefs and intentions of other agents.
In an I-POMDP for agent i, the "augmented state" is (s, θ−i) where s is the physical state and θ−i is a model of each opponent j — specifically, j's type, including j's belief about the world and j's strategy.
Agent i maintains a belief over both the physical state and the opponent's type. When i observes (ai, oi), they update this joint belief using Bayes' rule. This is like a POMDP with an augmented state space S × Θ−i.
A pursuer (orange) tries to catch an evader (teal) in a grid. Both have noisy sensors — they observe approximate positions, not exact ones. This is a zero-sum POMG: the pursuer wants to minimize time to catch; the evader wants to maximize it.
Pursuer (orange) sees noisy position of evader. Evader (teal) sees noisy position of pursuer. Both use greedy best-response to their current belief. Drag noise slider and watch how tracking degrades.
Exact POMG solving is intractable for all but toy problems. The same approximation techniques from single-agent POMDPs (Ch 20–22) extend to the multiagent setting, often with reduced guarantees.
| Technique | Single-agent (POMDP) | Multi-agent (POMG) |
|---|---|---|
| Alpha vectors | Exact (Ch 20) | Per-agent alpha vectors; no convergence guarantee |
| Point-based (PBVI) | Lower bound (Ch 21) | Can be extended; Nash stage game at each belief point |
| Online MCTS (POMCP) | Asymptotically optimal (Ch 22) | MCTS with Nash rollouts; practical for zero-sum |
| Policy gradient | Converges (Ch 11) | Non-stationary; convergence requires conditions |
| Controllers (FSC) | Finite memory (Ch 23) | Per-agent FSC; optimize jointly or via gradient |
POMGs sit at the intersection of game theory and partially observable planning. The two key extensions from this chapter are Dec-POMDPs (Ch 27) for fully cooperative settings, and I-POMDPs for bounded-rationality opponent modeling.
| Model | Observability | Utilities | Complexity |
|---|---|---|---|
| MDP | Full | Single agent | P |
| POMDP | Partial | Single agent | PSPACE |
| Markov game | Full | Multi-agent | PPAD (Nash) |
| POMG | Partial | Multi-agent | PSPACE + PPAD |
| Dec-POMDP | Partial | Shared reward | NEXP-complete |
"In the long run, it may turn out that partial observability is the most important aspect of multiagent systems — more important than the number of agents, more important than the reward structure."
— Paraphrase, Oliehoek & Amato, A Concise Introduction to Decentralized POMDPs (2016)