When agents share a goal but can't see each other's observations — no communication, no shared view, just private sensors and a common mission — coordination becomes the hardest problem in multiagent systems.
Two firefighters enter a burning building from different sides. They share one goal: rescue survivors. But they can't talk — radio silence to avoid tipping off structural instabilities. Each only hears what's in their own section. Neither knows what the other has found, or where they are.
How should they plan? If Firefighter 1 goes left, should Firefighter 2 also go left (to assist) or right (to cover more ground)? The answer depends on what the other found — which neither can know.
This is the decentralized POMDP (Dec-POMDP) problem: multiple agents, shared reward, partial observations, no communication. It's cooperative — there's no conflict, only uncertainty. And yet it's extraordinarily hard: NEXP-complete, strictly harder than the single-agent POMDP (PSPACE).
Two agents, four rooms. A target can appear in any room. With communication, they can optimally divide. Without communication, they might both go to the same room. Click "Coordinate" vs "Independent" to see the difference in coverage.
A decentralized POMDP (Dec-POMDP) (Bernstein et al., 2002) for n agents is (I, S, A1, …, An, T, R, Ω1, …, Ωn, O, b0):
| Component | Notation | Key difference from POMG |
|---|---|---|
| Shared reward | R(s, a) | Single reward function, not per-agent Ri |
| Observations | Ωi, O(o1…on|s',a) | Private — agent i only sees oi |
| No communication | — | No message-passing channel between agents |
| Initial belief | b0(s) | Common prior over initial state |
Agents aim to maximize the joint expected reward ∑t γt R(st, at), but each agent i chooses ai based only on their private observation history (oi,1, ai,1, …, oi,t).
A local policy for agent i is πi : historyi → Ai. An optimal joint policy (π1*, …, πn*) maximizes Vπ(b0) = Eπ[∑t γt Rt | b0].
Dec-POMDPs are NEXP-complete in general. Several subclasses trade generality for tractability. Understanding the subclass structure shows which assumptions buy which computational savings.
A Dec-POMDP where the joint state is fully observable: given the joint observation (o1, …, on), the state s is fully determined. Each agent has a private local observation oi = fi(s) for some deterministic function fi. This is not the same as full observability — agents still don't see each other's observations.
Complexity: NP-complete (easier than Dec-POMDP). Solvable by dynamic programming on the joint observation tree.
A multiagent MDP: fully observable state, multiple agents, shared reward. No observation model needed. Each agent sees the full state s. Optimal joint policy = correlated policy over joint actions (a1, …, an) per state. Since there's no private information, this reduces to solving a single centralized MDP over the joint action space A1 × … × An.
Complexity: P (just an MDP with larger action space).
| Model | State obs. | Other's obs. | Shared reward | Complexity |
|---|---|---|---|---|
| MMDP | Full | Full | Yes | P (MDP) |
| Dec-MDP | Full (jointly) | Private | Yes | NP |
| Dec-POMDP | Partial | Private | Yes | NEXP |
| POMG | Partial | Private | No (per-agent) | PSPACE+PPAD |
| POMDP | Partial | N/A (1 agent) | N/A | PSPACE |
For finite-horizon Dec-POMDPs, backward induction gives the optimal joint policy. The key representation: at each time step, each agent's policy is a policy tree — a decision tree that maps the sequence of observations (oi,1, …, oi,t) to an action at each step.
A policy tree of depth T for agent i with action set Ai and observation set Ωi has |Ai| × |Ωi|T−1 leaves. The space of all policy trees is exponential in T. But we only need to consider the Pareto front of policy trees — those not dominated for any initial state.
Exact dynamic programming is intractable for all but small problems. The first practical approximation: iterated best response (IBR) for Dec-POMDPs. Fix all other agents' policies, then optimize the current agent's policy against those fixed policies. Repeat until convergence.
Crucially: with all other agents' policies fixed, agent i's problem reduces to a POMDP. The joint policy of other agents defines a stochastic process over states, and agent i can compute the optimal POMDP policy against this process using any POMDP solver (value iteration, PBVI, POMCP, …).
python def dec_pomdp_ibr(decpomdp, init_policies, max_iter=20, pomdp_solver='pbvi'): """Iterated Best Response for Dec-POMDP. init_policies: list of n initial policy FSCs. Returns locally optimal joint policy.""" policies = [p.copy() for p in init_policies] n = decpomdp.n_agents for iteration in range(max_iter): improved = False for i in range(n): # Marginalize: fix all other agents' policies induced_pomdp = decpomdp.marginalize( agent=i, other_policies=policies[:i] + policies[i+1:] ) # Solve the induced POMDP for agent i new_pi = solve_pomdp(induced_pomdp, method=pomdp_solver) # Check if value improved old_val = decpomdp.evaluate(policies) policies[i] = new_pi new_val = decpomdp.evaluate(policies) if new_val > old_val + 1e-6: improved = True if not improved: break # Converged to local optimum return policies
Heuristic search for Dec-POMDPs extends MCTS and point-based methods from single-agent POMDPs to the joint policy space. The key insight: instead of searching over all policy trees, we search over the space of joint policies guided by an upper bound on the optimal value.
JESP (Joint Equilibrium-based Search for Policies) is an IBR variant with a more careful search: instead of solving full POMDP at each step, it searches the space of finite-depth policy trees using best-first search. MADDPG-style methods use centralized critics to guide decentralized actors.
Joint Equilibrium-based Search for Policies. Like IBR but: (1) Evaluates the full policy tree not a POMDP. (2) Uses branch-and-bound pruning to skip dominated subtrees. (3) Iterates until no agent can improve. Finds locally optimal joint policy trees efficiently for horizon T ≤ 10.
Sample joint observation histories via rollouts. Back-propagate average rewards. At each joint history node, choose joint actions to maximize UCB1 score. Works online (plan at each step rather than ahead of time). Approximate but scalable.
For finite-horizon Dec-POMDPs with small action and observation sets, nonlinear programming (NLP) can find locally optimal joint policies by treating the policy parameters as continuous variables and using gradient methods or general-purpose NLP solvers.
Represent each agent's policy as a stochastic policy tree: at each observation history node, a probability distribution over actions. Stack all these distributions into a parameter vector θ ∈ [0,1]d. The joint value V(θ; b0) is then a function of θ — specifically, a multilinear polynomial in the policy parameters.
Since the objective is multilinear (linear in each agent's policy parameters separately, but not jointly), it's non-convex. NLP finds local optima. Multiple random restarts are used to improve solution quality.
python import numpy as np from scipy.optimize import minimize def dec_pomdp_nlp(decpomdp, T, n_restarts=10): """Nonlinear programming for finite-horizon Dec-POMDP. T: horizon. Returns best found joint policy.""" n_params = decpomdp.policy_tree_size(T) # dim of theta vector best_val = -np.inf; best_theta = None for _ in range(n_restarts): theta0 = np.random.dirichlet(np.ones(decpomdp.nA1), size=n_params//decpomdp.nA1).flatten() def neg_value(theta): # Reconstruct policy trees from theta, evaluate pi = decpomdp.theta_to_policy(theta, T) return -decpomdp.evaluate(pi) def neg_grad(theta): # Gradient via policy gradient theorem pi = decpomdp.theta_to_policy(theta, T) return -decpomdp.policy_gradient(pi) result = minimize(neg_value, theta0, jac=neg_grad, method='L-BFGS-B', bounds=[(0,1)]*len(theta0)) if -result.fun > best_val: best_val = -result.fun; best_theta = result.x return decpomdp.theta_to_policy(best_theta, T), best_val
Two firefighters search a 4×4 building for a survivor. The survivor is in one of the rooms (unknown). Each firefighter hears noisy signals (crying, heat) from their area. No radio contact. They must coordinate purely through their shared protocol (the joint policy).
Firefighter 1 (teal) starts top-left, Firefighter 2 (orange) starts top-right. The survivor (gold star) is in a random room. Watch how they cover the building with vs. without a coordinated plan. Drag noise to control sensor quality.
Bernstein et al. (2002) proved that Dec-POMDPs are NEXP-complete for the finite-horizon case. This is a doubly-exponential lower bound: algorithms must take time at least 22n in the worst case (for some parameterization). Understanding where this complexity comes from helps identify tractable subclasses.
The centralized POMDP (agents can communicate perfectly) gives an upper bound: VPOMDP*(b0) ≥ VDec-POMDP*(b0). Any POMDP solver gives an upper bound on the achievable Dec-POMDP value.
Tighter: the centralized MDP (full observability + communication) gives VMDP* ≥ VPOMDP* ≥ VDec-POMDP*.
Any feasible joint policy gives a lower bound: evaluate a random or heuristic joint policy, and that value is achievable (the optimal must be at least as good). IBR and NLP give local-optimum lower bounds. The gap between upper and lower bounds measures solution quality.
| Method | Bound type | Quality | Cost |
|---|---|---|---|
| Centralized MDP value | Upper | Loose | O(|S||A|n) |
| Centralized POMDP value | Upper | Tight | PSPACE (exp) |
| Random joint policy | Lower | Loose | O(1) |
| IBR | Lower | Local opt. | Poly in policy size |
| Exact DP | Optimal | Exact | NEXP |
Dec-POMDPs are the hardest formalism in this book (NEXP-complete). But they capture the essential challenge of real multi-robot systems: shared goals, private sensors, no communication. Understanding the exact formalism tells you how much you're approximating in practice.
| Relaxation | What you gain | Model |
|---|---|---|
| Allow perfect communication | Reduces to POMDP (PSPACE) | Centralized POMDP |
| Assume full observability | Reduces to MMDP (P) | Cooperative Markov game |
| Transition-independence | Decomposable into n POMDPs | Factored Dec-POMDP |
| Finite state controllers | Infinite-horizon tractable approximation | Dec-FSC (NLP) |
| Local interactions | Graphical model structure | ND-POMDP (networked) |
"The decentralized POMDP captures what is arguably the most fundamental problem in multi-agent systems: how should a team of agents with private information act to achieve a common goal? The difficulty of the answer — NEXP-complete — tells us that this problem is not going to be solved by cleverness alone."
— Paraphrase, Bernstein, Givan, Immerman, Zilberstein, "The Complexity of Decentralized Control of Markov Decision Processes" (2002)