Kochenderfer, Wheeler & Wray, Chapter 27

Collaborative Agents

When agents share a goal but can't see each other's observations — no communication, no shared view, just private sensors and a common mission — coordination becomes the hardest problem in multiagent systems.

Prerequisites: POMDPs (Ch 19–22) + Ch 24–26 multiagent basics. Beliefs and planning must be familiar.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: The Coordination Problem

Two firefighters enter a burning building from different sides. They share one goal: rescue survivors. But they can't talk — radio silence to avoid tipping off structural instabilities. Each only hears what's in their own section. Neither knows what the other has found, or where they are.

How should they plan? If Firefighter 1 goes left, should Firefighter 2 also go left (to assist) or right (to cover more ground)? The answer depends on what the other found — which neither can know.

This is the decentralized POMDP (Dec-POMDP) problem: multiple agents, shared reward, partial observations, no communication. It's cooperative — there's no conflict, only uncertainty. And yet it's extraordinarily hard: NEXP-complete, strictly harder than the single-agent POMDP (PSPACE).

Why is cooperation under uncertainty harder than competition? In a competitive game (Ch 24–26), each agent optimizes for themselves. Independence is a feature — each agent's POMDP is well-defined given a model of opponents. In Dec-POMDPs, agents share a reward but can't see each other's information. The optimal joint policy cannot be decomposed into independent agent policies — they must be implicitly coordinated by the joint plan, even though each agent acts on private observations.
The Coordination Gap

Two agents, four rooms. A target can appear in any room. With communication, they can optimally divide. Without communication, they might both go to the same room. Click "Coordinate" vs "Independent" to see the difference in coverage.

Click a policy type to compare coverage.
The price of decentralization: With perfect communication, two agents achieve the same value as a centralized POMDP. With zero communication (Dec-POMDP), the best achievable value can be strictly lower. The gap is the cost of decentralization. In the Tiger problem (a standard POMDP benchmark), the Dec-POMDP version has significantly lower optimal value because agents can't coordinate their listening decisions.
Why is the Dec-POMDP problem NEXP-complete while a single-agent POMDP is only PSPACE?

Chapter 1: Dec-POMDP Formalism

A decentralized POMDP (Dec-POMDP) (Bernstein et al., 2002) for n agents is (I, S, A1, …, An, T, R, Ω1, …, Ωn, O, b0):

ComponentNotationKey difference from POMG
Shared rewardR(s, a)Single reward function, not per-agent Ri
ObservationsΩi, O(o1…on|s',a)Private — agent i only sees oi
No communicationNo message-passing channel between agents
Initial beliefb0(s)Common prior over initial state

Agents aim to maximize the joint expected reward ∑t γt R(st, at), but each agent i chooses ai based only on their private observation history (oi,1, ai,1, …, oi,t).

A local policy for agent i is πi : historyi → Ai. An optimal joint policy1*, …, πn*) maximizes Vπ(b0) = Eπ[∑t γt Rt | b0].

Why there's no "belief state" for Dec-POMDPs: In a single-agent POMDP, the belief b(s) is a sufficient statistic for the history. In a Dec-POMDP, the relevant history for the optimal joint decision is (o1,1:t, o2,1:t, …) — all agents' observations. But no single agent has this full history. The "joint belief" is a sufficient statistic but isn't available to any agent for policy computation.
Policy representation: local histories vs. finite state controllers: A local policy can be represented as a decision tree over observation histories (exponential in horizon T) or as a finite state controller (FSC, Ch 23). For Dec-POMDPs, each agent has their own FSC. The joint Dec-POMDP policy is a pair of FSCs, one per agent, with no shared state. This is the key representation for infinite-horizon Dec-POMDPs.
What makes the Dec-POMDP reward structure fundamentally different from a general POMG?

Chapter 2: Subclasses

Dec-POMDPs are NEXP-complete in general. Several subclasses trade generality for tractability. Understanding the subclass structure shows which assumptions buy which computational savings.

Dec-MDP

A Dec-POMDP where the joint state is fully observable: given the joint observation (o1, …, on), the state s is fully determined. Each agent has a private local observation oi = fi(s) for some deterministic function fi. This is not the same as full observability — agents still don't see each other's observations.

Complexity: NP-complete (easier than Dec-POMDP). Solvable by dynamic programming on the joint observation tree.

MMDP

A multiagent MDP: fully observable state, multiple agents, shared reward. No observation model needed. Each agent sees the full state s. Optimal joint policy = correlated policy over joint actions (a1, …, an) per state. Since there's no private information, this reduces to solving a single centralized MDP over the joint action space A1 × … × An.

Complexity: P (just an MDP with larger action space).

ModelState obs.Other's obs.Shared rewardComplexity
MMDPFullFullYesP (MDP)
Dec-MDPFull (jointly)PrivateYesNP
Dec-POMDPPartialPrivateYesNEXP
POMGPartialPrivateNo (per-agent)PSPACE+PPAD
POMDPPartialN/A (1 agent)N/APSPACE
Transition-independent Dec-POMDPs: If the transition factors as T(s'|s,a) = ∏i Ti(si'|si,ai) (each agent affects only their local state), then the problem decomposes into n independent POMDPs coupled only through the shared reward. This structure can be exploited for dramatically more efficient planning. Real-world multi-robot systems often have (approximately) transition-independent dynamics.
An MMDP (multiagent MDP) is equivalent to what kind of single-agent problem?

Chapter 3: Dynamic Programming

For finite-horizon Dec-POMDPs, backward induction gives the optimal joint policy. The key representation: at each time step, each agent's policy is a policy tree — a decision tree that maps the sequence of observations (oi,1, …, oi,t) to an action at each step.

A policy tree of depth T for agent i with action set Ai and observation set Ωi has |Ai| × |Ωi|T−1 leaves. The space of all policy trees is exponential in T. But we only need to consider the Pareto front of policy trees — those not dominated for any initial state.

The joint policy tree and pruning: Given the set of policy trees Q1 for agent 1 and Q2 for agent 2, the joint policy is (π1, π2) ∈ Q1 × Q2. We evaluate each joint policy pair's value for initial belief b0, then keep only the optimal pair. But |Q1| = |A1|1|T — doubly exponential in T. Even evaluating all pairs is NEXP-hard, confirming the complexity lower bound.
Step 1: Generate
Enumerate all single-step policy trees for each agent: one per action ai, branching on observations. At depth-1, |Ai| trees per agent.
Step 2: Value
For each joint pair (π1, π2), compute V(π1, π2, b0) = expected reward over all states weighted by b0.
Step 3: Prune
Remove dominated policy trees. For agent 1: π1 is dominated if ∀ π2, V(π1', π2) ≥ V(π1, π2) for some π1'.
Step 4: Extend
For depth-2: prepend an action and branch on observations. Use pruned depth-1 trees as subtrees. Recurse.
The DP is still exponential in T after pruning: Dominated policy tree pruning helps in practice but doesn't change the worst-case complexity. The number of undominated trees can still be exponential. Dynamic programming for Dec-POMDPs is exact but only practical for small instances (|S|, |A|, |Ω| small and T ≤ 5 or so). All practical algorithms for larger instances are approximations.
In Dec-POMDP dynamic programming, what is a "policy tree" for agent i?

Chapter 4: Iterated Best Response

Exact dynamic programming is intractable for all but small problems. The first practical approximation: iterated best response (IBR) for Dec-POMDPs. Fix all other agents' policies, then optimize the current agent's policy against those fixed policies. Repeat until convergence.

Crucially: with all other agents' policies fixed, agent i's problem reduces to a POMDP. The joint policy of other agents defines a stochastic process over states, and agent i can compute the optimal POMDP policy against this process using any POMDP solver (value iteration, PBVI, POMCP, …).

πi* ← POMDP-Solve(pomg.marginalize(π−i))
IBR is coordinate ascent on the joint policy: Each IBR step increases (or maintains) the joint value Vπ(b0). This is coordinate ascent on the policy space — exactly like alternating least squares or EM. It's guaranteed to converge to a local optimum but not necessarily the global optimal policy. The local optima can be far from the global optimum.
Initialize
Start with arbitrary policies π1(0), …, πn(0).
For each agent i:
Marginalize: fix π−i(k). Construct the induced POMDP for agent i. Solve with any POMDP algorithm. Get πi(k+1).
↻ until convergence
The IBR induced POMDP: When all agents except i have fixed policies, the world from agent i's perspective is a POMDP with:
States: S (same as original)
Actions: Ai (only i's actions)
Transition: T'(s'|s,ai) = ∑a−ij≠i πj(aj|·) T(s'|s,(ai,a−i))
Reward: R'(s,ai) = ∑a−ij πj(aj|·) R(s,(ai,a−i))
This is the "marginalized POMDP" for agent i.
python
def dec_pomdp_ibr(decpomdp, init_policies, max_iter=20, pomdp_solver='pbvi'):
    """Iterated Best Response for Dec-POMDP.
    init_policies: list of n initial policy FSCs.
    Returns locally optimal joint policy."""
    policies = [p.copy() for p in init_policies]
    n = decpomdp.n_agents

    for iteration in range(max_iter):
        improved = False
        for i in range(n):
            # Marginalize: fix all other agents' policies
            induced_pomdp = decpomdp.marginalize(
                agent=i,
                other_policies=policies[:i] + policies[i+1:]
            )
            # Solve the induced POMDP for agent i
            new_pi = solve_pomdp(induced_pomdp, method=pomdp_solver)
            # Check if value improved
            old_val = decpomdp.evaluate(policies)
            policies[i] = new_pi
            new_val = decpomdp.evaluate(policies)
            if new_val > old_val + 1e-6:
                improved = True

        if not improved:
            break  # Converged to local optimum

    return policies
With all other agents' policies fixed, what kind of problem does agent i's optimization reduce to in IBR?

Chapter 5: Heuristic Search

Heuristic search for Dec-POMDPs extends MCTS and point-based methods from single-agent POMDPs to the joint policy space. The key insight: instead of searching over all policy trees, we search over the space of joint policies guided by an upper bound on the optimal value.

JESP (Joint Equilibrium-based Search for Policies) is an IBR variant with a more careful search: instead of solving full POMDP at each step, it searches the space of finite-depth policy trees using best-first search. MADDPG-style methods use centralized critics to guide decentralized actors.

Point-based methods for Dec-POMDPs (PBVI): Single-agent PBVI (Ch 21) maintains a set of belief points and computes alpha vectors at each. For Dec-POMDPs, the analog maintains a set of "joint belief points" and computes joint alpha vectors (value functions over joint beliefs). But the "joint belief" space is exponential in the number of agents. PBVI for Dec-POMDPs is exponentially harder than for POMDPs.

JESP Algorithm

Joint Equilibrium-based Search for Policies. Like IBR but: (1) Evaluates the full policy tree not a POMDP. (2) Uses branch-and-bound pruning to skip dominated subtrees. (3) Iterates until no agent can improve. Finds locally optimal joint policy trees efficiently for horizon T ≤ 10.

MCTS for Dec-POMDPs

Sample joint observation histories via rollouts. Back-propagate average rewards. At each joint history node, choose joint actions to maximize UCB1 score. Works online (plan at each step rather than ahead of time). Approximate but scalable.

The DFSSP approach (direct search over joint plan space): Nair et al. (2003) proposed framing Dec-POMDP optimization as a search over the joint policy tree structure. Instead of alternating IBR, directly optimize all agents' trees simultaneously. Use heuristics (greedy rollouts, random restarts) to avoid local optima. This is the foundation of most practical Dec-POMDP solvers including HSVI-Dec and SARSOP-Dec.
Why is point-based value iteration exponentially harder for Dec-POMDPs than for single-agent POMDPs?

Chapter 6: Nonlinear Programming

For finite-horizon Dec-POMDPs with small action and observation sets, nonlinear programming (NLP) can find locally optimal joint policies by treating the policy parameters as continuous variables and using gradient methods or general-purpose NLP solvers.

Represent each agent's policy as a stochastic policy tree: at each observation history node, a probability distribution over actions. Stack all these distributions into a parameter vector θ ∈ [0,1]d. The joint value V(θ; b0) is then a function of θ — specifically, a multilinear polynomial in the policy parameters.

V(θ; b0) = ∑π1, π2 [∏i πii)] · V(π1, π2, b0)

Since the objective is multilinear (linear in each agent's policy parameters separately, but not jointly), it's non-convex. NLP finds local optima. Multiple random restarts are used to improve solution quality.

Why multilinear? The value function for a fixed joint policy is linear in the initial belief b0. When we write the joint policy as a product of per-agent stochastic choices, the probability of any joint action trajectory is a product of the individual policy parameters — making V multilinear. Each agent's policy parameters appear linearly (when others are fixed), but cross terms make the full problem non-convex.
python
import numpy as np
from scipy.optimize import minimize

def dec_pomdp_nlp(decpomdp, T, n_restarts=10):
    """Nonlinear programming for finite-horizon Dec-POMDP.
    T: horizon. Returns best found joint policy."""
    n_params = decpomdp.policy_tree_size(T)  # dim of theta vector
    best_val = -np.inf; best_theta = None

    for _ in range(n_restarts):
        theta0 = np.random.dirichlet(np.ones(decpomdp.nA1), size=n_params//decpomdp.nA1).flatten()

        def neg_value(theta):
            # Reconstruct policy trees from theta, evaluate
            pi = decpomdp.theta_to_policy(theta, T)
            return -decpomdp.evaluate(pi)

        def neg_grad(theta):
            # Gradient via policy gradient theorem
            pi = decpomdp.theta_to_policy(theta, T)
            return -decpomdp.policy_gradient(pi)

        result = minimize(neg_value, theta0, jac=neg_grad,
                          method='L-BFGS-B',
                          bounds=[(0,1)]*len(theta0))
        if -result.fun > best_val:
            best_val = -result.fun; best_theta = result.x

    return decpomdp.theta_to_policy(best_theta, T), best_val
NLP for infinite-horizon Dec-POMDPs (FSC parameters): For infinite-horizon, use finite state controllers (FSC, Ch 23) for each agent. The policy parameters are (ψi, ηi) — node action distributions and transition distributions. The value satisfies a system of polynomial equations (as in Ch 23). Gradient ascent on FSC parameters works, with the same multilinear structure and local-optima caveat.
Why is the Dec-POMDP NLP objective called "multilinear" in the policy parameters?

Chapter 7: SHOWCASE — Firefighter Dec-POMDP

Two firefighters search a 4×4 building for a survivor. The survivor is in one of the rooms (unknown). Each firefighter hears noisy signals (crying, heat) from their area. No radio contact. They must coordinate purely through their shared protocol (the joint policy).

How coordination works without communication: The key insight of Dec-POMDP solutions: agents can coordinate through their shared initial plan. If both agents know the joint policy (decided before the mission), they can predict what the other will do based on their own observations. "I heard crying in Room A, so I know that, according to our plan, my partner should search Room B." No communication needed — the plan itself coordinates.
SHOWCASE: Decentralized Firefighter Search

Firefighter 1 (teal) starts top-left, Firefighter 2 (orange) starts top-right. The survivor (gold star) is in a random room. Watch how they cover the building with vs. without a coordinated plan. Drag noise to control sensor quality.

Sensor noise 0.20
Step 0 FF1 explored: 0 FF2 explored: 0 Survivor: Not found
What to observe: With the coordinated plan, agents divide the grid — one searches the left half, one the right. They find the survivor quickly without overlap. With independent greedy search, both agents may gravitate to the same area (following their sensor readings) and miss the other half entirely. Random baseline covers the grid but inefficiently. The coordinated plan wins even at high noise because the coordination is in the plan structure, not the sensor quality.

Chapter 8: Complexity & Bounds

Bernstein et al. (2002) proved that Dec-POMDPs are NEXP-complete for the finite-horizon case. This is a doubly-exponential lower bound: algorithms must take time at least 22n in the worst case (for some parameterization). Understanding where this complexity comes from helps identify tractable subclasses.

Where the NEXP hardness comes from: Consider finite-horizon Dec-POMDP with horizon T. Agent i's policy is a tree of depth T branching on |Ωi| observations. The number of distinct policy trees for agent i is |Ai|i|T — already exponential in an exponential. The joint policy space is the product of these two towers. Even checking if a policy is optimal requires examining all alternatives — the verification problem is NP, making the optimization NEXP (nondeterministic PSPACE).

Upper Bounds

The centralized POMDP (agents can communicate perfectly) gives an upper bound: VPOMDP*(b0) ≥ VDec-POMDP*(b0). Any POMDP solver gives an upper bound on the achievable Dec-POMDP value.

Tighter: the centralized MDP (full observability + communication) gives VMDP* ≥ VPOMDP* ≥ VDec-POMDP*.

Lower Bounds

Any feasible joint policy gives a lower bound: evaluate a random or heuristic joint policy, and that value is achievable (the optimal must be at least as good). IBR and NLP give local-optimum lower bounds. The gap between upper and lower bounds measures solution quality.

MethodBound typeQualityCost
Centralized MDP valueUpperLooseO(|S||A|n)
Centralized POMDP valueUpperTightPSPACE (exp)
Random joint policyLowerLooseO(1)
IBRLowerLocal opt.Poly in policy size
Exact DPOptimalExactNEXP
The centralized POMDP value (agents can perfectly communicate) gives what kind of bound on the optimal Dec-POMDP value?

Chapter 9: Connections & Beyond

Dec-POMDPs are the hardest formalism in this book (NEXP-complete). But they capture the essential challenge of real multi-robot systems: shared goals, private sensors, no communication. Understanding the exact formalism tells you how much you're approximating in practice.

RelaxationWhat you gainModel
Allow perfect communicationReduces to POMDP (PSPACE)Centralized POMDP
Assume full observabilityReduces to MMDP (P)Cooperative Markov game
Transition-independenceDecomposable into n POMDPsFactored Dec-POMDP
Finite state controllersInfinite-horizon tractable approximationDec-FSC (NLP)
Local interactionsGraphical model structureND-POMDP (networked)
Modern multi-robot systems bypass Dec-POMDP hardness via: (1) Online replanning: plan a short horizon at each step (MCTS-style), accepting suboptimality. (2) Communication scheduling: allow limited, costly communication — reduces to Dec-POMDP with communication actions. (3) Centralized training, decentralized execution (from Ch 25): train a centralized policy offline, execute individual components at deployment. (4) Emergent communication: learn a communication protocol as part of the policy (differentiable multi-agent RL).
The road not taken: NEXP vs. practical: The NEXP-completeness of Dec-POMDPs has been known since 2002. In the 20+ years since, no polynomial algorithm has been found (nor is one expected). The field has instead focused on: (1) tractable subclasses (transition-independence, local interactions), (2) approximation algorithms with bounded suboptimality guarantees, and (3) learning-based approaches that don't require explicit model specification. This is the standard response to NP/PSPACE/NEXP hardness in AI.
"The decentralized POMDP captures what is arguably the most fundamental problem in multi-agent systems: how should a team of agents with private information act to achieve a common goal? The difficulty of the answer — NEXP-complete — tells us that this problem is not going to be solved by cleverness alone."
— Paraphrase, Bernstein, Givan, Immerman, Zilberstein, "The Complexity of Decentralized Control of Markov Decision Processes" (2002)
What is the key insight that makes the "centralized training, decentralized execution" paradigm work for Dec-POMDPs?
← Ch 26: State Uncertainty Index Back to Index →