Ch 27: Collaborative Agents — Algorithms for Decision Making

Chapter 0: The Coordination Problem

Two firefighters enter a burning building from different sides. They share one goal: rescue survivors. But they can't talk — radio silence to avoid tipping off structural instabilities. Each only hears what's in their own section. Neither knows what the other has found, or where they are.

How should they plan? If Firefighter 1 goes left, should Firefighter 2 also go left (to assist) or right (to cover more ground)? The answer depends on what the other found — which neither can know.

This is the decentralized POMDP (Dec-POMDP) problem: multiple agents, shared reward, partial observations, no communication. It's cooperative — there's no conflict, only uncertainty. And yet it's extraordinarily hard: NEXP-complete, strictly harder than the single-agent POMDP (PSPACE).

Why is cooperation under uncertainty harder than competition? In a competitive game (Ch 24–26), each agent optimizes for themselves. Independence is a feature — each agent's POMDP is well-defined given a model of opponents. In Dec-POMDPs, agents share a reward but can't see each other's information. The optimal joint policy cannot be decomposed into independent agent policies — they must be implicitly coordinated by the joint plan, even though each agent acts on private observations.

The Coordination Gap

Two agents, four rooms. A target can appear in any room. With communication, they can optimally divide. Without communication, they might both go to the same room. Click "Coordinate" vs "Independent" to see the difference in coverage.

Click a policy type to compare coverage.

The price of decentralization: With perfect communication, two agents achieve the same value as a centralized POMDP. With zero communication (Dec-POMDP), the best achievable value can be strictly lower. The gap is the cost of decentralization. In the Tiger problem (a standard POMDP benchmark), the Dec-POMDP version has significantly lower optimal value because agents can't coordinate their listening decisions.

Why is the Dec-POMDP problem NEXP-complete while a single-agent POMDP is only PSPACE?

Because the joint policy space grows doubly-exponentially: each agent has exponentially many policies in the horizon, and the joint policy is the product Because agents must reason about each other's beliefs (like I-POMDPs) Because Dec-POMDPs have larger state spaces

Chapter 1: Dec-POMDP Formalism

A decentralized POMDP (Dec-POMDP) (Bernstein et al., 2002) for n agents is (I, S, A₁, …, A_n, T, R, Ω₁, …, Ω_n, O, b₀):

Component	Notation	Key difference from POMG
Shared reward	R(s, a)	Single reward function, not per-agent R_i
Observations	Ω_i, O(o₁…o_n\|s',a)	Private — agent i only sees o_i
No communication	—	No message-passing channel between agents
Initial belief	b₀(s)	Common prior over initial state

Agents aim to maximize the joint expected reward ∑_t γ^t R(s_t, a_t), but each agent i chooses a_i based only on their private observation history (o_i,1, a_i,1, …, o_i,t).

A local policy for agent i is π_i : history_i → A_i. An optimal joint policy (π₁*, …, π_n*) maximizes V^π(b₀) = E^π[∑_t γ^t R_t | b₀].

Why there's no "belief state" for Dec-POMDPs: In a single-agent POMDP, the belief b(s) is a sufficient statistic for the history. In a Dec-POMDP, the relevant history for the optimal joint decision is (o_1,1:t, o_2,1:t, …) — all agents' observations. But no single agent has this full history. The "joint belief" is a sufficient statistic but isn't available to any agent for policy computation.

Policy representation: local histories vs. finite state controllers: A local policy can be represented as a decision tree over observation histories (exponential in horizon T) or as a finite state controller (FSC, Ch 23). For Dec-POMDPs, each agent has their own FSC. The joint Dec-POMDP policy is a pair of FSCs, one per agent, with no shared state. This is the key representation for infinite-horizon Dec-POMDPs.

What makes the Dec-POMDP reward structure fundamentally different from a general POMG?

All agents share a single reward function R(s,a) rather than having individual rewards R_i Agents in Dec-POMDPs can observe each other's actions Dec-POMDPs don't require a transition model

Chapter 2: Subclasses

Dec-POMDPs are NEXP-complete in general. Several subclasses trade generality for tractability. Understanding the subclass structure shows which assumptions buy which computational savings.

Dec-MDP

A Dec-POMDP where the joint state is fully observable: given the joint observation (o₁, …, o_n), the state s is fully determined. Each agent has a private local observation o_i = f_i(s) for some deterministic function f_i. This is not the same as full observability — agents still don't see each other's observations.

Complexity: NP-complete (easier than Dec-POMDP). Solvable by dynamic programming on the joint observation tree.

MMDP

A multiagent MDP: fully observable state, multiple agents, shared reward. No observation model needed. Each agent sees the full state s. Optimal joint policy = correlated policy over joint actions (a₁, …, a_n) per state. Since there's no private information, this reduces to solving a single centralized MDP over the joint action space A₁ × … × A_n.

Complexity: P (just an MDP with larger action space).

Model	State obs.	Other's obs.	Shared reward	Complexity
MMDP	Full	Full	Yes	P (MDP)
Dec-MDP	Full (jointly)	Private	Yes	NP
Dec-POMDP	Partial	Private	Yes	NEXP
POMG	Partial	Private	No (per-agent)	PSPACE+PPAD
POMDP	Partial	N/A (1 agent)	N/A	PSPACE

Transition-independent Dec-POMDPs: If the transition factors as T(s'|s,a) = ∏_i T_i(s_i'|s_i,a_i) (each agent affects only their local state), then the problem decomposes into n independent POMDPs coupled only through the shared reward. This structure can be exploited for dramatically more efficient planning. Real-world multi-robot systems often have (approximately) transition-independent dynamics.

An MMDP (multiagent MDP) is equivalent to what kind of single-agent problem?

An MDP with a larger joint action space A₁×…×A_n (since all agents see the full state and share a reward) A POMDP with the belief over all agents' observations A normal form game (since time doesn't matter)

Chapter 3: Dynamic Programming

For finite-horizon Dec-POMDPs, backward induction gives the optimal joint policy. The key representation: at each time step, each agent's policy is a policy tree — a decision tree that maps the sequence of observations (o_i,1, …, o_i,t) to an action at each step.

A policy tree of depth T for agent i with action set A_i and observation set Ω_i has |A_i| × |Ω_i|^T−1 leaves. The space of all policy trees is exponential in T. But we only need to consider the Pareto front of policy trees — those not dominated for any initial state.

The joint policy tree and pruning: Given the set of policy trees Q₁ for agent 1 and Q₂ for agent 2, the joint policy is (π₁, π₂) ∈ Q₁ × Q₂. We evaluate each joint policy pair's value for initial belief b₀, then keep only the optimal pair. But |Q₁| = |A₁|^{|Ω₁|^T} — doubly exponential in T. Even evaluating all pairs is NEXP-hard, confirming the complexity lower bound.

Step 1: Generate

Enumerate all single-step policy trees for each agent: one per action a_i, branching on observations. At depth-1, |A_i| trees per agent.

↓

Step 2: Value

For each joint pair (π₁, π₂), compute V(π₁, π₂, b₀) = expected reward over all states weighted by b₀.

↓

Step 3: Prune

Remove dominated policy trees. For agent 1: π₁ is dominated if ∀ π₂, V(π₁', π₂) ≥ V(π₁, π₂) for some π₁'.

↓

Step 4: Extend

For depth-2: prepend an action and branch on observations. Use pruned depth-1 trees as subtrees. Recurse.

The DP is still exponential in T after pruning: Dominated policy tree pruning helps in practice but doesn't change the worst-case complexity. The number of undominated trees can still be exponential. Dynamic programming for Dec-POMDPs is exact but only practical for small instances (|S|, |A|, |Ω| small and T ≤ 5 or so). All practical algorithms for larger instances are approximations.

In Dec-POMDP dynamic programming, what is a "policy tree" for agent i?

A decision tree that maps the agent's private observation history to actions at each time step, branching on each possible observation A tree of world states reachable by agent i's actions A belief tree over the joint state space

Chapter 4: Iterated Best Response

Exact dynamic programming is intractable for all but small problems. The first practical approximation: iterated best response (IBR) for Dec-POMDPs. Fix all other agents' policies, then optimize the current agent's policy against those fixed policies. Repeat until convergence.

Crucially: with all other agents' policies fixed, agent i's problem reduces to a POMDP. The joint policy of other agents defines a stochastic process over states, and agent i can compute the optimal POMDP policy against this process using any POMDP solver (value iteration, PBVI, POMCP, …).

π_i* ← POMDP-Solve(pomg.marginalize(π_−i))

IBR is coordinate ascent on the joint policy: Each IBR step increases (or maintains) the joint value V^π(b₀). This is coordinate ascent on the policy space — exactly like alternating least squares or EM. It's guaranteed to converge to a local optimum but not necessarily the global optimal policy. The local optima can be far from the global optimum.

Initialize

Start with arbitrary policies π₁⁽⁰⁾, …, π_n⁽⁰⁾.

↓

For each agent i:

Marginalize: fix π_−i^(k). Construct the induced POMDP for agent i. Solve with any POMDP algorithm. Get π_i^(k+1).

↻ until convergence

The IBR induced POMDP: When all agents except i have fixed policies, the world from agent i's perspective is a POMDP with:
States: S (same as original)
Actions: A_i (only i's actions)
Transition: T'(s'|s,a_i) = ∑_{a_−i} ∏_j≠i π_j(a_j|·) T(s'|s,(a_i,a_−i))
Reward: R'(s,a_i) = ∑_{a_−i} ∏_j π_j(a_j|·) R(s,(a_i,a_−i))
This is the "marginalized POMDP" for agent i.

python
def dec_pomdp_ibr(decpomdp, init_policies, max_iter=20, pomdp_solver='pbvi'):
    """Iterated Best Response for Dec-POMDP.
    init_policies: list of n initial policy FSCs.
    Returns locally optimal joint policy."""
    policies = [p.copy() for p in init_policies]
    n = decpomdp.n_agents

    for iteration in range(max_iter):
        improved = False
        for i in range(n):
            # Marginalize: fix all other agents' policies
            induced_pomdp = decpomdp.marginalize(
                agent=i,
                other_policies=policies[:i] + policies[i+1:]
            )
            # Solve the induced POMDP for agent i
            new_pi = solve_pomdp(induced_pomdp, method=pomdp_solver)
            # Check if value improved
            old_val = decpomdp.evaluate(policies)
            policies[i] = new_pi
            new_val = decpomdp.evaluate(policies)
            if new_val > old_val + 1e-6:
                improved = True

        if not improved:
            break  # Converged to local optimum

    return policies

With all other agents' policies fixed, what kind of problem does agent i's optimization reduce to in IBR?

A POMDP — the fixed policies of other agents define a stochastic transition and reward, which agent i can plan against using any POMDP solver A normal form game — agent i picks the best single action A fully observable MDP — other agents' policies reveal the state

Chapter 5: Heuristic Search

Heuristic search for Dec-POMDPs extends MCTS and point-based methods from single-agent POMDPs to the joint policy space. The key insight: instead of searching over all policy trees, we search over the space of joint policies guided by an upper bound on the optimal value.

JESP (Joint Equilibrium-based Search for Policies) is an IBR variant with a more careful search: instead of solving full POMDP at each step, it searches the space of finite-depth policy trees using best-first search. MADDPG-style methods use centralized critics to guide decentralized actors.

Point-based methods for Dec-POMDPs (PBVI): Single-agent PBVI (Ch 21) maintains a set of belief points and computes alpha vectors at each. For Dec-POMDPs, the analog maintains a set of "joint belief points" and computes joint alpha vectors (value functions over joint beliefs). But the "joint belief" space is exponential in the number of agents. PBVI for Dec-POMDPs is exponentially harder than for POMDPs.

JESP Algorithm

Joint Equilibrium-based Search for Policies. Like IBR but: (1) Evaluates the full policy tree not a POMDP. (2) Uses branch-and-bound pruning to skip dominated subtrees. (3) Iterates until no agent can improve. Finds locally optimal joint policy trees efficiently for horizon T ≤ 10.

MCTS for Dec-POMDPs

Sample joint observation histories via rollouts. Back-propagate average rewards. At each joint history node, choose joint actions to maximize UCB1 score. Works online (plan at each step rather than ahead of time). Approximate but scalable.

The DFSSP approach (direct search over joint plan space): Nair et al. (2003) proposed framing Dec-POMDP optimization as a search over the joint policy tree structure. Instead of alternating IBR, directly optimize all agents' trees simultaneously. Use heuristics (greedy rollouts, random restarts) to avoid local optima. This is the foundation of most practical Dec-POMDP solvers including HSVI-Dec and SARSOP-Dec.

Why is point-based value iteration exponentially harder for Dec-POMDPs than for single-agent POMDPs?

The joint belief space (over all agents' private histories) grows exponentially in the number of agents, making the belief simplex exponentially larger Dec-POMDPs have continuous state spaces that PBVI can't handle There are no alpha vectors in Dec-POMDPs

Chapter 6: Nonlinear Programming

For finite-horizon Dec-POMDPs with small action and observation sets, nonlinear programming (NLP) can find locally optimal joint policies by treating the policy parameters as continuous variables and using gradient methods or general-purpose NLP solvers.

Represent each agent's policy as a stochastic policy tree: at each observation history node, a probability distribution over actions. Stack all these distributions into a parameter vector θ ∈ [0,1]^d. The joint value V(θ; b₀) is then a function of θ — specifically, a multilinear polynomial in the policy parameters.

V(θ; b₀) = ∑_{π₁, π₂} [∏_i π_i(θ_i)] · V(π₁, π₂, b₀)

Since the objective is multilinear (linear in each agent's policy parameters separately, but not jointly), it's non-convex. NLP finds local optima. Multiple random restarts are used to improve solution quality.

Why multilinear? The value function for a fixed joint policy is linear in the initial belief b₀. When we write the joint policy as a product of per-agent stochastic choices, the probability of any joint action trajectory is a product of the individual policy parameters — making V multilinear. Each agent's policy parameters appear linearly (when others are fixed), but cross terms make the full problem non-convex.

python
import numpy as np
from scipy.optimize import minimize

def dec_pomdp_nlp(decpomdp, T, n_restarts=10):
    """Nonlinear programming for finite-horizon Dec-POMDP.
    T: horizon. Returns best found joint policy."""
    n_params = decpomdp.policy_tree_size(T)  # dim of theta vector
    best_val = -np.inf; best_theta = None

    for _ in range(n_restarts):
        theta0 = np.random.dirichlet(np.ones(decpomdp.nA1), size=n_params//decpomdp.nA1).flatten()

        def neg_value(theta):
            # Reconstruct policy trees from theta, evaluate
            pi = decpomdp.theta_to_policy(theta, T)
            return -decpomdp.evaluate(pi)

        def neg_grad(theta):
            # Gradient via policy gradient theorem
            pi = decpomdp.theta_to_policy(theta, T)
            return -decpomdp.policy_gradient(pi)

        result = minimize(neg_value, theta0, jac=neg_grad,
                          method='L-BFGS-B',
                          bounds=[(0,1)]*len(theta0))
        if -result.fun > best_val:
            best_val = -result.fun; best_theta = result.x

    return decpomdp.theta_to_policy(best_theta, T), best_val

NLP for infinite-horizon Dec-POMDPs (FSC parameters): For infinite-horizon, use finite state controllers (FSC, Ch 23) for each agent. The policy parameters are (ψ_i, η_i) — node action distributions and transition distributions. The value satisfies a system of polynomial equations (as in Ch 23). Gradient ascent on FSC parameters works, with the same multilinear structure and local-optima caveat.

Why is the Dec-POMDP NLP objective called "multilinear" in the policy parameters?

It's linear in each individual agent's policy parameters separately (when others are fixed), but non-convex jointly due to cross terms between agents It involves multiple linear constraints from the belief simplex The gradient requires solving multiple linear systems

Chapter 7: SHOWCASE — Firefighter Dec-POMDP

Two firefighters search a 4×4 building for a survivor. The survivor is in one of the rooms (unknown). Each firefighter hears noisy signals (crying, heat) from their area. No radio contact. They must coordinate purely through their shared protocol (the joint policy).

How coordination works without communication: The key insight of Dec-POMDP solutions: agents can coordinate through their shared initial plan. If both agents know the joint policy (decided before the mission), they can predict what the other will do based on their own observations. "I heard crying in Room A, so I know that, according to our plan, my partner should search Room B." No communication needed — the plan itself coordinates.

SHOWCASE: Decentralized Firefighter Search

Firefighter 1 (teal) starts top-left, Firefighter 2 (orange) starts top-right. The survivor (gold star) is in a random room. Watch how they cover the building with vs. without a coordinated plan. Drag noise to control sensor quality.

Sensor noise 0.20

Step 0 FF1 explored: 0 FF2 explored: 0 Survivor: Not found

What to observe: With the coordinated plan, agents divide the grid — one searches the left half, one the right. They find the survivor quickly without overlap. With independent greedy search, both agents may gravitate to the same area (following their sensor readings) and miss the other half entirely. Random baseline covers the grid but inefficiently. The coordinated plan wins even at high noise because the coordination is in the plan structure, not the sensor quality.

Chapter 8: Complexity & Bounds

Bernstein et al. (2002) proved that Dec-POMDPs are NEXP-complete for the finite-horizon case. This is a doubly-exponential lower bound: algorithms must take time at least 2^2ⁿ in the worst case (for some parameterization). Understanding where this complexity comes from helps identify tractable subclasses.

Where the NEXP hardness comes from: Consider finite-horizon Dec-POMDP with horizon T. Agent i's policy is a tree of depth T branching on |Ω_i| observations. The number of distinct policy trees for agent i is |A_i|^{|Ω_i|^T} — already exponential in an exponential. The joint policy space is the product of these two towers. Even checking if a policy is optimal requires examining all alternatives — the verification problem is NP, making the optimization NEXP (nondeterministic PSPACE).

Upper Bounds

The centralized POMDP (agents can communicate perfectly) gives an upper bound: V^POMDP*(b₀) ≥ V^Dec-POMDP*(b₀). Any POMDP solver gives an upper bound on the achievable Dec-POMDP value.

Tighter: the centralized MDP (full observability + communication) gives V^MDP* ≥ V^POMDP* ≥ V^Dec-POMDP*.

Lower Bounds

Any feasible joint policy gives a lower bound: evaluate a random or heuristic joint policy, and that value is achievable (the optimal must be at least as good). IBR and NLP give local-optimum lower bounds. The gap between upper and lower bounds measures solution quality.

Method	Bound type	Quality	Cost
Centralized MDP value	Upper	Loose	O(\|S\|\|A\|ⁿ)
Centralized POMDP value	Upper	Tight	PSPACE (exp)
Random joint policy	Lower	Loose	O(1)
IBR	Lower	Local opt.	Poly in policy size
Exact DP	Optimal	Exact	NEXP

The centralized POMDP value (agents can perfectly communicate) gives what kind of bound on the optimal Dec-POMDP value?

An upper bound — communication can only help, so the centralized value is at least as large as the decentralized optimal A lower bound — coordinated agents achieve more An exact bound — communication doesn't change the optimal value

Chapter 9: Connections & Beyond

Dec-POMDPs are the hardest formalism in this book (NEXP-complete). But they capture the essential challenge of real multi-robot systems: shared goals, private sensors, no communication. Understanding the exact formalism tells you how much you're approximating in practice.

Relaxation	What you gain	Model
Allow perfect communication	Reduces to POMDP (PSPACE)	Centralized POMDP
Assume full observability	Reduces to MMDP (P)	Cooperative Markov game
Transition-independence	Decomposable into n POMDPs	Factored Dec-POMDP
Finite state controllers	Infinite-horizon tractable approximation	Dec-FSC (NLP)
Local interactions	Graphical model structure	ND-POMDP (networked)

Modern multi-robot systems bypass Dec-POMDP hardness via: (1) Online replanning: plan a short horizon at each step (MCTS-style), accepting suboptimality. (2) Communication scheduling: allow limited, costly communication — reduces to Dec-POMDP with communication actions. (3) Centralized training, decentralized execution (from Ch 25): train a centralized policy offline, execute individual components at deployment. (4) Emergent communication: learn a communication protocol as part of the policy (differentiable multi-agent RL).

The road not taken: NEXP vs. practical: The NEXP-completeness of Dec-POMDPs has been known since 2002. In the 20+ years since, no polynomial algorithm has been found (nor is one expected). The field has instead focused on: (1) tractable subclasses (transition-independence, local interactions), (2) approximation algorithms with bounded suboptimality guarantees, and (3) learning-based approaches that don't require explicit model specification. This is the standard response to NP/PSPACE/NEXP hardness in AI.

"The decentralized POMDP captures what is arguably the most fundamental problem in multi-agent systems: how should a team of agents with private information act to achieve a common goal? The difficulty of the answer — NEXP-complete — tells us that this problem is not going to be solved by cleverness alone."
— Paraphrase, Bernstein, Givan, Immerman, Zilberstein, "The Complexity of Decentralized Control of Markov Decision Processes" (2002)

What is the key insight that makes the "centralized training, decentralized execution" paradigm work for Dec-POMDPs?

During training, agents can share information to find a good joint policy; at execution, each agent follows their local policy component without communication, and the coordination is baked into the plan Centralized training gives each agent a centralized policy that they execute independently The centralized trainer eliminates the partial observability during execution

← Ch 26: State Uncertainty Index Back to Index →