Ch 23: Controller Abstractions — Algorithms for Decision Making

Chapter 0: Why Controllers?

You've just solved a POMDP. The algorithm hands you a policy π(b) → a. At runtime, you observe o, update your belief b, and query the policy. Sounds fine — until you realize that "updating your belief" requires knowing T(s'|s,a) and O(o|a,s') in full, and that a probability distribution over 10,000 states must be maintained exactly at each step.

What if the robot just doesn't know its model precisely? What if the state space is so large that maintaining a full distribution is prohibitive? What if you need to deploy to an embedded system with 256KB of RAM?

Finite state controllers (FSCs) sidestep the belief entirely. A controller maintains its own internal node x ∈ X = {x₁, ..., x_n}. It selects actions based on x, not on any probability distribution. It transitions x → x' based on the action taken and the observation received. No Bayes. No matrix multiplication. Just a lookup table.

The bet: You're betting that a small finite set of nodes can capture enough "memory" of past observations to act near-optimally. This is a lossy compression of history. The payoff: zero runtime complexity beyond a table lookup. The risk: finite nodes may not be enough for some problems.

Belief-based policy:

Runtime: maintain b(s) for all s. Update b via Bayes after each (a,o). Query π(b). Memory: O(|S|). Requires full POMDP model at runtime.

Controller policy:

Runtime: maintain current node x. Sample action from ψ(·|x). Transition x via η(·|x,a,o). Memory: O(|X|·|A|·|O|). No model needed at runtime.

Controllers generalize conditional plans: A conditional plan (Chapter 20) is a tree — it grows exponentially with depth and can only represent finite-horizon policies. A controller is a directed graph with cycles. A 3-node controller can represent an infinite-horizon policy that would require an infinite tree. This is the same relationship as FSMs vs. parse trees in formal language theory.

What is the primary runtime advantage of a finite state controller over a belief-based policy?

Controllers always find the globally optimal policy Controllers require no belief update or POMDP model at runtime — just a table lookup Controllers have lower memory requirements because they ignore observations

Chapter 1: FSC Structure

A finite state controller is a tuple (X, ψ, η) where:

Component	Notation	Meaning
Node set	X = {x₁, ..., x_n}	Internal states of the controller
Action distribution	ψ(a\|x)	P(take action a \| currently at node x)
Successor distribution	η(x'\|x, a, o)	P(go to node x' \| at x, took a, observed o)

Execution: at node x, sample action a ~ ψ(·|x). Environment produces observation o. Transition to x' ~ η(·|x, a, o). Repeat forever. The policy is entirely encoded in (ψ, η) — two lookup tables.

For the crying baby problem, a simple 2-node controller captures the intuitive policy:

Crying Baby FSC (2-node example)

Node 1 (teal): "default" — take the listen action. If observe crying, jump to Node 2. If quiet, stay. Node 2 (orange): "urgent" — take the feed action. If quiet, return to Node 1. If still crying, stay. Click Animate to watch the controller run for 10 steps.

Step 0 | x=1 | —

Stochastic controllers matter for optimization: Both ψ and η are probability distributions, not deterministic functions. This is crucial for optimization: it makes the policy space continuous and differentiable. A deterministic controller has discrete choices (hard to optimize with gradients). A stochastic controller has real-valued parameters in [0,1] (amenable to gradient ascent and nonlinear programming).

Special case — deterministic controller: When ψ(a|x) ∈ {0,1} and η(x'|x,a,o) ∈ {0,1}, the controller is a deterministic FSM. At each node x, it always takes the same action and always transitions to the same next node given (a,o). Deterministic controllers are simpler to interpret but harder to optimize (the parameter space is discrete).

In an FSC, what determines which action the agent takes at each step?

The action distribution ψ(a|x) at the current node x The current belief state b The full history of past observations

Chapter 2: Evaluating FSCs

Given a controller, how much reward does it actually get? We need to compute V_x(s) = expected discounted return starting in state s and controller node x.

The evaluation is defined by a system of linear equations — one per (x, s) pair. For a stochastic controller:

V_x(s) = ∑_a ψ(a|x) &left;[ R(s,a) + γ ∑_s' T(s'|s,a) ∑_o O(o|a,s') ∑_x' η(x'|x,a,o) V_x'(s') &right;]

This looks complicated, but for a fixed controller it's just a linear system: |X|·|S| equations in |X|·|S| unknowns. Solve it once (e.g., with Gaussian elimination) and you have the value of the controller from every (x, s) pair.

From values to belief values: Given the per-(x,s) values V_x(s) and an initial belief b₀, the expected value of starting the controller in node x_init is: V_{x_init}(b₀) = ∑_s b₀(s) V_{x_init}(s). This is a dot product — each column of V gives an alpha vector! So an n-node controller produces exactly n alpha vectors. This links controllers directly to the alpha-vector machinery from Chapters 20-21.

The computational benefit of evaluation: Evaluating a fixed controller takes O((|X||S|)²) for Gaussian elimination — polynomial in controller size. Evaluating the optimal value function over the belief simplex is PSPACE-complete. So controllers give a polynomial-time computable lower bound on U*(b).

julia
function evaluate_controller(pomdp, fsc)
    nS, nX = length(states(pomdp)), length(fsc.nodes)
    # Build linear system: V = R + γ T O η V
    # Row idx: (x-1)*nS + s. Build (nX*nS) × (nX*nS) matrix.
    A = zeros(nX*nS, nX*nS)
    b = zeros(nX*nS)

    for (xi, x) in enumerate(fsc.nodes)
        for (si, s) in enumerate(states(pomdp))
            row = (xi-1)*nS + si
            A[row, row] = 1.0  # V[x,s] coefficient
            exp_r = 0.0
            for a in actions(pomdp)
                pa = fsc.ψ(a, x)
                exp_r += pa * reward(pomdp, s, a)
                for (si2, s2) in enumerate(states(pomdp))
                    for o in observations(pomdp)
                        for (xi2, x2) in enumerate(fsc.nodes)
                            col = (xi2-1)*nS + si2
                            A[row, col] -= discount(pomdp) * pa *
                                transition(pomdp,s,a,s2) *
                                obs(pomdp,a,s2,o) *
                                fsc.η(x2,x,a,o)
                        end
                    end
                end
            end
            b[row] = exp_r
        end
    end
    V_flat = A \ b
    return reshape(V_flat, nS, nX)  # V[s, x] = value at state s, node x
end

A finite state controller with n nodes produces how many alpha vectors?

Exactly n (one per controller node, each giving V_x(s) for all s) n·|A| (one per node-action pair) Exponentially many, same as conditional plans

Chapter 3: Policy Iteration for Controllers

We know how to evaluate a given controller. Now: how do we improve it? Policy iteration for FSCs alternates between evaluation and a greedy improvement step — the same structure as MDP policy iteration.

The improvement step: given the current value function V_x(s) and the current controller (ψ, η), find new distributions (ψ*, η*) that maximize the value at each (x, s):

(ψ*, η*) = argmax_{ψ, η} ∑_{x, s} b_x(s) V_x(s)

where b_x(s) is the probability of being in state s at node x (the stationary distribution over (x,s) pairs). This is a linear program in the controller parameters: maximize a linear objective over the simplex constraints ∑_a ψ(a|x)=1 and ∑_x' η(x'|x,a,o)=1.

Initialize

Start with a random or heuristic controller (ψ⁽⁰⁾, η⁽⁰⁾). Fix n = |X|.

↓

Evaluate

Solve the linear system to get V_x(s) for all (x,s).

↓

Improve

Solve an LP to find (ψ*, η*) that maximize weighted value.

↓

Check convergence

If value improved by less than ε, stop. Otherwise, update (ψ,η) := (ψ*,η*) and repeat.

Policy iteration converges: Each improvement step is guaranteed to be non-decreasing in value. Since the controller parameter space is compact, and the value is bounded above by U*(b), the algorithm must converge. In practice, it converges quickly — often in 5-15 iterations for small problems.

The LP formulation: The improvement step is a linear program because the value function V_x(s) is fixed (from the evaluation step), so the objective ∑_x,s b_x(s) V_x(s) is linear in the occupancy weights. Maximizing a linear function over the product of simplices is always a linear program, solvable in polynomial time.

Why is FSC policy iteration guaranteed to converge?

Each improvement step is non-decreasing in value, and the value is bounded above by U*(b) The linear program always has a unique optimal solution The controller parameter space is finite

Chapter 4: Node Expansion

Policy iteration with a fixed n-node controller finds the best policy expressible with n nodes. But what if n is too small? The controller might be fundamentally limited, unable to represent anything near the optimal policy regardless of what (ψ, η) we choose.

Node expansion increases n when the current controller has converged to a local optimum. The idea: add a new node x_new and initialize it with a useful purpose. The most effective strategy is to initialize x_new as a copy of the existing node that is most frequently visited weighted by missed value.

After adding x_new, run policy iteration again. The new node gives the optimization more freedom to represent policies that require more "memory." Repeat expansion until the value increase per expansion is below threshold.

Why start small? Starting with n=1 and expanding is better than starting with a large n because: (1) small controllers are faster to optimize (fewer parameters), (2) smaller controllers are easier to interpret, and (3) greedy expansion adds nodes where they help most, rather than allocating n nodes all at once and hoping they find useful roles. This is analogous to neuron growth in neural network architectures.

Controller Size vs. Policy Quality

Crying Baby POMDP. X-axis: number of controller nodes n. Y-axis: value at initial belief P(hungry)=0.5. More nodes = better value (up to a point where near-optimal is reached). Click a bar to see the value.

Click to run

Optimal number of nodes: There's always a minimum n above which more nodes don't help. For the crying baby problem, n=2 or n=3 achieves near-optimal performance. For richer problems (e.g., robot navigation with many landmarks), you might need n=20-50. The key question is not "how many nodes" but "how much history do I need to remember?"

Connection to alpha vectors: An n-node controller produces n alpha vectors {V_x₁, ..., V_{x_n}}. The value at belief b is max_x V_x·b. So evaluating a controller is equivalent to running PBVI with a belief-dependent starting node. Node expansion is equivalent to adding a new alpha vector to Γ. The two approaches are dual.

Node expansion adds a new controller node when policy iteration has converged. Why does this help?

It gives the optimization more freedom to represent policies requiring more observation history Adding nodes always guarantees convergence to the globally optimal policy More nodes reduces the size of the linear program in the improvement step

Chapter 5: Nonlinear Programming Formulation

Policy iteration for FSCs requires solving an LP in the improvement step. But LPs assume a fixed evaluation. What if we want to jointly optimize both the evaluation and the improvement, treating (ψ, η, V) as one big optimization problem?

The FSC optimization over a fixed n can be written as a nonlinear program (NLP). The optimization variables are:

ψ_a,x = probability of taking action a at node x. Constraints: ψ_a,x ≥ 0, ∑_a ψ_a,x = 1 for all x.
η_x',x,a,o = probability of transitioning to x' from x after (a,o). Constraints: η ≥ 0, ∑_x' η_x',x,a,o = 1 for all x, a, o.
V_x,s = value at (x,s). Must satisfy the Bellman equations as constraints.

The objective: maximize V_x₀·b₀ = ∑_s b₀(s) V_x₀(s) for an initial belief b₀ and starting node x₀.

Why nonlinear? The Bellman constraint that V must satisfy is nonlinear in (ψ, η, V): it multiplies ψ and η with V. The constraint: V_x,s = ∑_a ψ_a,x[R(s,a) + γ∑_s',o,x' T(s'|s,a)O(o|a,s')η_x',x,a,oV_x',s']. This is bilinear in (ψ,V) and (η,V) — hence NLP, not LP.

Bilinear programs: The NLP here is specifically a bilinear program: products of two sets of variables. Bilinear programs are NP-hard in general, but the structure here (compact constraint set, well-conditioned Bellman equations) makes gradient-based methods work well in practice. The local optima correspond to locally-good controllers that might not be globally optimal.

python (NLP formulation sketch)
from scipy.optimize import minimize
import numpy as np

def fsc_nlp(pomdp, n_nodes, b0):
    # Pack ψ and η into a single parameter vector θ
    nS, nA, nO = pomdp.nS, pomdp.nA, pomdp.nO
    nX = n_nodes

    def unpack(theta):
        # ψ: (nX, nA) softmax'd per row
        psi_raw = theta[:nX*nA].reshape(nX, nA)
        psi = softmax(psi_raw, axis=1)
        # η: (nX, nA, nO, nX) softmax'd per (x,a,o)
        eta_raw = theta[nX*nA:].reshape(nX, nA, nO, nX)
        eta = softmax(eta_raw, axis=3)
        return psi, eta

    def objective(theta):
        psi, eta = unpack(theta)
        V = evaluate_controller(pomdp, psi, eta)  # solve linear system
        return -np.dot(b0, V[0, :])  # maximize value at node 0

    theta0 = np.random.randn(nX*nA + nX*nA*nO*nX) * 0.1
    result = minimize(objective, theta0, method='L-BFGS-B')
    return unpack(result.x)

The FSC optimization is a nonlinear program rather than a linear program because:

The Bellman constraints involve products of controller parameters (ψ, η) and value variables V The reward function R(s,a) is nonlinear in (s,a) The belief simplex has nonlinear geometry

Chapter 6: Gradient Ascent for Controllers

NLP solvers (L-BFGS, SLSQP) require computing gradients. For FSCs, we need ∂V_x₀(b₀) / ∂ψ and ∂V_x₀(b₀) / ∂η. These can be computed by differentiating through the linear system that defines V.

The gradient of V with respect to controller parameters can be derived from the Bellman equations. Let L = I − γP be the matrix defining the linear system LV = r (where r is the expected reward vector and P encodes the transition and controller dynamics). Then:

∂V / ∂θ = L⁻¹ ∂r / ∂θ − L⁻¹ (∂P / ∂θ) V

where θ represents any controller parameter (ψ_a,x or η_x',x,a,o). This can be computed efficiently by: (1) solving LV = r for V (forward pass), (2) computing the right-hand side, (3) solving L^⊤λ = c for an adjoint vector (backward pass). The full gradient costs O(2 · |X||S| linear system solve).

Gradient Ascent on a 2-Node Controller

Optimize ψ and η for the 2-node crying baby controller via gradient ascent. X-axis: gradient step. Y-axis: value at belief b=[0.5,0.5]. Click Step to take one gradient step, or Run 50 to converge. Watch the value climb.

Step 0 | V(b=0.5) = —

Learning rate α0.020

Local optima are a real problem: Gradient ascent converges to local optima, not global. For FSCs, the landscape can have many local optima corresponding to different "strategies" encoded by the nodes. Running gradient ascent from multiple random initializations and keeping the best result (multi-start optimization) is the standard approach. Adding nodes and re-optimizing also helps escape shallow local optima.

Natural gradient: The standard gradient treats all parameters equally. The natural gradient scales by the Fisher information matrix, accounting for the fact that small changes in ψ near the boundary (e.g., ψ=0.01) have different effects than near the center (e.g., ψ=0.5). Natural gradient methods converge faster for FSC optimization, at the cost of computing the Fisher matrix.

Gradient ascent for FSC optimization is subject to local optima. What is the standard remedy?

Multi-start optimization: run gradient ascent from many random initializations, keep the best Always initialize with the QMDP policy to avoid local optima Use a higher learning rate to escape local optima

Chapter 7: SHOWCASE — FSC Running Live

Watch a finite state controller execute on the crying baby POMDP over 50 steps. The controller's internal node is shown as the active node in the graph. You can see how the controller responds to crying vs. quiet observations, feeding when appropriate and listening to gather information. Compare with the belief-based policy: both make similar decisions, but the controller never computes a probability.

FSC Execution on Crying Baby POMDP

Left: the FSC graph (2 nodes, 3 actions). Highlighted = current node. Right: the actual world state and observations over time. Each step: controller picks an action, environment transitions, observation arrives, controller transitions. Total reward shown below.

Step 0 | Node: x1 | Total reward: 0

Speed400ms

What to notice: The controller uses "listen" when in node x1 (default mode) to gather information, then "feed" in node x2 (urgent mode) when the baby has cried. It doesn't know the true state (hungry/not), but the observation signal (crying/quiet) through the 2-node FSC captures the essential memory needed. Over many steps, the total reward approaches the value predicted by evaluation: V_x1(b₀ = [0.5, 0.5]).

Compare to belief-based execution: A belief-based policy would maintain P(hungry) and update it via Bayes at each step. The FSC achieves similar behavior with no probability computation. The node x1="listen" corresponds loosely to low P(hungry); x2="feed" corresponds to high P(hungry). The controller learns a useful discretization of belief space, even though it never computes beliefs.

An FSC executing on a POMDP never updates a belief distribution. How does it handle uncertainty about the state?

Its internal nodes implicitly encode a summary of history sufficient for near-optimal action selection It assumes the state is always the most likely state given the last observation It ignores uncertainty and always takes the action with highest expected immediate reward

Chapter 8: Connections

Finite state controllers sit at the intersection of several important areas. Understanding these connections reveals when FSCs are the right tool and when to use something else.

FSCs ↔ Recurrent Neural Networks: A stochastic FSC with a large node set and softmax distributions is equivalent to a recurrent neural network with a discrete hidden state. When the number of nodes grows large and the distributions become continuous, FSCs smoothly transition into RNNs for POMDP problems. This is the theoretical bridge between classical POMDP solving and modern deep RL.

FSCs ↔ Alpha vectors: An n-node FSC produces exactly n alpha vectors (one per node). The value function U_FSC(b) = max_x V_x·b. This is exactly the PWLC form from Chapter 20. Optimizing an FSC is equivalent to finding a set of n alpha vectors that: (1) come from a feasible controller policy (not just any hyperplanes), and (2) maximize V at the initial belief. This is a constrained version of the alpha-vector optimization.

FSCs ↔ Memory-based RL: The FSC optimization with gradient ascent is essentially model-based RL with a fixed memory architecture. The nodes are the memory cells, ψ is the action head, η is the transition head. Training with gradient ascent uses the true model (T, O, R). Without the model, you'd use policy gradient methods with rollouts — the FSC becomes a recurrent policy trained by REINFORCE.

Approach	Runtime complexity	Requires model?	Policy type
Belief-based (PBVI)	O(\|S\|) per step	Yes (T, O at runtime)	b → a
Online (POMCP)	O(n_sims) per step	Generative model	h → a (history)
FSC (n nodes)	O(1) per step	No (only at train time)	x → a (node)
RNN (continuous)	O(\|h\|²) per step	No (model-free)	h_t → a (hidden state)

A finite state controller with n nodes corresponds to a PWLC value function with how many alpha vectors?

Exactly n (one per node, each giving V_x(s) for all s) n² (one per node pair) n·|A| (one per node-action pair)

Chapter 9: Summary

Finite state controllers offer a compelling alternative to belief-based POMDP policies. Here's the complete picture of what we covered:

Topic	Key result
FSC structure	Tuple (X, ψ, η): nodes, action dist, transition dist
Evaluation	Linear system of size \|X\|\|S\|; produces n alpha vectors
Policy iteration	Evaluate → improve (LP) → repeat; convergence guaranteed
Node expansion	Add nodes when stuck; each expansion improves value
NLP formulation	Joint optimization of (ψ, η, V); bilinear, NP-hard in general
Gradient ascent	Differentiate through Bellman system; subject to local optima
Runtime	O(1) per step: no belief update, no model needed

When to use FSCs: (1) When runtime resources are severely constrained (embedded systems). (2) When the POMDP model is not available at execution time. (3) When you need an interpretable policy (small FSCs can be read and reasoned about). (4) When you want a complement to online methods: use offline FSC optimization as a starting point, then refine with online planning from the current belief.

The broader lesson: FSCs are the simplest member of a family of memory-based policy representations. The family includes: FSCs (discrete, finite memory), RNNs (continuous, bounded memory), transformers (attention-based memory), and model-based agents (unbounded memory via a world model). As you increase the expressive power of the memory, you increase the approximation quality but also the computational and sample complexity of optimization. FSCs occupy the "smallest viable" end of this spectrum.

The POMDP trilogy: Chapter 20 (exact) → Chapter 21 (offline approx) → Chapter 22 (online) all require maintaining belief at runtime. Chapter 23 (controllers) breaks this pattern: the controller embeds the "belief" into its internal state, computed once offline by optimization, and requires no Bayesian inference at runtime. All four approaches are valid; the right choice depends on problem structure, resource constraints, and interpretability requirements.

FSC policy iteration alternates between evaluation and improvement. What structure does the improvement step have?

A linear program (LP), because V is fixed from evaluation so the objective is linear in (ψ, η) A quadratic program (QP), because ψ and η interact nonlinearly A dynamic program, same as MDP value iteration