Kochenderfer et al., Chapter 23

Controller Abstractions

What if you never computed a belief? Finite state controllers let you act optimally under partial observability with nothing but a state machine — no Bayes, no probability, just nodes and transitions.

Prerequisites: Chapters 20–22. POMDP basics, alpha vectors, offline and online planning.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why Controllers?

You've just solved a POMDP. The algorithm hands you a policy π(b) → a. At runtime, you observe o, update your belief b, and query the policy. Sounds fine — until you realize that "updating your belief" requires knowing T(s'|s,a) and O(o|a,s') in full, and that a probability distribution over 10,000 states must be maintained exactly at each step.

What if the robot just doesn't know its model precisely? What if the state space is so large that maintaining a full distribution is prohibitive? What if you need to deploy to an embedded system with 256KB of RAM?

Finite state controllers (FSCs) sidestep the belief entirely. A controller maintains its own internal node x ∈ X = {x1, ..., xn}. It selects actions based on x, not on any probability distribution. It transitions x → x' based on the action taken and the observation received. No Bayes. No matrix multiplication. Just a lookup table.

The bet: You're betting that a small finite set of nodes can capture enough "memory" of past observations to act near-optimally. This is a lossy compression of history. The payoff: zero runtime complexity beyond a table lookup. The risk: finite nodes may not be enough for some problems.

Belief-based policy:

Runtime: maintain b(s) for all s. Update b via Bayes after each (a,o). Query π(b). Memory: O(|S|). Requires full POMDP model at runtime.

Controller policy:

Runtime: maintain current node x. Sample action from ψ(·|x). Transition x via η(·|x,a,o). Memory: O(|X|·|A|·|O|). No model needed at runtime.

Controllers generalize conditional plans: A conditional plan (Chapter 20) is a tree — it grows exponentially with depth and can only represent finite-horizon policies. A controller is a directed graph with cycles. A 3-node controller can represent an infinite-horizon policy that would require an infinite tree. This is the same relationship as FSMs vs. parse trees in formal language theory.
What is the primary runtime advantage of a finite state controller over a belief-based policy?

Chapter 1: FSC Structure

A finite state controller is a tuple (X, ψ, η) where:

ComponentNotationMeaning
Node setX = {x1, ..., xn}Internal states of the controller
Action distributionψ(a|x)P(take action a | currently at node x)
Successor distributionη(x'|x, a, o)P(go to node x' | at x, took a, observed o)

Execution: at node x, sample action a ~ ψ(·|x). Environment produces observation o. Transition to x' ~ η(·|x, a, o). Repeat forever. The policy is entirely encoded in (ψ, η) — two lookup tables.

For the crying baby problem, a simple 2-node controller captures the intuitive policy:

Crying Baby FSC (2-node example)

Node 1 (teal): "default" — take the listen action. If observe crying, jump to Node 2. If quiet, stay. Node 2 (orange): "urgent" — take the feed action. If quiet, return to Node 1. If still crying, stay. Click Animate to watch the controller run for 10 steps.

Step 0 | x=1 | —
Stochastic controllers matter for optimization: Both ψ and η are probability distributions, not deterministic functions. This is crucial for optimization: it makes the policy space continuous and differentiable. A deterministic controller has discrete choices (hard to optimize with gradients). A stochastic controller has real-valued parameters in [0,1] (amenable to gradient ascent and nonlinear programming).
Special case — deterministic controller: When ψ(a|x) ∈ {0,1} and η(x'|x,a,o) ∈ {0,1}, the controller is a deterministic FSM. At each node x, it always takes the same action and always transitions to the same next node given (a,o). Deterministic controllers are simpler to interpret but harder to optimize (the parameter space is discrete).
In an FSC, what determines which action the agent takes at each step?

Chapter 2: Evaluating FSCs

Given a controller, how much reward does it actually get? We need to compute Vx(s) = expected discounted return starting in state s and controller node x.

The evaluation is defined by a system of linear equations — one per (x, s) pair. For a stochastic controller:

Vx(s) = ∑a ψ(a|x) &left;[ R(s,a) + γ ∑s' T(s'|s,a) ∑o O(o|a,s') ∑x' η(x'|x,a,o) Vx'(s') &right;]

This looks complicated, but for a fixed controller it's just a linear system: |X|·|S| equations in |X|·|S| unknowns. Solve it once (e.g., with Gaussian elimination) and you have the value of the controller from every (x, s) pair.

From values to belief values: Given the per-(x,s) values Vx(s) and an initial belief b0, the expected value of starting the controller in node xinit is: Vxinit(b0) = ∑s b0(s) Vxinit(s). This is a dot product — each column of V gives an alpha vector! So an n-node controller produces exactly n alpha vectors. This links controllers directly to the alpha-vector machinery from Chapters 20-21.
The computational benefit of evaluation: Evaluating a fixed controller takes O((|X||S|)2) for Gaussian elimination — polynomial in controller size. Evaluating the optimal value function over the belief simplex is PSPACE-complete. So controllers give a polynomial-time computable lower bound on U*(b).
julia
function evaluate_controller(pomdp, fsc)
    nS, nX = length(states(pomdp)), length(fsc.nodes)
    # Build linear system: V = R + γ T O η V
    # Row idx: (x-1)*nS + s. Build (nX*nS) × (nX*nS) matrix.
    A = zeros(nX*nS, nX*nS)
    b = zeros(nX*nS)

    for (xi, x) in enumerate(fsc.nodes)
        for (si, s) in enumerate(states(pomdp))
            row = (xi-1)*nS + si
            A[row, row] = 1.0  # V[x,s] coefficient
            exp_r = 0.0
            for a in actions(pomdp)
                pa = fsc.ψ(a, x)
                exp_r += pa * reward(pomdp, s, a)
                for (si2, s2) in enumerate(states(pomdp))
                    for o in observations(pomdp)
                        for (xi2, x2) in enumerate(fsc.nodes)
                            col = (xi2-1)*nS + si2
                            A[row, col] -= discount(pomdp) * pa *
                                transition(pomdp,s,a,s2) *
                                obs(pomdp,a,s2,o) *
                                fsc.η(x2,x,a,o)
                        end
                    end
                end
            end
            b[row] = exp_r
        end
    end
    V_flat = A \ b
    return reshape(V_flat, nS, nX)  # V[s, x] = value at state s, node x
end
A finite state controller with n nodes produces how many alpha vectors?

Chapter 3: Policy Iteration for Controllers

We know how to evaluate a given controller. Now: how do we improve it? Policy iteration for FSCs alternates between evaluation and a greedy improvement step — the same structure as MDP policy iteration.

The improvement step: given the current value function Vx(s) and the current controller (ψ, η), find new distributions (ψ*, η*) that maximize the value at each (x, s):

(ψ*, η*) = argmaxψ, ηx, s bx(s) Vx(s)

where bx(s) is the probability of being in state s at node x (the stationary distribution over (x,s) pairs). This is a linear program in the controller parameters: maximize a linear objective over the simplex constraints ∑a ψ(a|x)=1 and ∑x' η(x'|x,a,o)=1.

Initialize
Start with a random or heuristic controller (ψ(0), η(0)). Fix n = |X|.
Evaluate
Solve the linear system to get Vx(s) for all (x,s).
Improve
Solve an LP to find (ψ*, η*) that maximize weighted value.
Check convergence
If value improved by less than ε, stop. Otherwise, update (ψ,η) := (ψ*,η*) and repeat.
Policy iteration converges: Each improvement step is guaranteed to be non-decreasing in value. Since the controller parameter space is compact, and the value is bounded above by U*(b), the algorithm must converge. In practice, it converges quickly — often in 5-15 iterations for small problems.
The LP formulation: The improvement step is a linear program because the value function Vx(s) is fixed (from the evaluation step), so the objective ∑x,s bx(s) Vx(s) is linear in the occupancy weights. Maximizing a linear function over the product of simplices is always a linear program, solvable in polynomial time.
Why is FSC policy iteration guaranteed to converge?

Chapter 4: Node Expansion

Policy iteration with a fixed n-node controller finds the best policy expressible with n nodes. But what if n is too small? The controller might be fundamentally limited, unable to represent anything near the optimal policy regardless of what (ψ, η) we choose.

Node expansion increases n when the current controller has converged to a local optimum. The idea: add a new node xnew and initialize it with a useful purpose. The most effective strategy is to initialize xnew as a copy of the existing node that is most frequently visited weighted by missed value.

After adding xnew, run policy iteration again. The new node gives the optimization more freedom to represent policies that require more "memory." Repeat expansion until the value increase per expansion is below threshold.

Why start small? Starting with n=1 and expanding is better than starting with a large n because: (1) small controllers are faster to optimize (fewer parameters), (2) smaller controllers are easier to interpret, and (3) greedy expansion adds nodes where they help most, rather than allocating n nodes all at once and hoping they find useful roles. This is analogous to neuron growth in neural network architectures.
Controller Size vs. Policy Quality

Crying Baby POMDP. X-axis: number of controller nodes n. Y-axis: value at initial belief P(hungry)=0.5. More nodes = better value (up to a point where near-optimal is reached). Click a bar to see the value.

Click to run
Optimal number of nodes: There's always a minimum n above which more nodes don't help. For the crying baby problem, n=2 or n=3 achieves near-optimal performance. For richer problems (e.g., robot navigation with many landmarks), you might need n=20-50. The key question is not "how many nodes" but "how much history do I need to remember?"
Connection to alpha vectors: An n-node controller produces n alpha vectors {Vx1, ..., Vxn}. The value at belief b is maxx Vx·b. So evaluating a controller is equivalent to running PBVI with a belief-dependent starting node. Node expansion is equivalent to adding a new alpha vector to Γ. The two approaches are dual.
Node expansion adds a new controller node when policy iteration has converged. Why does this help?

Chapter 5: Nonlinear Programming Formulation

Policy iteration for FSCs requires solving an LP in the improvement step. But LPs assume a fixed evaluation. What if we want to jointly optimize both the evaluation and the improvement, treating (ψ, η, V) as one big optimization problem?

The FSC optimization over a fixed n can be written as a nonlinear program (NLP). The optimization variables are:

The objective: maximize Vx0·b0 = ∑s b0(s) Vx0(s) for an initial belief b0 and starting node x0.

Why nonlinear? The Bellman constraint that V must satisfy is nonlinear in (ψ, η, V): it multiplies ψ and η with V. The constraint: Vx,s = ∑a ψa,x[R(s,a) + γ∑s',o,x' T(s'|s,a)O(o|a,s')ηx',x,a,oVx',s']. This is bilinear in (ψ,V) and (η,V) — hence NLP, not LP.
Bilinear programs: The NLP here is specifically a bilinear program: products of two sets of variables. Bilinear programs are NP-hard in general, but the structure here (compact constraint set, well-conditioned Bellman equations) makes gradient-based methods work well in practice. The local optima correspond to locally-good controllers that might not be globally optimal.
python (NLP formulation sketch)
from scipy.optimize import minimize
import numpy as np

def fsc_nlp(pomdp, n_nodes, b0):
    # Pack ψ and η into a single parameter vector θ
    nS, nA, nO = pomdp.nS, pomdp.nA, pomdp.nO
    nX = n_nodes

    def unpack(theta):
        # ψ: (nX, nA) softmax'd per row
        psi_raw = theta[:nX*nA].reshape(nX, nA)
        psi = softmax(psi_raw, axis=1)
        # η: (nX, nA, nO, nX) softmax'd per (x,a,o)
        eta_raw = theta[nX*nA:].reshape(nX, nA, nO, nX)
        eta = softmax(eta_raw, axis=3)
        return psi, eta

    def objective(theta):
        psi, eta = unpack(theta)
        V = evaluate_controller(pomdp, psi, eta)  # solve linear system
        return -np.dot(b0, V[0, :])  # maximize value at node 0

    theta0 = np.random.randn(nX*nA + nX*nA*nO*nX) * 0.1
    result = minimize(objective, theta0, method='L-BFGS-B')
    return unpack(result.x)
The FSC optimization is a nonlinear program rather than a linear program because:

Chapter 6: Gradient Ascent for Controllers

NLP solvers (L-BFGS, SLSQP) require computing gradients. For FSCs, we need ∂Vx0(b0) / ∂ψ and ∂Vx0(b0) / ∂η. These can be computed by differentiating through the linear system that defines V.

The gradient of V with respect to controller parameters can be derived from the Bellman equations. Let L = I − γP be the matrix defining the linear system LV = r (where r is the expected reward vector and P encodes the transition and controller dynamics). Then:

∂V / ∂θ = L−1 ∂r / ∂θ − L−1 (∂P / ∂θ) V

where θ represents any controller parameter (ψa,x or ηx',x,a,o). This can be computed efficiently by: (1) solving LV = r for V (forward pass), (2) computing the right-hand side, (3) solving Lλ = c for an adjoint vector (backward pass). The full gradient costs O(2 · |X||S| linear system solve).

Gradient Ascent on a 2-Node Controller

Optimize ψ and η for the 2-node crying baby controller via gradient ascent. X-axis: gradient step. Y-axis: value at belief b=[0.5,0.5]. Click Step to take one gradient step, or Run 50 to converge. Watch the value climb.

Step 0 | V(b=0.5) = —
Learning rate α0.020
Local optima are a real problem: Gradient ascent converges to local optima, not global. For FSCs, the landscape can have many local optima corresponding to different "strategies" encoded by the nodes. Running gradient ascent from multiple random initializations and keeping the best result (multi-start optimization) is the standard approach. Adding nodes and re-optimizing also helps escape shallow local optima.
Natural gradient: The standard gradient treats all parameters equally. The natural gradient scales by the Fisher information matrix, accounting for the fact that small changes in ψ near the boundary (e.g., ψ=0.01) have different effects than near the center (e.g., ψ=0.5). Natural gradient methods converge faster for FSC optimization, at the cost of computing the Fisher matrix.
Gradient ascent for FSC optimization is subject to local optima. What is the standard remedy?

Chapter 7: SHOWCASE — FSC Running Live

Watch a finite state controller execute on the crying baby POMDP over 50 steps. The controller's internal node is shown as the active node in the graph. You can see how the controller responds to crying vs. quiet observations, feeding when appropriate and listening to gather information. Compare with the belief-based policy: both make similar decisions, but the controller never computes a probability.

FSC Execution on Crying Baby POMDP

Left: the FSC graph (2 nodes, 3 actions). Highlighted = current node. Right: the actual world state and observations over time. Each step: controller picks an action, environment transitions, observation arrives, controller transitions. Total reward shown below.

Step 0 | Node: x1 | Total reward: 0
Speed400ms
What to notice: The controller uses "listen" when in node x1 (default mode) to gather information, then "feed" in node x2 (urgent mode) when the baby has cried. It doesn't know the true state (hungry/not), but the observation signal (crying/quiet) through the 2-node FSC captures the essential memory needed. Over many steps, the total reward approaches the value predicted by evaluation: Vx1(b0 = [0.5, 0.5]).
Compare to belief-based execution: A belief-based policy would maintain P(hungry) and update it via Bayes at each step. The FSC achieves similar behavior with no probability computation. The node x1="listen" corresponds loosely to low P(hungry); x2="feed" corresponds to high P(hungry). The controller learns a useful discretization of belief space, even though it never computes beliefs.
An FSC executing on a POMDP never updates a belief distribution. How does it handle uncertainty about the state?

Chapter 8: Connections

Finite state controllers sit at the intersection of several important areas. Understanding these connections reveals when FSCs are the right tool and when to use something else.

FSCs ↔ Recurrent Neural Networks: A stochastic FSC with a large node set and softmax distributions is equivalent to a recurrent neural network with a discrete hidden state. When the number of nodes grows large and the distributions become continuous, FSCs smoothly transition into RNNs for POMDP problems. This is the theoretical bridge between classical POMDP solving and modern deep RL.
FSCs ↔ Alpha vectors: An n-node FSC produces exactly n alpha vectors (one per node). The value function UFSC(b) = maxx Vx·b. This is exactly the PWLC form from Chapter 20. Optimizing an FSC is equivalent to finding a set of n alpha vectors that: (1) come from a feasible controller policy (not just any hyperplanes), and (2) maximize V at the initial belief. This is a constrained version of the alpha-vector optimization.
FSCs ↔ Memory-based RL: The FSC optimization with gradient ascent is essentially model-based RL with a fixed memory architecture. The nodes are the memory cells, ψ is the action head, η is the transition head. Training with gradient ascent uses the true model (T, O, R). Without the model, you'd use policy gradient methods with rollouts — the FSC becomes a recurrent policy trained by REINFORCE.
ApproachRuntime complexityRequires model?Policy type
Belief-based (PBVI)O(|S|) per stepYes (T, O at runtime)b → a
Online (POMCP)O(n_sims) per stepGenerative modelh → a (history)
FSC (n nodes)O(1) per stepNo (only at train time)x → a (node)
RNN (continuous)O(|h|2) per stepNo (model-free)ht → a (hidden state)
A finite state controller with n nodes corresponds to a PWLC value function with how many alpha vectors?

Chapter 9: Summary

Finite state controllers offer a compelling alternative to belief-based POMDP policies. Here's the complete picture of what we covered:

TopicKey result
FSC structureTuple (X, ψ, η): nodes, action dist, transition dist
EvaluationLinear system of size |X||S|; produces n alpha vectors
Policy iterationEvaluate → improve (LP) → repeat; convergence guaranteed
Node expansionAdd nodes when stuck; each expansion improves value
NLP formulationJoint optimization of (ψ, η, V); bilinear, NP-hard in general
Gradient ascentDifferentiate through Bellman system; subject to local optima
RuntimeO(1) per step: no belief update, no model needed
When to use FSCs: (1) When runtime resources are severely constrained (embedded systems). (2) When the POMDP model is not available at execution time. (3) When you need an interpretable policy (small FSCs can be read and reasoned about). (4) When you want a complement to online methods: use offline FSC optimization as a starting point, then refine with online planning from the current belief.
The broader lesson: FSCs are the simplest member of a family of memory-based policy representations. The family includes: FSCs (discrete, finite memory), RNNs (continuous, bounded memory), transformers (attention-based memory), and model-based agents (unbounded memory via a world model). As you increase the expressive power of the memory, you increase the approximation quality but also the computational and sample complexity of optimization. FSCs occupy the "smallest viable" end of this spectrum.
The POMDP trilogy: Chapter 20 (exact) → Chapter 21 (offline approx) → Chapter 22 (online) all require maintaining belief at runtime. Chapter 23 (controllers) breaks this pattern: the controller embeds the "belief" into its internal state, computed once offline by optimization, and requires no Bayesian inference at runtime. All four approaches are valid; the right choice depends on problem structure, resource constraints, and interpretability requirements.
FSC policy iteration alternates between evaluation and improvement. What structure does the improvement step have?