What if you never computed a belief? Finite state controllers let you act optimally under partial observability with nothing but a state machine — no Bayes, no probability, just nodes and transitions.
You've just solved a POMDP. The algorithm hands you a policy π(b) → a. At runtime, you observe o, update your belief b, and query the policy. Sounds fine — until you realize that "updating your belief" requires knowing T(s'|s,a) and O(o|a,s') in full, and that a probability distribution over 10,000 states must be maintained exactly at each step.
What if the robot just doesn't know its model precisely? What if the state space is so large that maintaining a full distribution is prohibitive? What if you need to deploy to an embedded system with 256KB of RAM?
Finite state controllers (FSCs) sidestep the belief entirely. A controller maintains its own internal node x ∈ X = {x1, ..., xn}. It selects actions based on x, not on any probability distribution. It transitions x → x' based on the action taken and the observation received. No Bayes. No matrix multiplication. Just a lookup table.
Belief-based policy:
Runtime: maintain b(s) for all s. Update b via Bayes after each (a,o). Query π(b). Memory: O(|S|). Requires full POMDP model at runtime.
Controller policy:
Runtime: maintain current node x. Sample action from ψ(·|x). Transition x via η(·|x,a,o). Memory: O(|X|·|A|·|O|). No model needed at runtime.
A finite state controller is a tuple (X, ψ, η) where:
| Component | Notation | Meaning |
|---|---|---|
| Node set | X = {x1, ..., xn} | Internal states of the controller |
| Action distribution | ψ(a|x) | P(take action a | currently at node x) |
| Successor distribution | η(x'|x, a, o) | P(go to node x' | at x, took a, observed o) |
Execution: at node x, sample action a ~ ψ(·|x). Environment produces observation o. Transition to x' ~ η(·|x, a, o). Repeat forever. The policy is entirely encoded in (ψ, η) — two lookup tables.
For the crying baby problem, a simple 2-node controller captures the intuitive policy:
Node 1 (teal): "default" — take the listen action. If observe crying, jump to Node 2. If quiet, stay. Node 2 (orange): "urgent" — take the feed action. If quiet, return to Node 1. If still crying, stay. Click Animate to watch the controller run for 10 steps.
Given a controller, how much reward does it actually get? We need to compute Vx(s) = expected discounted return starting in state s and controller node x.
The evaluation is defined by a system of linear equations — one per (x, s) pair. For a stochastic controller:
This looks complicated, but for a fixed controller it's just a linear system: |X|·|S| equations in |X|·|S| unknowns. Solve it once (e.g., with Gaussian elimination) and you have the value of the controller from every (x, s) pair.
julia function evaluate_controller(pomdp, fsc) nS, nX = length(states(pomdp)), length(fsc.nodes) # Build linear system: V = R + γ T O η V # Row idx: (x-1)*nS + s. Build (nX*nS) × (nX*nS) matrix. A = zeros(nX*nS, nX*nS) b = zeros(nX*nS) for (xi, x) in enumerate(fsc.nodes) for (si, s) in enumerate(states(pomdp)) row = (xi-1)*nS + si A[row, row] = 1.0 # V[x,s] coefficient exp_r = 0.0 for a in actions(pomdp) pa = fsc.ψ(a, x) exp_r += pa * reward(pomdp, s, a) for (si2, s2) in enumerate(states(pomdp)) for o in observations(pomdp) for (xi2, x2) in enumerate(fsc.nodes) col = (xi2-1)*nS + si2 A[row, col] -= discount(pomdp) * pa * transition(pomdp,s,a,s2) * obs(pomdp,a,s2,o) * fsc.η(x2,x,a,o) end end end end b[row] = exp_r end end V_flat = A \ b return reshape(V_flat, nS, nX) # V[s, x] = value at state s, node x end
We know how to evaluate a given controller. Now: how do we improve it? Policy iteration for FSCs alternates between evaluation and a greedy improvement step — the same structure as MDP policy iteration.
The improvement step: given the current value function Vx(s) and the current controller (ψ, η), find new distributions (ψ*, η*) that maximize the value at each (x, s):
where bx(s) is the probability of being in state s at node x (the stationary distribution over (x,s) pairs). This is a linear program in the controller parameters: maximize a linear objective over the simplex constraints ∑a ψ(a|x)=1 and ∑x' η(x'|x,a,o)=1.
Policy iteration with a fixed n-node controller finds the best policy expressible with n nodes. But what if n is too small? The controller might be fundamentally limited, unable to represent anything near the optimal policy regardless of what (ψ, η) we choose.
Node expansion increases n when the current controller has converged to a local optimum. The idea: add a new node xnew and initialize it with a useful purpose. The most effective strategy is to initialize xnew as a copy of the existing node that is most frequently visited weighted by missed value.
After adding xnew, run policy iteration again. The new node gives the optimization more freedom to represent policies that require more "memory." Repeat expansion until the value increase per expansion is below threshold.
Crying Baby POMDP. X-axis: number of controller nodes n. Y-axis: value at initial belief P(hungry)=0.5. More nodes = better value (up to a point where near-optimal is reached). Click a bar to see the value.
Policy iteration for FSCs requires solving an LP in the improvement step. But LPs assume a fixed evaluation. What if we want to jointly optimize both the evaluation and the improvement, treating (ψ, η, V) as one big optimization problem?
The FSC optimization over a fixed n can be written as a nonlinear program (NLP). The optimization variables are:
The objective: maximize Vx0·b0 = ∑s b0(s) Vx0(s) for an initial belief b0 and starting node x0.
python (NLP formulation sketch) from scipy.optimize import minimize import numpy as np def fsc_nlp(pomdp, n_nodes, b0): # Pack ψ and η into a single parameter vector θ nS, nA, nO = pomdp.nS, pomdp.nA, pomdp.nO nX = n_nodes def unpack(theta): # ψ: (nX, nA) softmax'd per row psi_raw = theta[:nX*nA].reshape(nX, nA) psi = softmax(psi_raw, axis=1) # η: (nX, nA, nO, nX) softmax'd per (x,a,o) eta_raw = theta[nX*nA:].reshape(nX, nA, nO, nX) eta = softmax(eta_raw, axis=3) return psi, eta def objective(theta): psi, eta = unpack(theta) V = evaluate_controller(pomdp, psi, eta) # solve linear system return -np.dot(b0, V[0, :]) # maximize value at node 0 theta0 = np.random.randn(nX*nA + nX*nA*nO*nX) * 0.1 result = minimize(objective, theta0, method='L-BFGS-B') return unpack(result.x)
NLP solvers (L-BFGS, SLSQP) require computing gradients. For FSCs, we need ∂Vx0(b0) / ∂ψ and ∂Vx0(b0) / ∂η. These can be computed by differentiating through the linear system that defines V.
The gradient of V with respect to controller parameters can be derived from the Bellman equations. Let L = I − γP be the matrix defining the linear system LV = r (where r is the expected reward vector and P encodes the transition and controller dynamics). Then:
where θ represents any controller parameter (ψa,x or ηx',x,a,o). This can be computed efficiently by: (1) solving LV = r for V (forward pass), (2) computing the right-hand side, (3) solving L⊤λ = c for an adjoint vector (backward pass). The full gradient costs O(2 · |X||S| linear system solve).
Optimize ψ and η for the 2-node crying baby controller via gradient ascent. X-axis: gradient step. Y-axis: value at belief b=[0.5,0.5]. Click Step to take one gradient step, or Run 50 to converge. Watch the value climb.
Watch a finite state controller execute on the crying baby POMDP over 50 steps. The controller's internal node is shown as the active node in the graph. You can see how the controller responds to crying vs. quiet observations, feeding when appropriate and listening to gather information. Compare with the belief-based policy: both make similar decisions, but the controller never computes a probability.
Left: the FSC graph (2 nodes, 3 actions). Highlighted = current node. Right: the actual world state and observations over time. Each step: controller picks an action, environment transitions, observation arrives, controller transitions. Total reward shown below.
Finite state controllers sit at the intersection of several important areas. Understanding these connections reveals when FSCs are the right tool and when to use something else.
| Approach | Runtime complexity | Requires model? | Policy type |
|---|---|---|---|
| Belief-based (PBVI) | O(|S|) per step | Yes (T, O at runtime) | b → a |
| Online (POMCP) | O(n_sims) per step | Generative model | h → a (history) |
| FSC (n nodes) | O(1) per step | No (only at train time) | x → a (node) |
| RNN (continuous) | O(|h|2) per step | No (model-free) | ht → a (hidden state) |
Finite state controllers offer a compelling alternative to belief-based POMDP policies. Here's the complete picture of what we covered:
| Topic | Key result |
|---|---|
| FSC structure | Tuple (X, ψ, η): nodes, action dist, transition dist |
| Evaluation | Linear system of size |X||S|; produces n alpha vectors |
| Policy iteration | Evaluate → improve (LP) → repeat; convergence guaranteed |
| Node expansion | Add nodes when stuck; each expansion improves value |
| NLP formulation | Joint optimization of (ψ, η, V); bilinear, NP-hard in general |
| Gradient ascent | Differentiate through Bellman system; subject to local optima |
| Runtime | O(1) per step: no belief update, no model needed |