Simple statistical gradient-following algorithms for connectionist reinforcement learning — the foundational policy gradient method that learns by correlating rewards with the log-probability of actions taken.
You have a network. It takes inputs and produces outputs. But unlike supervised learning, nobody tells you the correct output. Instead, you get a single number: a reward. A scalar reinforcement signal r that says "that was good" or "that was bad" — but not why, and not how to improve.
This is the reinforcement learning problem. And in 1992, when Williams wrote this paper, the fundamental question was: how do you compute a gradient when the only feedback is a scalar reward?
In supervised learning, you have a target y* and can compute ∇L = ∇||y - y*||². Easy. But in RL, there's no target. The output is drawn stochastically from a distribution, and the reward depends on what the environment does in response. How do you take a gradient through randomness?
Existing approaches in 1992 had painful limitations. Model-based methods learned an explicit model of the reward function — expensive and complex. Systematic search methods like A* only worked in specific settings. What Williams wanted was something simple: an algorithm that follows the gradient of expected reward without ever explicitly computing that gradient or storing information to estimate it.
The trick is beautiful in its simplicity. We want to maximize E{r | W} — the expected reward given the current weights W. We can't compute ∇WE{r | W} directly because the reward function is unknown and the outputs are stochastic.
But we can get an unbiased estimate of this gradient from a single trial. Here's how.
Each unit i produces output yi drawn from a distribution gi(yi, wi, xi) parameterized by its weights wi and inputs xi. After the whole network produces its output, we observe reward r. The REINFORCE update rule is:
That's the entire algorithm. Three pieces multiplied together:
The magic: Williams proves that this simple update, averaged over all possible stochastic outcomes, points in the direction of ∇WE{r | W}. It statistically climbs the gradient of expected reward without ever computing that gradient explicitly.
This is the log-derivative trick (also called the score function estimator or likelihood ratio trick). It's the foundation of every modern policy gradient algorithm — PPO, A3C, TRPO all trace their lineage directly to this equation.
REINFORCE works with any unit that produces output stochastically from a differentiable distribution. The paper explores several types, but the key ones are:
The simplest: output yi ∈ {0, 1}. The probability of output 1 is computed by a squashing function applied to the weighted sum of inputs:
With the logistic squashing function f(s) = 1/(1 + e−s), this is called a Bernoulli-logistic unit. Its characteristic eligibility works out to a remarkably clean form:
The elegance: the eligibility is just (actual output minus expected output) times the input. If the unit fired (yi=1) when it wasn't likely to (pi low), the eligibility is large and positive. If it didn't fire when it was likely to, the eligibility is large and negative.
Output yi ∈ ℝ, drawn from N(μ, σ²). Two adaptable parameters: the mean μ and standard deviation σ. The characteristic eligibilities are:
The μ eligibility has the same "actual minus expected" form as the Bernoulli case — this turns out to be true for all exponential family distributions (Proposition 1 in the paper).
Click "Sample" to draw outputs from each unit type. The eligibility (how much the weight should change) depends on where the sample falls relative to the expected value.
The central result of the paper — Theorem 1 — establishes that REINFORCE algorithms are statistical gradient-followers. Let's build the proof.
That E{ΔW | W} (the expected weight update) points in the same direction as ∇WE{r | W} (the gradient of expected reward). More precisely, that their inner product is non-negative, and equals zero only at stationary points.
Start with the expected reward conditioned on the input xi to unit i:
Differentiate both sides with respect to wij. Since E{r | W, xi, yi=ξ} doesn't depend on wij (once yi is fixed, the weight's only influence was through yi):
Now multiply and divide by gi:
This is E{r · ∂ln gi/∂wij | W, xi}. And since ∑ (∂gi/∂wij) = 0 (probabilities sum to 1), the baseline bij drops out:
The baseline bij can be anything that doesn't depend on the current output yi. Why? Because ∑ξ (∂gi/∂wij) = ∂(∑ gi)/∂wij = ∂1/∂wij = 0. The baseline term gets multiplied by this zero sum and vanishes in expectation. The baseline doesn't change the expected gradient direction — it only changes the variance of the estimate.
The term eij = ∂ln gi/∂wij is the heart of REINFORCE. Williams calls it the characteristic eligibility — it measures how "eligible" each weight is for credit (or blame) based on the stochastic output that occurred.
Why take the log of gi? Because of a remarkable identity:
The gradient of an expectation under a parameterized distribution equals the expectation of the function times the gradient of the log-probability. This lets us estimate the gradient by sampling: draw x from p(θ), compute f(x) · ∇ ln p(x; θ), and average. No need to differentiate through f or even know its form.
For each unit type, we can compute the eligibility in closed form:
The eligibility measures how much a weight contributed to the stochastic outcome. High positive eligibility = the weight pushed toward this outcome; high negative = it pushed away. The REINFORCE update multiplies this by the reward signal.
REINFORCE gives an unbiased gradient estimate, but unbiased doesn't mean useful. The variance can be enormous. If rewards are always positive (say, between 0 and 1), the update always reinforces every action — just more for higher rewards. All gradients point the same direction; the signal-to-noise ratio is terrible.
The baseline bij doesn't change the expected gradient (proven in Chapter 3), but it dramatically changes the variance. The update becomes:
With a good baseline, the term (r − b) is sometimes positive (better than baseline — reinforce) and sometimes negative (worse than baseline — anti-reinforce). This balanced signal has much lower variance than raw r.
The most common baseline: an exponential moving average of recent rewards:
This tracks the "expected" reward, so (r − r̂) represents "how much better or worse than recent experience." The update becomes a Bernoulli-logistic REINFORCE with reinforcement comparison:
Williams mentions that the variance-minimizing baseline is not simply the mean reward, but a more complex quantity involving the eligibility: b* = E{r · e²} / E{e²}. This was further studied by Dayan (1990), who found only modest improvements over the simpler mean-reward baseline.
Drag the baseline slider to see how it affects the variance of the gradient estimate. Rewards are drawn uniformly from [0.5, 1.0]. Without baseline, all updates are positive. With baseline near the mean, updates are balanced around zero with much lower variance.
So far we've considered immediate reinforcement — one input, one output, one reward. But many real problems involve episodes: a sequence of k timesteps, with a single reward r delivered at the end.
Williams handles this by "unfolding" the recurrent network through time — the same idea behind backpropagation through time (BPTT). A network N running for k steps becomes an acyclic network N* with k copies of each unit.
The episodic REINFORCE update accumulates eligibilities across all timesteps:
For Bernoulli-logistic units in a recurrent network:
Williams acknowledges that episodic REINFORCE is "especially slow" because it assigns credit uniformly across all timesteps. If the reward was earned by a good action at timestep 42, all 100 timesteps get equal credit. This is why actor-critic methods (which use temporal difference learning to estimate per-timestep value) later proved much more practical — they solve the temporal credit assignment problem that episodic REINFORCE leaves unresolved.
Theorem 2 proves that episodic REINFORCE also follows the gradient: E{ΔW | W} = α ∇WE{r | W}. The proof uses the same structure as Theorem 1, applied to the unfolded network.
One of the paper's most prescient contributions: units with learnable exploration. A Gaussian unit has two adaptable parameters — mean μ and standard deviation σ — and the REINFORCE updates for each have distinct, interpretable behaviors.
If y gave high reward and y > μ, shift μ upward (toward y). If y gave low reward and y > μ, shift μ downward (away from y). Standard gradient ascent behavior.
This is where it gets interesting. The factor ((y − μ)² − σ²) has two regimes:
The Gaussian unit searches a 1D reward landscape. Watch how μ (position) and σ (spread) adapt: σ narrows near peaks and widens on plateaus. Click "Run" to see 100 REINFORCE updates.
A key insight: REINFORCE doesn't have to work alone. In networks with deterministic hidden units and stochastic output units, we can use backpropagation through the deterministic parts and REINFORCE only at the stochastic units.
Consider a network with deterministic hidden layers (using ReLU, sigmoid, etc.) and stochastic output units (Bernoulli, Gaussian). The exploration happens at the output layer; the deterministic layers simply compute features. We can decompose the gradient computation:
The first term is the REINFORCE eligibility at the output. The second term is computed by standard backpropagation through the deterministic layers. The two combine seamlessly.
This compatibility means REINFORCE can be used as a component in larger systems. Use backprop where you can (deterministic layers), and REINFORCE where you must (stochastic layers). The gradient estimates compose correctly because backpropagation preserves unbiasedness.
Learning automata (Narendra & Thathatchar, 1989): The LR-I algorithm for 2-action automata is a special case of REINFORCE. Williams showed this existing algorithm was already doing statistical gradient ascent — it just didn't know it.
Backpropagation (Rumelhart, Hinton & Williams, 1986): REINFORCE extends the gradient descent paradigm from supervised to reinforcement learning, and the paper shows how the two compose seamlessly.
Actor-critic methods (Barto, Sutton & Anderson, 1983): The AR-I (Associative Reward-Inaction) algorithm is exactly REINFORCE applied to Bernoulli-logistic networks. Williams unified existing algorithms under one theoretical umbrella.
REINFORCE with baseline became the standard policy gradient method used in every subsequent algorithm. PPO, TRPO, A3C all use this exact gradient estimator — they just add variance reduction techniques and trust region constraints on top.
Actor-critic methods (A2C, A3C, GAE): Replace the single reward with a learned value function baseline V(s), giving per-timestep credit assignment. The advantage A(s,a) = r + γV(s′) − V(s) is the modern version of Williams's (r − b).
PPO (Schulman et al., 2017): Adds a clipped surrogate objective that constrains policy updates, preventing the large step sizes that make vanilla REINFORCE unstable. But the underlying gradient estimate is still ∇ ln π(a|s) · A(s,a) — pure REINFORCE with an advantage baseline.
RLHF (Ouyang et al., 2022): Training language models with human feedback uses REINFORCE-style policy gradients. The reward model provides r, the reference policy provides b, and the log-probability gradient is the eligibility. Williams's 1992 equation powers modern AI alignment.