Ronald J. Williams, 1992

REINFORCE

Simple statistical gradient-following algorithms for connectionist reinforcement learning — the foundational policy gradient method that learns by correlating rewards with the log-probability of actions taken.

Prerequisites: Probability basics + Gradient descent
10
Chapters
5+
Simulations

Chapter 0: The Problem

You have a network. It takes inputs and produces outputs. But unlike supervised learning, nobody tells you the correct output. Instead, you get a single number: a reward. A scalar reinforcement signal r that says "that was good" or "that was bad" — but not why, and not how to improve.

This is the reinforcement learning problem. And in 1992, when Williams wrote this paper, the fundamental question was: how do you compute a gradient when the only feedback is a scalar reward?

In supervised learning, you have a target y* and can compute ∇L = ∇||y - y*||². Easy. But in RL, there's no target. The output is drawn stochastically from a distribution, and the reward depends on what the environment does in response. How do you take a gradient through randomness?

The credit assignment problem: When a network of stochastic units produces an output and gets reward r, which unit's randomness was responsible? Did unit 3's coin flip help or hurt? The reward gives you one number for the entire network — you need to figure out what each weight contributed. This is the problem REINFORCE solves.

Existing approaches in 1992 had painful limitations. Model-based methods learned an explicit model of the reward function — expensive and complex. Systematic search methods like A* only worked in specific settings. What Williams wanted was something simple: an algorithm that follows the gradient of expected reward without ever explicitly computing that gradient or storing information to estimate it.

What makes reinforcement learning fundamentally harder than supervised learning from a gradient computation perspective?

Chapter 1: The Key Insight

The trick is beautiful in its simplicity. We want to maximize E{r | W} — the expected reward given the current weights W. We can't compute ∇WE{r | W} directly because the reward function is unknown and the outputs are stochastic.

But we can get an unbiased estimate of this gradient from a single trial. Here's how.

Each unit i produces output yi drawn from a distribution gi(yi, wi, xi) parameterized by its weights wi and inputs xi. After the whole network produces its output, we observe reward r. The REINFORCE update rule is:

Δwij = αij (r − bij) · ∂ln gi / ∂wij

That's the entire algorithm. Three pieces multiplied together:

  1. αij — a non-negative learning rate
  2. (r − bij) — the reward minus a baseline (how much better or worse than expected)
  3. ∂ln gi / ∂wij — the "characteristic eligibility" (how much wij influenced the randomness)
The acronym tells the story: REINFORCE = REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility. The weight update is the product of a positive learning rate, the reward offset from baseline, and a term capturing how the weight influenced the stochastic output.

The magic: Williams proves that this simple update, averaged over all possible stochastic outcomes, points in the direction of ∇WE{r | W}. It statistically climbs the gradient of expected reward without ever computing that gradient explicitly.

This is the log-derivative trick (also called the score function estimator or likelihood ratio trick). It's the foundation of every modern policy gradient algorithm — PPO, A3C, TRPO all trace their lineage directly to this equation.

What does the term ∂ln gi / ∂wij (the "characteristic eligibility") capture?

Chapter 2: Stochastic Units

REINFORCE works with any unit that produces output stochastically from a differentiable distribution. The paper explores several types, but the key ones are:

Bernoulli semilinear units

The simplest: output yi ∈ {0, 1}. The probability of output 1 is computed by a squashing function applied to the weighted sum of inputs:

pi = f(si),   where   si = wiTxi = ∑j wijxj

With the logistic squashing function f(s) = 1/(1 + e−s), this is called a Bernoulli-logistic unit. Its characteristic eligibility works out to a remarkably clean form:

∂ln gi / ∂wij = (yi − pi) xj

The elegance: the eligibility is just (actual output minus expected output) times the input. If the unit fired (yi=1) when it wasn't likely to (pi low), the eligibility is large and positive. If it didn't fire when it was likely to, the eligibility is large and negative.

Deriving the eligibility step by step: For the Bernoulli-logistic unit: (1) gi = piyi(1−pi)1−yi, so ln gi = yi ln pi + (1−yi) ln(1−pi). (2) ∂ln gi/∂pi = (yi−pi)/(pi(1−pi)). (3) For the logistic, f′(si) = pi(1−pi). (4) By chain rule: ∂ln gi/∂wij = (yi−pi)/(pi(1−pi)) · pi(1−pi) · xj = (yi−pi)xj. The logistic's derivative perfectly cancels the Bernoulli denominator.

Gaussian units

Output yi ∈ ℝ, drawn from N(μ, σ²). Two adaptable parameters: the mean μ and standard deviation σ. The characteristic eligibilities are:

∂ln g / ∂μ = (y − μ) / σ²
∂ln g / ∂σ = ((y − μ)² − σ²) / σ³

The μ eligibility has the same "actual minus expected" form as the Bernoulli case — this turns out to be true for all exponential family distributions (Proposition 1 in the paper).

Stochastic Units: Bernoulli vs Gaussian

Click "Sample" to draw outputs from each unit type. The eligibility (how much the weight should change) depends on where the sample falls relative to the expected value.

Click to sample
For a Bernoulli-logistic unit, the characteristic eligibility ∂ln gi/∂wij = (yi − pi)xj. Why does the logistic function's derivative cancel so cleanly?

Chapter 3: The REINFORCE Theorem

The central result of the paper — Theorem 1 — establishes that REINFORCE algorithms are statistical gradient-followers. Let's build the proof.

What we want to show

That E{ΔW | W} (the expected weight update) points in the same direction as ∇WE{r | W} (the gradient of expected reward). More precisely, that their inner product is non-negative, and equals zero only at stationary points.

The proof sketch

Start with the expected reward conditioned on the input xi to unit i:

E{r | W, xi} = ∑ξ E{r | W, xi, yi=ξ} · gi(ξ, wi, xi)

Differentiate both sides with respect to wij. Since E{r | W, xi, yi=ξ} doesn't depend on wij (once yi is fixed, the weight's only influence was through yi):

∂E{r | W, xi} / ∂wij = ∑ξ E{r | W, xi, yi=ξ} · ∂gi/∂wij

Now multiply and divide by gi:

= ∑ξ E{r | W, xi, yi=ξ} · (∂ln gi/∂wij) · gi(ξ)

This is E{r · ∂ln gi/∂wij | W, xi}. And since ∑ (∂gi/∂wij) = 0 (probabilities sum to 1), the baseline bij drops out:

E{(r − bij) · ∂ln gi/∂wij | W, xi} = ∂E{r | W, xi} / ∂wij
The punchline: When αij = α for all i,j: E{ΔW | W} = α ∇WE{r | W}. The expected REINFORCE update IS the gradient of expected reward, scaled by the learning rate. This is an unbiased gradient estimator — no approximation, no model of the reward function needed.

Why the baseline doesn't bias the gradient

The baseline bij can be anything that doesn't depend on the current output yi. Why? Because ∑ξ (∂gi/∂wij) = ∂(∑ gi)/∂wij = ∂1/∂wij = 0. The baseline term gets multiplied by this zero sum and vanishes in expectation. The baseline doesn't change the expected gradient direction — it only changes the variance of the estimate.

Why can we subtract any baseline b (independent of yi) from the reward r without biasing the gradient estimate?

Chapter 4: Characteristic Eligibility

The term eij = ∂ln gi/∂wij is the heart of REINFORCE. Williams calls it the characteristic eligibility — it measures how "eligible" each weight is for credit (or blame) based on the stochastic output that occurred.

The log-derivative trick

Why take the log of gi? Because of a remarkable identity:

θ Ex~p(θ)[f(x)] = Ex~p(θ)[f(x) · ∇θ ln p(x; θ)]

The gradient of an expectation under a parameterized distribution equals the expectation of the function times the gradient of the log-probability. This lets us estimate the gradient by sampling: draw x from p(θ), compute f(x) · ∇ ln p(x; θ), and average. No need to differentiate through f or even know its form.

Eligibility for different unit types

For each unit type, we can compute the eligibility in closed form:

Bernoulli-logistic
eij = (yi − pi) xj
Gaussian (mean μ)
eμ = (y − μ) / σ²
Gaussian (std σ)
eσ = ((y − μ)² − σ²) / σ³
Exponential family
eμ = (y − μ) / σ² (general form, Proposition 1)
The universal pattern: For every exponential family distribution, the eligibility of the mean parameter has the form (y − μ)/σ². Bernoulli, Gaussian, Poisson, exponential — all follow this pattern. The eligibility is always "how surprising was this output?" normalized by the variance. This is Proposition 1 in the paper, and it's why REINFORCE generalizes so broadly.
Characteristic Eligibility Visualization

The eligibility measures how much a weight contributed to the stochastic outcome. High positive eligibility = the weight pushed toward this outcome; high negative = it pushed away. The REINFORCE update multiplies this by the reward signal.

The log-derivative trick converts ∇E[f(x)] into E[f(x) · ∇ ln p(x)]. Why is this useful?

Chapter 5: Baselines

REINFORCE gives an unbiased gradient estimate, but unbiased doesn't mean useful. The variance can be enormous. If rewards are always positive (say, between 0 and 1), the update always reinforces every action — just more for higher rewards. All gradients point the same direction; the signal-to-noise ratio is terrible.

The variance reduction trick

The baseline bij doesn't change the expected gradient (proven in Chapter 3), but it dramatically changes the variance. The update becomes:

Δwij = α (r − b) · ∂ln gi/∂wij

With a good baseline, the term (r − b) is sometimes positive (better than baseline — reinforce) and sometimes negative (worse than baseline — anti-reinforce). This balanced signal has much lower variance than raw r.

Reinforcement comparison

The most common baseline: an exponential moving average of recent rewards:

r̂(t) = γ r(t−1) + (1 − γ) r̂(t−1)

This tracks the "expected" reward, so (r − r̂) represents "how much better or worse than recent experience." The update becomes a Bernoulli-logistic REINFORCE with reinforcement comparison:

Δwij = α (r − r̂)(yi − pi) xj
Why baselines matter — a worked example: Suppose rewards are always in [0.8, 1.0]. Without baseline (b=0): every update reinforces the action taken, with magnitude varying only between 0.8 and 1.0. The gradient signal is drowned in noise. With baseline b=0.9: updates range from −0.1 to +0.1. Now the sign of the update actually depends on whether the action was above or below average — a much clearer learning signal.

The optimal baseline

Williams mentions that the variance-minimizing baseline is not simply the mean reward, but a more complex quantity involving the eligibility: b* = E{r · e²} / E{e²}. This was further studied by Dayan (1990), who found only modest improvements over the simpler mean-reward baseline.

Baseline Effect on Gradient Variance

Drag the baseline slider to see how it affects the variance of the gradient estimate. Rewards are drawn uniformly from [0.5, 1.0]. Without baseline, all updates are positive. With baseline near the mean, updates are balanced around zero with much lower variance.

Baseline b0.00
Why does subtracting a baseline from the reward reduce the variance of the REINFORCE gradient estimate without introducing bias?

Chapter 6: Episodic REINFORCE

So far we've considered immediate reinforcement — one input, one output, one reward. But many real problems involve episodes: a sequence of k timesteps, with a single reward r delivered at the end.

Unfolding in time

Williams handles this by "unfolding" the recurrent network through time — the same idea behind backpropagation through time (BPTT). A network N running for k steps becomes an acyclic network N* with k copies of each unit.

The episodic REINFORCE update accumulates eligibilities across all timesteps:

Δwij = αij (r − bij) ∑t=1k eij(t)

For Bernoulli-logistic units in a recurrent network:

Δwij = α (r − b) ∑t=1k [yi(t) − pi(t)] xj(t−1)
Eligibility as a running accumulator: The sum ∑t eij(t) can be computed online with a single accumulator per weight. Each timestep, add the local eligibility to the accumulator. When the reward arrives, multiply the accumulated eligibility by (r − b) and use that as the weight update. No backward pass, no stored trajectories — just running sums.

The limitation: uniform temporal credit

Williams acknowledges that episodic REINFORCE is "especially slow" because it assigns credit uniformly across all timesteps. If the reward was earned by a good action at timestep 42, all 100 timesteps get equal credit. This is why actor-critic methods (which use temporal difference learning to estimate per-timestep value) later proved much more practical — they solve the temporal credit assignment problem that episodic REINFORCE leaves unresolved.

Theorem 2 proves that episodic REINFORCE also follows the gradient: E{ΔW | W} = α ∇WE{r | W}. The proof uses the same structure as Theorem 1, applied to the unfolded network.

Why is episodic REINFORCE slow compared to actor-critic methods?

Chapter 7: Gaussian Units

One of the paper's most prescient contributions: units with learnable exploration. A Gaussian unit has two adaptable parameters — mean μ and standard deviation σ — and the REINFORCE updates for each have distinct, interpretable behaviors.

Mean update: move toward good outcomes

Δμ = αμ (r − b) (y − μ) / σ²

If y gave high reward and y > μ, shift μ upward (toward y). If y gave low reward and y > μ, shift μ downward (away from y). Standard gradient ascent behavior.

Variance update: control exploration

Δσ = ασ (r − b) ((y − μ)² − σ²) / σ³

This is where it gets interesting. The factor ((y − μ)² − σ²) has two regimes:

Adaptive exploration in action: Consider what happens when μ sits atop a reward peak. Most samples close to μ get high reward → (r−b) positive, ((y−μ)²−σ²) negative → σ decreases. The search narrows around the peak — convergence. But if μ is on a flat region, nearby samples give similar reward, and the occasional far sample finds something better → σ increases. The search broadens — exploration. The unit automatically adjusts its exploration radius based on the reward landscape.
Gaussian Unit: Adaptive Exploration

The Gaussian unit searches a 1D reward landscape. Watch how μ (position) and σ (spread) adapt: σ narrows near peaks and widens on plateaus. Click "Run" to see 100 REINFORCE updates.

μ=0.2, σ=0.3
How does a Gaussian REINFORCE unit automatically control its exploration radius?

Chapter 8: Backprop Compatibility

A key insight: REINFORCE doesn't have to work alone. In networks with deterministic hidden units and stochastic output units, we can use backpropagation through the deterministic parts and REINFORCE only at the stochastic units.

The hybrid algorithm

Consider a network with deterministic hidden layers (using ReLU, sigmoid, etc.) and stochastic output units (Bernoulli, Gaussian). The exploration happens at the output layer; the deterministic layers simply compute features. We can decompose the gradient computation:

∂ln g / ∂wij = ∑k ∈ output (∂ln gk / ∂pk) · (∂pk / ∂wij)

The first term is the REINFORCE eligibility at the output. The second term is computed by standard backpropagation through the deterministic layers. The two combine seamlessly.

Backpropagating through random number generators: Williams also considers a more radical idea. For a Gaussian unit, the output can be written as y = μ + σz where z ~ N(0,1). Since y is a differentiable function of μ and σ (for fixed z), we can backpropagate through the sampling operation. This is exactly the reparameterization trick — rediscovered 20 years later by Kingma & Welling (2013) for VAEs. Williams had the idea first.

This compatibility means REINFORCE can be used as a component in larger systems. Use backprop where you can (deterministic layers), and REINFORCE where you must (stochastic layers). The gradient estimates compose correctly because backpropagation preserves unbiasedness.

Williams describes writing a Gaussian unit's output as y = μ + σz (z ~ N(0,1)) and backpropagating through it. What modern technique does this anticipate?

Chapter 9: Connections

What REINFORCE built on

Learning automata (Narendra & Thathatchar, 1989): The LR-I algorithm for 2-action automata is a special case of REINFORCE. Williams showed this existing algorithm was already doing statistical gradient ascent — it just didn't know it.

Backpropagation (Rumelhart, Hinton & Williams, 1986): REINFORCE extends the gradient descent paradigm from supervised to reinforcement learning, and the paper shows how the two compose seamlessly.

Actor-critic methods (Barto, Sutton & Anderson, 1983): The AR-I (Associative Reward-Inaction) algorithm is exactly REINFORCE applied to Bernoulli-logistic networks. Williams unified existing algorithms under one theoretical umbrella.

What REINFORCE enabled

REINFORCE with baseline became the standard policy gradient method used in every subsequent algorithm. PPO, TRPO, A3C all use this exact gradient estimator — they just add variance reduction techniques and trust region constraints on top.

Actor-critic methods (A2C, A3C, GAE): Replace the single reward with a learned value function baseline V(s), giving per-timestep credit assignment. The advantage A(s,a) = r + γV(s′) − V(s) is the modern version of Williams's (r − b).

PPO (Schulman et al., 2017): Adds a clipped surrogate objective that constrains policy updates, preventing the large step sizes that make vanilla REINFORCE unstable. But the underlying gradient estimate is still ∇ ln π(a|s) · A(s,a) — pure REINFORCE with an advantage baseline.

RLHF (Ouyang et al., 2022): Training language models with human feedback uses REINFORCE-style policy gradients. The reward model provides r, the reference policy provides b, and the log-probability gradient is the eligibility. Williams's 1992 equation powers modern AI alignment.

The legacy: REINFORCE is arguably the most consequential algorithm in reinforcement learning. Its core equation — weight update proportional to reward times log-probability gradient — appears in every modern policy gradient method. From Atari-playing agents to language model alignment to robot control, Williams's 1992 insight that you can estimate the gradient of expected reward through the log-derivative trick is the foundation everything else builds on.

Cheat sheet

Core equation
Δwij = α(r − b) ∂ln gi/∂wij
Theorem 1
E{ΔW | W} = α ∇WE{r | W} — unbiased gradient of expected reward
Bernoulli eligibility
eij = (yi − pi) xj
Baseline
Any b independent of yi reduces variance without bias
Gaussian exploration
μ seeks reward peaks; σ auto-adjusts exploration radius
Which modern technique traces its lineage directly to Williams's REINFORCE?