REINFORCE — Veanors

Chapter 0: The Problem

You have a network. It takes inputs and produces outputs. But unlike supervised learning, nobody tells you the correct output. Instead, you get a single number: a reward. A scalar reinforcement signal r that says "that was good" or "that was bad" — but not why, and not how to improve.

This is the reinforcement learning problem. And in 1992, when Williams wrote this paper, the fundamental question was: how do you compute a gradient when the only feedback is a scalar reward?

In supervised learning, you have a target y* and can compute ∇L = ∇||y - y*||². Easy. But in RL, there's no target. The output is drawn stochastically from a distribution, and the reward depends on what the environment does in response. How do you take a gradient through randomness?

The credit assignment problem: When a network of stochastic units produces an output and gets reward r, which unit's randomness was responsible? Did unit 3's coin flip help or hurt? The reward gives you one number for the entire network — you need to figure out what each weight contributed. This is the problem REINFORCE solves.

Existing approaches in 1992 had painful limitations. Model-based methods learned an explicit model of the reward function — expensive and complex. Systematic search methods like A* only worked in specific settings. What Williams wanted was something simple: an algorithm that follows the gradient of expected reward without ever explicitly computing that gradient or storing information to estimate it.

What makes reinforcement learning fundamentally harder than supervised learning from a gradient computation perspective?

There's no target output — only a scalar reward — so you can't directly compute the gradient of a loss function The networks are larger The data is not i.i.d.

Chapter 1: The Key Insight

The trick is beautiful in its simplicity. We want to maximize E{r | W} — the expected reward given the current weights W. We can't compute ∇_WE{r | W} directly because the reward function is unknown and the outputs are stochastic.

But we can get an unbiased estimate of this gradient from a single trial. Here's how.

Each unit i produces output y_i drawn from a distribution g_i(y_i, w_i, x_i) parameterized by its weights w_i and inputs x_i. After the whole network produces its output, we observe reward r. The REINFORCE update rule is:

Δw_ij = α_ij (r − b_ij) · ∂ln g_i / ∂w_ij

That's the entire algorithm. Three pieces multiplied together:

α_ij — a non-negative learning rate
(r − b_ij) — the reward minus a baseline (how much better or worse than expected)
∂ln g_i / ∂w_ij — the "characteristic eligibility" (how much w_ij influenced the randomness)

The acronym tells the story: REINFORCE = REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility. The weight update is the product of a positive learning rate, the reward offset from baseline, and a term capturing how the weight influenced the stochastic output.

The magic: Williams proves that this simple update, averaged over all possible stochastic outcomes, points in the direction of ∇_WE{r | W}. It statistically climbs the gradient of expected reward without ever computing that gradient explicitly.

This is the log-derivative trick (also called the score function estimator or likelihood ratio trick). It's the foundation of every modern policy gradient algorithm — PPO, A3C, TRPO all trace their lineage directly to this equation.

What does the term ∂ln g_i / ∂w_ij (the "characteristic eligibility") capture?

How much the weight w_ij influenced the probability of the stochastic output that was actually produced The expected value of the reward The gradient of the reward function

Chapter 2: Stochastic Units

REINFORCE works with any unit that produces output stochastically from a differentiable distribution. The paper explores several types, but the key ones are:

Bernoulli semilinear units

The simplest: output y_i ∈ {0, 1}. The probability of output 1 is computed by a squashing function applied to the weighted sum of inputs:

p_i = f(s_i), where s_i = w_i^Tx_i = ∑_j w_ijx_j

With the logistic squashing function f(s) = 1/(1 + e^−s), this is called a Bernoulli-logistic unit. Its characteristic eligibility works out to a remarkably clean form:

∂ln g_i / ∂w_ij = (y_i − p_i) x_j

The elegance: the eligibility is just (actual output minus expected output) times the input. If the unit fired (y_i=1) when it wasn't likely to (p_i low), the eligibility is large and positive. If it didn't fire when it was likely to, the eligibility is large and negative.

Deriving the eligibility step by step: For the Bernoulli-logistic unit: (1) g_i = p_i^y_i(1−p_i)^1−y_i, so ln g_i = y_i ln p_i + (1−y_i) ln(1−p_i). (2) ∂ln g_i/∂p_i = (y_i−p_i)/(p_i(1−p_i)). (3) For the logistic, f′(s_i) = p_i(1−p_i). (4) By chain rule: ∂ln g_i/∂w_ij = (y_i−p_i)/(p_i(1−p_i)) · p_i(1−p_i) · x_j = (y_i−p_i)x_j. The logistic's derivative perfectly cancels the Bernoulli denominator.

Gaussian units

Output y_i ∈ ℝ, drawn from N(μ, σ²). Two adaptable parameters: the mean μ and standard deviation σ. The characteristic eligibilities are:

∂ln g / ∂μ = (y − μ) / σ²

∂ln g / ∂σ = ((y − μ)² − σ²) / σ³

The μ eligibility has the same "actual minus expected" form as the Bernoulli case — this turns out to be true for all exponential family distributions (Proposition 1 in the paper).

Stochastic Units: Bernoulli vs Gaussian

Click "Sample" to draw outputs from each unit type. The eligibility (how much the weight should change) depends on where the sample falls relative to the expected value.

Click to sample

For a Bernoulli-logistic unit, the characteristic eligibility ∂ln g_i/∂w_ij = (y_i − p_i)x_j. Why does the logistic function's derivative cancel so cleanly?

Because f′(s) = p(1−p) for the logistic, which exactly cancels the p(1−p) denominator from ∂ln g/∂p Because the logistic output is always between 0 and 1 Because the logistic is monotonic

Chapter 3: The REINFORCE Theorem

The central result of the paper — Theorem 1 — establishes that REINFORCE algorithms are statistical gradient-followers. Let's build the proof.

What we want to show

That E{ΔW | W} (the expected weight update) points in the same direction as ∇_WE{r | W} (the gradient of expected reward). More precisely, that their inner product is non-negative, and equals zero only at stationary points.

The proof sketch

Start with the expected reward conditioned on the input x_i to unit i:

E{r | W, x_i} = ∑_ξ E{r | W, x_i, y_i=ξ} · g_i(ξ, w_i, x_i)

Differentiate both sides with respect to w_ij. Since E{r | W, x_i, y_i=ξ} doesn't depend on w_ij (once y_i is fixed, the weight's only influence was through y_i):

∂E{r | W, x_i} / ∂w_ij = ∑_ξ E{r | W, x_i, y_i=ξ} · ∂g_i/∂w_ij

Now multiply and divide by g_i:

= ∑_ξ E{r | W, x_i, y_i=ξ} · (∂ln g_i/∂w_ij) · g_i(ξ)

This is E{r · ∂ln g_i/∂w_ij | W, x_i}. And since ∑ (∂g_i/∂w_ij) = 0 (probabilities sum to 1), the baseline b_ij drops out:

E{(r − b_ij) · ∂ln g_i/∂w_ij | W, x_i} = ∂E{r | W, x_i} / ∂w_ij

The punchline: When α_ij = α for all i,j: E{ΔW | W} = α ∇_WE{r | W}. The expected REINFORCE update IS the gradient of expected reward, scaled by the learning rate. This is an unbiased gradient estimator — no approximation, no model of the reward function needed.

Why the baseline doesn't bias the gradient

The baseline b_ij can be anything that doesn't depend on the current output y_i. Why? Because ∑_ξ (∂g_i/∂w_ij) = ∂(∑ g_i)/∂w_ij = ∂1/∂w_ij = 0. The baseline term gets multiplied by this zero sum and vanishes in expectation. The baseline doesn't change the expected gradient direction — it only changes the variance of the estimate.

Why can we subtract any baseline b (independent of y_i) from the reward r without biasing the gradient estimate?

Because ∑ ∂g_i/∂w_ij = 0 (probabilities sum to 1), so the baseline term vanishes in expectation Because the baseline is always zero Because the baseline is subtracted from both sides

Chapter 4: Characteristic Eligibility

The term e_ij = ∂ln g_i/∂w_ij is the heart of REINFORCE. Williams calls it the characteristic eligibility — it measures how "eligible" each weight is for credit (or blame) based on the stochastic output that occurred.

The log-derivative trick

Why take the log of g_i? Because of a remarkable identity:

∇_θ E_x~p(θ)[f(x)] = E_x~p(θ)[f(x) · ∇_θ ln p(x; θ)]

The gradient of an expectation under a parameterized distribution equals the expectation of the function times the gradient of the log-probability. This lets us estimate the gradient by sampling: draw x from p(θ), compute f(x) · ∇ ln p(x; θ), and average. No need to differentiate through f or even know its form.

Eligibility for different unit types

For each unit type, we can compute the eligibility in closed form:

Bernoulli-logistic

e_ij = (y_i − p_i) x_j

Gaussian (mean μ)

e_μ = (y − μ) / σ²

Gaussian (std σ)

e_σ = ((y − μ)² − σ²) / σ³

Exponential family

e_μ = (y − μ) / σ² (general form, Proposition 1)

The universal pattern: For every exponential family distribution, the eligibility of the mean parameter has the form (y − μ)/σ². Bernoulli, Gaussian, Poisson, exponential — all follow this pattern. The eligibility is always "how surprising was this output?" normalized by the variance. This is Proposition 1 in the paper, and it's why REINFORCE generalizes so broadly.

Characteristic Eligibility Visualization

The eligibility measures how much a weight contributed to the stochastic outcome. High positive eligibility = the weight pushed toward this outcome; high negative = it pushed away. The REINFORCE update multiplies this by the reward signal.

The log-derivative trick converts ∇E[f(x)] into E[f(x) · ∇ ln p(x)]. Why is this useful?

Because we can estimate E[f(x) · ∇ ln p(x)] by sampling — we don't need to differentiate through f or know its form, only through our own policy distribution p Because it makes the computation faster Because the log is easier to compute than the probability

Chapter 5: Baselines

REINFORCE gives an unbiased gradient estimate, but unbiased doesn't mean useful. The variance can be enormous. If rewards are always positive (say, between 0 and 1), the update always reinforces every action — just more for higher rewards. All gradients point the same direction; the signal-to-noise ratio is terrible.

The variance reduction trick

The baseline b_ij doesn't change the expected gradient (proven in Chapter 3), but it dramatically changes the variance. The update becomes:

Δw_ij = α (r − b) · ∂ln g_i/∂w_ij

With a good baseline, the term (r − b) is sometimes positive (better than baseline — reinforce) and sometimes negative (worse than baseline — anti-reinforce). This balanced signal has much lower variance than raw r.

Reinforcement comparison

The most common baseline: an exponential moving average of recent rewards:

r̂(t) = γ r(t−1) + (1 − γ) r̂(t−1)

This tracks the "expected" reward, so (r − r̂) represents "how much better or worse than recent experience." The update becomes a Bernoulli-logistic REINFORCE with reinforcement comparison:

Δw_ij = α (r − r̂)(y_i − p_i) x_j

Why baselines matter — a worked example: Suppose rewards are always in [0.8, 1.0]. Without baseline (b=0): every update reinforces the action taken, with magnitude varying only between 0.8 and 1.0. The gradient signal is drowned in noise. With baseline b=0.9: updates range from −0.1 to +0.1. Now the sign of the update actually depends on whether the action was above or below average — a much clearer learning signal.

The optimal baseline

Williams mentions that the variance-minimizing baseline is not simply the mean reward, but a more complex quantity involving the eligibility: b* = E{r · e²} / E{e²}. This was further studied by Dayan (1990), who found only modest improvements over the simpler mean-reward baseline.

Baseline Effect on Gradient Variance

Drag the baseline slider to see how it affects the variance of the gradient estimate. Rewards are drawn uniformly from [0.5, 1.0]. Without baseline, all updates are positive. With baseline near the mean, updates are balanced around zero with much lower variance.

Baseline b0.00

Why does subtracting a baseline from the reward reduce the variance of the REINFORCE gradient estimate without introducing bias?

The baseline term vanishes in expectation (because ∑ ∂g/∂w = 0), so the mean is unchanged, but the centered signal has lower variance because updates now have mixed signs The baseline makes the learning rate smaller The baseline removes the reward entirely

Chapter 6: Episodic REINFORCE

So far we've considered immediate reinforcement — one input, one output, one reward. But many real problems involve episodes: a sequence of k timesteps, with a single reward r delivered at the end.

Unfolding in time

Williams handles this by "unfolding" the recurrent network through time — the same idea behind backpropagation through time (BPTT). A network N running for k steps becomes an acyclic network N* with k copies of each unit.

The episodic REINFORCE update accumulates eligibilities across all timesteps:

Δw_ij = α_ij (r − b_ij) ∑_t=1^k e_ij(t)

For Bernoulli-logistic units in a recurrent network:

Δw_ij = α (r − b) ∑_t=1^k [y_i(t) − p_i(t)] x_j(t−1)

Eligibility as a running accumulator: The sum ∑_t e_ij(t) can be computed online with a single accumulator per weight. Each timestep, add the local eligibility to the accumulator. When the reward arrives, multiply the accumulated eligibility by (r − b) and use that as the weight update. No backward pass, no stored trajectories — just running sums.

The limitation: uniform temporal credit

Williams acknowledges that episodic REINFORCE is "especially slow" because it assigns credit uniformly across all timesteps. If the reward was earned by a good action at timestep 42, all 100 timesteps get equal credit. This is why actor-critic methods (which use temporal difference learning to estimate per-timestep value) later proved much more practical — they solve the temporal credit assignment problem that episodic REINFORCE leaves unresolved.

Theorem 2 proves that episodic REINFORCE also follows the gradient: E{ΔW | W} = α ∇_WE{r | W}. The proof uses the same structure as Theorem 1, applied to the unfolded network.

Why is episodic REINFORCE slow compared to actor-critic methods?

It assigns credit uniformly across all timesteps, rather than estimating per-timestep value — so one good action at t=42 gets the same credit as an irrelevant action at t=1 It uses a smaller learning rate It requires more memory

Chapter 7: Gaussian Units

One of the paper's most prescient contributions: units with learnable exploration. A Gaussian unit has two adaptable parameters — mean μ and standard deviation σ — and the REINFORCE updates for each have distinct, interpretable behaviors.

Mean update: move toward good outcomes

Δμ = α_μ (r − b) (y − μ) / σ²

If y gave high reward and y > μ, shift μ upward (toward y). If y gave low reward and y > μ, shift μ downward (away from y). Standard gradient ascent behavior.

Variance update: control exploration

Δσ = α_σ (r − b) ((y − μ)² − σ²) / σ³

This is where it gets interesting. The factor ((y − μ)² − σ²) has two regimes:

If |y − μ| < σ (sampled close to mean): the factor is negative
If |y − μ| > σ (sampled far from mean): the factor is positive

Adaptive exploration in action: Consider what happens when μ sits atop a reward peak. Most samples close to μ get high reward → (r−b) positive, ((y−μ)²−σ²) negative → σ decreases. The search narrows around the peak — convergence. But if μ is on a flat region, nearby samples give similar reward, and the occasional far sample finds something better → σ increases. The search broadens — exploration. The unit automatically adjusts its exploration radius based on the reward landscape.

Gaussian Unit: Adaptive Exploration

The Gaussian unit searches a 1D reward landscape. Watch how μ (position) and σ (spread) adapt: σ narrows near peaks and widens on plateaus. Click "Run" to see 100 REINFORCE updates.

μ=0.2, σ=0.3

How does a Gaussian REINFORCE unit automatically control its exploration radius?

When good rewards come from samples close to μ, σ decreases (narrow search). When good rewards come from far samples, σ increases (broaden search). The unit learns its own exploration-exploitation tradeoff. The learning rate for σ is set to a schedule σ always decreases over time

Chapter 8: Backprop Compatibility

A key insight: REINFORCE doesn't have to work alone. In networks with deterministic hidden units and stochastic output units, we can use backpropagation through the deterministic parts and REINFORCE only at the stochastic units.

The hybrid algorithm

Consider a network with deterministic hidden layers (using ReLU, sigmoid, etc.) and stochastic output units (Bernoulli, Gaussian). The exploration happens at the output layer; the deterministic layers simply compute features. We can decompose the gradient computation:

∂ln g / ∂w_ij = ∑_{k ∈ output} (∂ln g_k / ∂p_k) · (∂p_k / ∂w_ij)

The first term is the REINFORCE eligibility at the output. The second term is computed by standard backpropagation through the deterministic layers. The two combine seamlessly.

Backpropagating through random number generators: Williams also considers a more radical idea. For a Gaussian unit, the output can be written as y = μ + σz where z ~ N(0,1). Since y is a differentiable function of μ and σ (for fixed z), we can backpropagate through the sampling operation. This is exactly the reparameterization trick — rediscovered 20 years later by Kingma & Welling (2013) for VAEs. Williams had the idea first.

This compatibility means REINFORCE can be used as a component in larger systems. Use backprop where you can (deterministic layers), and REINFORCE where you must (stochastic layers). The gradient estimates compose correctly because backpropagation preserves unbiasedness.

Williams describes writing a Gaussian unit's output as y = μ + σz (z ~ N(0,1)) and backpropagating through it. What modern technique does this anticipate?

The reparameterization trick used in VAEs (Kingma & Welling, 2013) Batch normalization Dropout

Chapter 9: Connections

What REINFORCE built on

Learning automata (Narendra & Thathatchar, 1989): The LR-I algorithm for 2-action automata is a special case of REINFORCE. Williams showed this existing algorithm was already doing statistical gradient ascent — it just didn't know it.

Backpropagation (Rumelhart, Hinton & Williams, 1986): REINFORCE extends the gradient descent paradigm from supervised to reinforcement learning, and the paper shows how the two compose seamlessly.

Actor-critic methods (Barto, Sutton & Anderson, 1983): The AR-I (Associative Reward-Inaction) algorithm is exactly REINFORCE applied to Bernoulli-logistic networks. Williams unified existing algorithms under one theoretical umbrella.

What REINFORCE enabled

REINFORCE with baseline became the standard policy gradient method used in every subsequent algorithm. PPO, TRPO, A3C all use this exact gradient estimator — they just add variance reduction techniques and trust region constraints on top.

Actor-critic methods (A2C, A3C, GAE): Replace the single reward with a learned value function baseline V(s), giving per-timestep credit assignment. The advantage A(s,a) = r + γV(s′) − V(s) is the modern version of Williams's (r − b).

PPO (Schulman et al., 2017): Adds a clipped surrogate objective that constrains policy updates, preventing the large step sizes that make vanilla REINFORCE unstable. But the underlying gradient estimate is still ∇ ln π(a|s) · A(s,a) — pure REINFORCE with an advantage baseline.

RLHF (Ouyang et al., 2022): Training language models with human feedback uses REINFORCE-style policy gradients. The reward model provides r, the reference policy provides b, and the log-probability gradient is the eligibility. Williams's 1992 equation powers modern AI alignment.

The legacy: REINFORCE is arguably the most consequential algorithm in reinforcement learning. Its core equation — weight update proportional to reward times log-probability gradient — appears in every modern policy gradient method. From Atari-playing agents to language model alignment to robot control, Williams's 1992 insight that you can estimate the gradient of expected reward through the log-derivative trick is the foundation everything else builds on.

Cheat sheet

Core equation

Δw_ij = α(r − b) ∂ln g_i/∂w_ij

Theorem 1

E{ΔW | W} = α ∇_WE{r | W} — unbiased gradient of expected reward

Bernoulli eligibility

e_ij = (y_i − p_i) x_j

Baseline

Any b independent of y_i reduces variance without bias

Gaussian exploration

μ seeks reward peaks; σ auto-adjusts exploration radius

Which modern technique traces its lineage directly to Williams's REINFORCE?

Batch normalization Dropout regularization PPO, RLHF, and all policy gradient methods — they all use the log-derivative trick and advantage baselines from REINFORCE