Deriving an automatic learning rate for temporal difference methods from statistical first principles.
In Chapter 5 we hit theoretical limits. Pure theory cannot produce practical algorithms beyond a certain point. So this chapter takes a different approach: derive a useful algorithm from statistical principles, then test it experimentally.
The target is the learning rate α in temporal difference (TD) learning. Standard TD(λ) has a free parameter α that must be tuned by hand. Set it too high, and learning is unstable. Too low, and convergence is painfully slow. Worse, the optimal α changes depending on the problem and even over time.
The expected future discounted reward of a state s is:
The standard TD(0) update after transitioning from state s to s' with reward r:
The term (r + γVs' - Vs) is the TD error Δ. The learning rate α controls how much the estimate moves toward the new evidence. TD(λ) extends this with eligibility traces: a trace E(s) that decays by γλ each step and increments by 1 when state s is visited. All states get updated proportionally to their trace.
pseudocode # Standard TD(λ) Initialise V(s) arbitrarily, E(s) = 0 repeat: Observe state s, take action, observe r, s' Δ ← r + γV(s') − V(s) E(s) ← E(s) + 1 for all states s: V(s) ← V(s) + α · E(s) · Δ E(s) ← γλ · E(s)
The key insight: instead of choosing α by trial and error, derive it by minimising a well-defined loss function.
The empirical future discounted reward at time k is the sum of actual rewards from that point:
We want Vst to be close to vk for all k where sk = s. Allowing old evidence to be discounted by λ, the loss function is:
For stationary environments, set λ = 1 (all evidence counts equally). For non-stationary environments, λ < 1 makes recent evidence count more.
Take the partial derivative of L with respect to Vst, set it to zero, and solve. This gives an explicit expression for the optimal value estimate at each time step.
The derivation introduces three quantities tracked for each state s:
| Symbol | Update Rule | Meaning |
|---|---|---|
| Nst | λNst-1 + δst,s | Discounted visit count |
| Est | λγEst-1 + δst,s | Eligibility trace |
| Rst | λRst-1 + λEst-1rt | Discounted reward with eligibility |
The crucial step: the true vt depends on future rewards, which we don't know. We bootstrap: replace the unknown future with our current estimate Vstt. This is the same bootstrapping trick that makes all TD methods work.
After considerable algebra (expanding, substituting, cancelling), the result is elegant:
This looks like standard TD(λ) with eligibility traces, but α has been replaced by β — a function that depends on the visit counts and eligibility traces of both the current state and the next state.
The first term in β provides normalisation: it adjusts for how many times the state has been visited. The second term is more interesting: Ns'/Ns compares visit counts. If the next state s' has been visited many more times than state s, the learning rate for s increases — we trust the well-estimated Vs' and use it aggressively to update Vs. If s is better-estimated, the learning rate decreases to maintain the existing estimate.
First test: a 21-state random walk. States 0-20, transitions left or right with equal probability (boundaries bounce). Reward +1 for moving from state 0 to 10, reward -1 for moving from state 20 to 10. Discount γ = 0.9.
Results over 10 runs: HL(1.0) learns very fast initially, with a more accurate final estimate and very stable mean squared error in the second half of the run. TD(0.7) with α=0.07 learns more slowly and, with a fixed learning rate, the error actually increases toward the end — a well-known problem with constant α.
Scaling up to 51 states with γ=0.99 showed consistent results. HL(λ) with λ=1.0 outperformed TD with the best manually tuned parameters, including decreasing learning rates of the form κ/√t and κ/√[3]{t}.
A random 50-state Markov process (random transition matrix) with γ=0.9 again showed HL winning. Different problem structure, same result.
In non-stationary environments, the dynamics change over time. The 21-state chain alternated between two configurations every 5,000 steps (different reward for the middle transition). Now λ < 1 is needed to discount old experience.
Optimal λ for HL fell from 1.0 to about 0.9995 — almost all history is still relevant for 5,000-step phases. TD needed both λ and α tuned to ~0.8 and ~0.05 respectively.
HL(0.9995) adapted faster at environment switches while maintaining low error during stable phases. The implicit per-transition learning rate adjusted automatically: recently visited states got updated more aggressively after a switch.
The final test: extending HL to reinforcement learning. Standard Sarsa(λ) and Q(λ) algorithms were modified by replacing α with the HL learning rate β. The resulting algorithms, HLS(λ) and HLQ(λ), were tested on the Windy Gridworld.
The Windy Gridworld is a 7×10 grid with a "wind" that blows the agent upward in certain columns (strength 1 or 2). The agent starts at (3,0) and must reach (3,7). Actions: up, down, left, right.
Results: HLS(λ) with λ=0.995 and ε=0.003 easily beat Sarsa(λ) with the best manually tuned α=0.4, λ=0.5, ε=0.005. HLQ(λ) was more competitive with standard Q(λ), but still achieved comparable performance with fewer tuned parameters.
This chapter demonstrates that even when theoretical limits constrain what we can prove, statistical principles can still guide algorithm design. The HL learning rate is a concrete, practical contribution derived from the same variational framework that underlies Solomonoff induction and AIXI.