Legg, Chapter 6

TD Learning Without a Learning Rate

Deriving an automatic learning rate for temporal difference methods from statistical first principles.

Prerequisites: Basic RL concepts (states, rewards, value functions, TD learning).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Motivation

In Chapter 5 we hit theoretical limits. Pure theory cannot produce practical algorithms beyond a certain point. So this chapter takes a different approach: derive a useful algorithm from statistical principles, then test it experimentally.

The target is the learning rate α in temporal difference (TD) learning. Standard TD(λ) has a free parameter α that must be tuned by hand. Set it too high, and learning is unstable. Too low, and convergence is painfully slow. Worse, the optimal α changes depending on the problem and even over time.

The contribution: By minimising a well-defined loss function and bootstrapping, Legg derives a new TD update rule where the learning rate is automatically set for each state transition. No manual tuning required. The resulting algorithm, HL(λ), consistently matches or outperforms standard TD(λ) across multiple test environments.
Check: What problem does this chapter solve?

Chapter 1: TD Review

The expected future discounted reward of a state s is:

s = E{ rk + γrk+1 + γ2rk+2 + ... | sk = s }

The standard TD(0) update after transitioning from state s to s' with reward r:

Vst+1 := Vst + α(rt + γVs't - Vst)

The term (r + γVs' - Vs) is the TD error Δ. The learning rate α controls how much the estimate moves toward the new evidence. TD(λ) extends this with eligibility traces: a trace E(s) that decays by γλ each step and increments by 1 when state s is visited. All states get updated proportionally to their trace.

pseudocode
# Standard TD(λ)
Initialise V(s) arbitrarily, E(s) = 0
repeat:
  Observe state s, take action, observe r, s'
  Δ ← r + γV(s') − V(s)
  E(s) ← E(s) + 1
  for all states s:
    V(s) ← V(s) + α · E(s) · Δ
    E(s) ← γλ · E(s)
The problem with α: Too high → estimates oscillate wildly. Too low → convergence takes forever. Even with decreasing schedules like α = κ/√t, finding the right κ requires experimentation. For different problems, different schedules work best. Can we eliminate this free parameter entirely?
Check: What role does the learning rate α play in TD learning?

Chapter 2: The Loss Function

The key insight: instead of choosing α by trial and error, derive it by minimising a well-defined loss function.

The empirical future discounted reward at time k is the sum of actual rewards from that point:

vk := ∑u=k γu-k ru

We want Vst to be close to vk for all k where sk = s. Allowing old evidence to be discounted by λ, the loss function is:

L := ½ ∑k=1t λt-k (vk - Vst)2

For stationary environments, set λ = 1 (all evidence counts equally). For non-stationary environments, λ < 1 makes recent evidence count more.

Check: What does the loss function L measure?

Chapter 3: The Derivation

Take the partial derivative of L with respect to Vst, set it to zero, and solve. This gives an explicit expression for the optimal value estimate at each time step.

The derivation introduces three quantities tracked for each state s:

SymbolUpdate RuleMeaning
NstλNst-1 + δst,sDiscounted visit count
EstλγEst-1 + δst,sEligibility trace
RstλRst-1 + λEst-1rtDiscounted reward with eligibility

The crucial step: the true vt depends on future rewards, which we don't know. We bootstrap: replace the unknown future with our current estimate Vstt. This is the same bootstrapping trick that makes all TD methods work.

After considerable algebra (expanding, substituting, cancelling), the result is elegant:

Check: What is the role of bootstrapping in the derivation?

Chapter 4: The Update Rule

The HL(λ) update:
Vst+1 = Vst + Est βt(s, st+1) (rt + γVst+1t - Vstt)
where the automatic learning rate is:
βt(s, s') := (1 / (Ns't - γEs't)) · (Ns't / Nst)

This looks like standard TD(λ) with eligibility traces, but α has been replaced by β — a function that depends on the visit counts and eligibility traces of both the current state and the next state.

The first term in β provides normalisation: it adjusts for how many times the state has been visited. The second term is more interesting: Ns'/Ns compares visit counts. If the next state s' has been visited many more times than state s, the learning rate for s increases — we trust the well-estimated Vs' and use it aggressively to update Vs. If s is better-estimated, the learning rate decreases to maintain the existing estimate.

Why this is better: Standard TD uses the same α for all transitions, regardless of how well each state's value has been estimated. HL(λ) adapts the learning rate per-transition. Transitions involving well-estimated states get aggressive updates; transitions involving poorly-estimated states are conservative. This is exactly what you would want.
Check: How does HL(λ) set its learning rate differently from standard TD(λ)?

Chapter 5: Small Markov Chain

First test: a 21-state random walk. States 0-20, transitions left or right with equal probability (boundaries bounce). Reward +1 for moving from state 0 to 10, reward -1 for moving from state 20 to 10. Discount γ = 0.9.

Results over 10 runs: HL(1.0) learns very fast initially, with a more accurate final estimate and very stable mean squared error in the second half of the run. TD(0.7) with α=0.07 learns more slowly and, with a fixed learning rate, the error actually increases toward the end — a well-known problem with constant α.

Key observation: HL(λ) needed only one parameter (λ=1.0 for stationary environments). TD(λ) needed two parameters (λ and α), both requiring careful tuning. Despite fewer parameters, HL performed better.
Check: Why does the error of fixed-α TD increase toward the end of a long run?

Chapter 6: Larger Tests

Scaling up to 51 states with γ=0.99 showed consistent results. HL(λ) with λ=1.0 outperformed TD with the best manually tuned parameters, including decreasing learning rates of the form κ/√t and κ/√[3]{t}.

A random 50-state Markov process (random transition matrix) with γ=0.9 again showed HL winning. Different problem structure, same result.

Consistent finding: Across all tests, HL(λ) with no manual tuning matched or beat TD(λ) with careful tuning. The performance advantage was especially clear in the stability of error estimates toward the end of runs.
Check: What was consistent across all test environments?

Chapter 7: Non-Stationary Environments

In non-stationary environments, the dynamics change over time. The 21-state chain alternated between two configurations every 5,000 steps (different reward for the middle transition). Now λ < 1 is needed to discount old experience.

Optimal λ for HL fell from 1.0 to about 0.9995 — almost all history is still relevant for 5,000-step phases. TD needed both λ and α tuned to ~0.8 and ~0.05 respectively.

HL(0.9995) adapted faster at environment switches while maintaining low error during stable phases. The implicit per-transition learning rate adjusted automatically: recently visited states got updated more aggressively after a switch.

Check: How does HL(λ) handle non-stationary environments?

Chapter 8: Windy Gridworld

The final test: extending HL to reinforcement learning. Standard Sarsa(λ) and Q(λ) algorithms were modified by replacing α with the HL learning rate β. The resulting algorithms, HLS(λ) and HLQ(λ), were tested on the Windy Gridworld.

The Windy Gridworld is a 7×10 grid with a "wind" that blows the agent upward in certain columns (strength 1 or 2). The agent starts at (3,0) and must reach (3,7). Actions: up, down, left, right.

Results: HLS(λ) with λ=0.995 and ε=0.003 easily beat Sarsa(λ) with the best manually tuned α=0.4, λ=0.5, ε=0.005. HLQ(λ) was more competitive with standard Q(λ), but still achieved comparable performance with fewer tuned parameters.

The bottom line: HL-based methods produce superior performance with fewer parameters to tune. The combination of better performance and less manual effort is a significant practical advantage. The 1.7x computational overhead (from computing β values) is a modest price for eliminating a fragile free parameter.
Check: What is the main practical advantage of HL-based RL algorithms?

Chapter 9: Summary

The Problem
Learning rate α in TD(λ) must be manually tuned
↓ minimise a loss function
The Derivation
Minimise squared error of value estimates, then bootstrap
↓ produces
HL(λ)
Per-transition automatic learning rate β(s,s')
↓ experiments show
Superior Results
Matches/beats tuned TD in all tests, with fewer parameters

This chapter demonstrates that even when theoretical limits constrain what we can prove, statistical principles can still guide algorithm design. The HL learning rate is a concrete, practical contribution derived from the same variational framework that underlies Solomonoff induction and AIXI.

Check: What is the key innovation of HL(λ)?