Legg, Chapter 6

TD Learning Without a Learning Rate

Deriving an automatic learning rate for temporal difference methods from statistical first principles.

Prerequisites: Basic RL concepts (states, rewards, value functions, TD learning).

Chapters

Simulations

Quizzes

Chapter 0: Motivation

In Chapter 5 we hit theoretical limits. Pure theory cannot produce practical algorithms beyond a certain point. So this chapter takes a different approach: derive a useful algorithm from statistical principles, then test it experimentally.

The target is the learning rate α in temporal difference (TD) learning. Standard TD(λ) has a free parameter α that must be tuned by hand. Set it too high, and learning is unstable. Too low, and convergence is painfully slow. Worse, the optimal α changes depending on the problem and even over time.

The contribution: By minimising a well-defined loss function and bootstrapping, Legg derives a new TD update rule where the learning rate is automatically set for each state transition. No manual tuning required. The resulting algorithm, HL(λ), consistently matches or outperforms standard TD(λ) across multiple test environments.

Check: What problem does this chapter solve?

The need to manually tune the learning rate in temporal difference learning The halting problem The problem of exploration

Chapter 1: TD Review

The expected future discounted reward of a state s is:

V̄_s = E{ r_k + γr_k+1 + γ²r_k+2 + ... | s_k = s }

The standard TD(0) update after transitioning from state s to s' with reward r:

V_s^t+1 := V_s^t + α(r_t + γV_s'^t - V_s^t)

The term (r + γV_s' - V_s) is the TD error Δ. The learning rate α controls how much the estimate moves toward the new evidence. TD(λ) extends this with eligibility traces: a trace E(s) that decays by γλ each step and increments by 1 when state s is visited. All states get updated proportionally to their trace.

pseudocode
# Standard TD(λ)
Initialise V(s) arbitrarily, E(s) = 0
repeat:
  Observe state s, take action, observe r, s'
  Δ ← r + γV(s') − V(s)
  E(s) ← E(s) + 1
  for all states s:
    V(s) ← V(s) + α · E(s) · Δ
    E(s) ← γλ · E(s)

The problem with α: Too high → estimates oscillate wildly. Too low → convergence takes forever. Even with decreasing schedules like α = κ/√t, finding the right κ requires experimentation. For different problems, different schedules work best. Can we eliminate this free parameter entirely?

Check: What role does the learning rate α play in TD learning?

It controls how much the value estimate moves toward new evidence on each update It determines the discount factor It sets the exploration rate

Chapter 2: The Loss Function

The key insight: instead of choosing α by trial and error, derive it by minimising a well-defined loss function.

The empirical future discounted reward at time k is the sum of actual rewards from that point:

v_k := ∑_u=k^∞ γ^u-k r_u

We want V_s^t to be close to v_k for all k where s_k = s. Allowing old evidence to be discounted by λ, the loss function is:

L := ½ ∑_k=1^t λ^t-k (v_k - V_s^t)²

For stationary environments, set λ = 1 (all evidence counts equally). For non-stationary environments, λ < 1 makes recent evidence count more.

Check: What does the loss function L measure?

The weighted sum of squared differences between our value estimates and the actual discounted returns The total reward received The number of states visited

Chapter 3: The Derivation

Take the partial derivative of L with respect to V_s^t, set it to zero, and solve. This gives an explicit expression for the optimal value estimate at each time step.

The derivation introduces three quantities tracked for each state s:

Symbol	Update Rule	Meaning
N_s^t	λN_s^t-1 + δ_{s_t,s}	Discounted visit count
E_s^t	λγE_s^t-1 + δ_{s_t,s}	Eligibility trace
R_s^t	λR_s^t-1 + λE_s^t-1r_t	Discounted reward with eligibility

The crucial step: the true v_t depends on future rewards, which we don't know. We bootstrap: replace the unknown future with our current estimate V_{s_t}^t. This is the same bootstrapping trick that makes all TD methods work.

After considerable algebra (expanding, substituting, cancelling), the result is elegant:

Check: What is the role of bootstrapping in the derivation?

It replaces unknown future rewards with current value estimates, making the formula computable It reduces the number of states It eliminates the discount factor

Chapter 4: The Update Rule

The HL(λ) update:

V_s^t+1 = V_s^t + E_s^t β_t(s, s_t+1) (r_t + γV_{s_t+1}^t - V_{s_t}^t)

where the automatic learning rate is:

β_t(s, s') := (1 / (N_s'^t - γE_s'^t)) · (N_s'^t / N_s^t)

This looks like standard TD(λ) with eligibility traces, but α has been replaced by β — a function that depends on the visit counts and eligibility traces of both the current state and the next state.

The first term in β provides normalisation: it adjusts for how many times the state has been visited. The second term is more interesting: N_s'/N_s compares visit counts. If the next state s' has been visited many more times than state s, the learning rate for s increases — we trust the well-estimated V_s' and use it aggressively to update V_s. If s is better-estimated, the learning rate decreases to maintain the existing estimate.

Why this is better: Standard TD uses the same α for all transitions, regardless of how well each state's value has been estimated. HL(λ) adapts the learning rate per-transition. Transitions involving well-estimated states get aggressive updates; transitions involving poorly-estimated states are conservative. This is exactly what you would want.

Check: How does HL(λ) set its learning rate differently from standard TD(λ)?

It uses a per-transition rate based on visit counts and eligibility traces, adapting to each state pair It uses a fixed rate of 0.5 It decreases α by 1/t

Chapter 5: Small Markov Chain

First test: a 21-state random walk. States 0-20, transitions left or right with equal probability (boundaries bounce). Reward +1 for moving from state 0 to 10, reward -1 for moving from state 20 to 10. Discount γ = 0.9.

Results over 10 runs: HL(1.0) learns very fast initially, with a more accurate final estimate and very stable mean squared error in the second half of the run. TD(0.7) with α=0.07 learns more slowly and, with a fixed learning rate, the error actually increases toward the end — a well-known problem with constant α.

Key observation: HL(λ) needed only one parameter (λ=1.0 for stationary environments). TD(λ) needed two parameters (λ and α), both requiring careful tuning. Despite fewer parameters, HL performed better.

Check: Why does the error of fixed-α TD increase toward the end of a long run?

Because with a fixed learning rate, old information and new information are weighted equally, causing estimates to oscillate around the true values Because the environment changes Because there are too many states

Chapter 6: Larger Tests

Scaling up to 51 states with γ=0.99 showed consistent results. HL(λ) with λ=1.0 outperformed TD with the best manually tuned parameters, including decreasing learning rates of the form κ/√t and κ/√[3]{t}.

A random 50-state Markov process (random transition matrix) with γ=0.9 again showed HL winning. Different problem structure, same result.

Consistent finding: Across all tests, HL(λ) with no manual tuning matched or beat TD(λ) with careful tuning. The performance advantage was especially clear in the stability of error estimates toward the end of runs.

Check: What was consistent across all test environments?

HL(λ) with no manual tuning matched or beat carefully tuned TD(λ) TD(λ) was always faster Both algorithms performed equally

Chapter 7: Non-Stationary Environments

In non-stationary environments, the dynamics change over time. The 21-state chain alternated between two configurations every 5,000 steps (different reward for the middle transition). Now λ < 1 is needed to discount old experience.

Optimal λ for HL fell from 1.0 to about 0.9995 — almost all history is still relevant for 5,000-step phases. TD needed both λ and α tuned to ~0.8 and ~0.05 respectively.

HL(0.9995) adapted faster at environment switches while maintaining low error during stable phases. The implicit per-transition learning rate adjusted automatically: recently visited states got updated more aggressively after a switch.

Check: How does HL(λ) handle non-stationary environments?

By setting λ slightly below 1 to discount old experience, while the per-transition learning rate automatically adapts to changes By resetting all values to zero It cannot handle non-stationary environments

Chapter 8: Windy Gridworld

The final test: extending HL to reinforcement learning. Standard Sarsa(λ) and Q(λ) algorithms were modified by replacing α with the HL learning rate β. The resulting algorithms, HLS(λ) and HLQ(λ), were tested on the Windy Gridworld.

The Windy Gridworld is a 7×10 grid with a "wind" that blows the agent upward in certain columns (strength 1 or 2). The agent starts at (3,0) and must reach (3,7). Actions: up, down, left, right.

Results: HLS(λ) with λ=0.995 and ε=0.003 easily beat Sarsa(λ) with the best manually tuned α=0.4, λ=0.5, ε=0.005. HLQ(λ) was more competitive with standard Q(λ), but still achieved comparable performance with fewer tuned parameters.

The bottom line: HL-based methods produce superior performance with fewer parameters to tune. The combination of better performance and less manual effort is a significant practical advantage. The 1.7x computational overhead (from computing β values) is a modest price for eliminating a fragile free parameter.

Check: What is the main practical advantage of HL-based RL algorithms?

Better or equal performance with fewer parameters to tune They run faster They use less memory

Chapter 9: Summary

The Problem

Learning rate α in TD(λ) must be manually tuned

↓ minimise a loss function

The Derivation

Minimise squared error of value estimates, then bootstrap

↓ produces

HL(λ)

Per-transition automatic learning rate β(s,s')

↓ experiments show

Superior Results

Matches/beats tuned TD in all tests, with fewer parameters

This chapter demonstrates that even when theoretical limits constrain what we can prove, statistical principles can still guide algorithm design. The HL learning rate is a concrete, practical contribution derived from the same variational framework that underlies Solomonoff induction and AIXI.

Check: What is the key innovation of HL(λ)?

Replacing the fixed learning rate α with an automatically computed per-transition rate β derived from statistical principles Using deeper neural networks A new exploration strategy