The definitive reinforcement learning (RL) textbook, rebuilt chapter by chapter as interactive lessons. From bandits to policy gradients, with live simulations at every step.
Exploration vs exploitation, epsilon-greedy, UCB, gradient bandits.
Agent-environment interface, returns, value functions, Bellman equations.
Policy evaluation, policy improvement, value iteration, GPI.
MC prediction, MC control, importance sampling, off-policy learning.
TD(0), SARSA, Q-learning, expected SARSA, double learning.
n-step TD, n-step SARSA, tree backup algorithm.
Dyna, prioritized sweeping, MCTS, model-based RL.
SGD, linear methods, tile coding, neural networks, LSTD.
Semi-gradient SARSA, average reward, continuing tasks.
The deadly triad, Baird's counterexample, gradient-TD.
λ-return, TD(λ), true online TD(λ), SARSA(λ).
REINFORCE, baselines, actor-critic, policy gradient theorem.
Classical conditioning, instrumental conditioning, TD model.
Reward prediction error, dopamine, basal ganglia, habits.
TD-Gammon, Atari DQN, AlphaGo, personalized web services.
Options, temporal abstraction, reward design, future of RL.