Ch 16: Applications — Sutton & Barto RL

Chapter 0: RL in the Real World

For most of this book, RL has been a collection of algorithms applied to small, clean problems: gridworlds, random walks, mountain cars. This chapter shows what happens when those algorithms meet the real world — or at least very challenging simulated worlds. The results have been spectacular.

RL has achieved superhuman performance in backgammon (1992), Atari games (2015), the game of Go (2016), and StarCraft II (2019). But it's not just games. RL controls data center cooling at Google, schedules DRAM memory controllers, manages web service recommendations, and enables robots to walk and manipulate objects.

The common pattern: Every RL success story follows the same template. (1) Define a suitable state representation. (2) Design a reward signal that captures the real objective. (3) Choose an RL algorithm that scales to the problem's complexity. (4) Train with enough experience (real or simulated). The challenge is always in steps 1-3; the algorithms from this book handle step 4.

We'll trace this pattern through a series of landmark applications, from the earliest (Samuel's checkers, 1959) to the most recent (AlphaGo Zero, 2017). Each taught the RL community something new about what works and why.

Check: What is the common pattern in successful RL applications?

• Define good state representation, design appropriate reward, choose scalable algorithm, and train with sufficient experience
• Use the most complex algorithm available with maximum compute
• Copy human strategies exactly

Chapter 1: TD-Gammon

In 1992, Gerald Tesauro created TD-Gammon, a backgammon program that learned entirely through self-play. It played hundreds of thousands of games against itself (about 1.5 million in later versions), using TD(λ) to learn a value function approximated by a neural network. The result: near-expert-level play that surprised even Tesauro.

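Backgammon has roughly 10²⁰ states, far too many for a lookup table. Tesauro used a neural network with about 80 hidden units (the count varied across versions) to approximate V(s). The key insight was using afterstate values: evaluating the board position that results from the player's move, before the opponent's dice roll. Once the dice are known the player's own move is deterministic, so each legal move can be scored directly by the value of the position it produces.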

State

Raw board features (198 inputs encoding piece positions)

↓

Neural Network

80 hidden units → 1 output = V(s)

↓

Action Selection

Roll dice, evaluate all legal afterstates, pick highest V

↻ self-play for 1.5 million games

Afterstate values: In backgammon, you roll the dice and then move; the next random event is the opponent's roll. Evaluating the position after your move but before that roll gives a much cleaner value estimate. The "afterstate" V function sees only the deterministic part of the transition. This same trick applies to any domain with a stochastic component after the agent's action (like Tetris).
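The action-selection step above can be sketched in a few lines. This is a minimal illustration, not Tesauro's code: `pick_move`, the move labels, and the lookup-table value function are hypothetical stand-ins for a real move generator and the trained network.

```python
from typing import Callable, Hashable, Iterable, Tuple

Afterstate = Hashable  # any hashable board encoding (hypothetical)


def pick_move(
    legal_moves: Iterable[Tuple[str, Afterstate]],
    value: Callable[[Afterstate], float],
) -> Tuple[str, Afterstate]:
    """Return the (move, afterstate) pair whose afterstate has the highest value.

    Each legal move is paired with the board it produces. Because the dice
    are already rolled, that board is known exactly, so no expectation over
    outcomes is needed at decision time.
    """
    return max(legal_moves, key=lambda pair: value(pair[1]))


# Toy example: afterstates are just labels, and V is a lookup table
# standing in for the neural network.
toy_values = {"safe": 0.55, "blitz": 0.62, "run": 0.48}
moves = [("13/7", "safe"), ("24/18", "blitz"), ("6/1", "run")]
best_move, best_afterstate = pick_move(moves, toy_values.__getitem__)
```

The design point is that the value function is queried once per legal move, so decision time grows with the number of legal moves rather than with the number of possible dice outcomes.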

TD-Gammon went through several versions. The first, TD-Gammon 0.0, used only raw board features. Later versions added hand-crafted features and larger networks, eventually reaching a level where expert players said it had changed how humans play backgammon. It discovered opening strategies that departed from long-standing conventional wisdom, and human experts adopted some of them.

Check: What is the advantage of using afterstate values in backgammon?

• It reduces the neural network size
• It evaluates positions after the player's move but before the dice roll, removing the stochastic component from the value estimate
• It allows the program to cheat at dice

Chapter 2: Samuel's Checkers

Long before TD-Gammon, Arthur Samuel at IBM created a checkers-playing program (1959) that is widely considered the first demonstration of machine learning. It learned from self-play, using what we now recognize as temporal-difference methods combined with a linear value function.

Samuel's program used a weighted sum of board features (number of pieces, advancement, center control) as its value function. It updated these weights using something remarkably close to TD(0) — adjusting predictions based on the difference between the current evaluation and the next position's evaluation.
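To make the weight update concrete, here is a sketch of textbook semi-gradient TD(0) with a linear value function. It is not a reconstruction of Samuel's actual procedure; the feature vectors, step size `alpha`, and the explicit `reward` argument are assumptions chosen for illustration.

```python
from typing import List, Sequence


def linear_value(weights: Sequence[float], features: Sequence[float]) -> float:
    """V(s) as a weighted sum of board features."""
    return sum(w * x for w, x in zip(weights, features))


def td0_update(
    weights: Sequence[float],
    features: Sequence[float],
    next_features: Sequence[float],
    reward: float,
    alpha: float = 0.1,
    gamma: float = 1.0,
    terminal: bool = False,
) -> List[float]:
    """Return new weights after one semi-gradient TD(0) step."""
    v = linear_value(weights, features)
    # A terminal next position has value 0 by definition.
    v_next = 0.0 if terminal else linear_value(weights, next_features)
    delta = reward + gamma * v_next - v  # TD error
    # For a linear V, the gradient with respect to the weights is the
    # feature vector itself.
    return [w + alpha * delta * x for w, x in zip(weights, features)]


# Hypothetical features: (bias, piece advantage, center control)
w0 = [0.0, 0.0, 0.0]
w1 = td0_update(w0, [1.0, 0.5, 0.0], [1.0, 1.0, 0.0], reward=1.0, terminal=True)
```

Only the weights of features that were active in the current position move, and they move in proportion to how wrong the current evaluation turned out to be.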

Historical significance: Samuel's program was the first to demonstrate that a computer could improve at a complex task through experience, without being explicitly programmed with a strategy. It was the first prominent use of TD learning with function approximation — the very combination we studied in Chapters 9-10. Samuel coined the phrase "machine learning" in 1959.

The program had two key innovations. First, it learned from self-play by maintaining two copies of the value function: a "stable" version and a learning version. Second, it used a form of feature selection, periodically replacing the worst-performing features. While it never reached the very top level of play, it demonstrated the fundamental viability of learning from experience in complex domains.

Looking back with modern eyes, Samuel's system combined linear function approximation, TD-style updates, and self-play. All three ideas became cornerstones of RL. The main limitation was the linear approximation; it took neural networks (TD-Gammon) to break through to truly high-level play.

Check: What was Arthur Samuel's key contribution to machine learning?

• He demonstrated the first program that improved through experience using TD-like methods and self-play (1959)
• He invented neural networks
• He proved that checkers is a solved game

Chapter 3: Watson's Jeopardy

IBM's Watson famously won the Jeopardy! game show in 2011, defeating two of the greatest human champions. While Watson's question-answering system was primarily NLP, it used RL for a critical component: wagering strategy.

In Jeopardy!, players must decide how much to wager on Daily Doubles and in Final Jeopardy. The optimal wager depends on the current score, the opponent's score, the number of clues remaining, and the player's confidence in answering correctly. This is a sequential decision problem with a clear reward (winning the game) — a natural fit for RL.

The RL component: Watson used TD learning to estimate the probability of winning from any game state (score configuration). The value function V(s) = P(win | current scores, clues remaining) was learned by simulating millions of Jeopardy! games. Wagering decisions were then made to maximize this win probability, not just to maximize score.
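The idea of wagering to maximize win probability rather than score can be sketched as follows. This is a simplified illustration, not Watson's system: `toy_win_prob` is a made-up logistic stand-in for the learned value function, it ignores the opponents' future play, and the score values are arbitrary.

```python
import math
from typing import Callable, Iterable


def best_wager(
    score: int,
    wagers: Iterable[int],
    p_correct: float,
    win_prob: Callable[[int], float],
) -> int:
    """Pick the wager with the highest expected win probability.

    A correct answer adds the wager to the score, a wrong one subtracts it,
    and each outcome is weighted by the confidence in answering correctly.
    """
    def expected(w: int) -> float:
        return p_correct * win_prob(score + w) + (1 - p_correct) * win_prob(score - w)

    return max(wagers, key=expected)


def toy_win_prob(score: int) -> float:
    """Hypothetical V(s): chance of winning against a fixed 10,000-point rival."""
    return 1.0 / (1.0 + math.exp(-(score - 10_000) / 2_000))


options = range(0, 8_001, 1_000)
confident = best_wager(8_000, options, p_correct=1.0, win_prob=toy_win_prob)
hopeless = best_wager(8_000, options, p_correct=0.0, win_prob=toy_win_prob)
```

Because the value function is nonlinear in score, the best wager at intermediate confidence levels generally differs from the one that maximizes expected score.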

The wagering strategy was surprisingly important. In tournament play, the right wager at the right time can turn a losing game into a winning one. Watson's TD-learned strategy was near-optimal in many scenarios and outperformed the heuristic strategies that human players typically use.

This is a nice example of RL complementing other AI techniques. Watson's NLP answered the questions; RL made the strategic decisions. The lesson: even when the core task is not an RL problem, the meta-decisions (what to risk, when to be aggressive) often are.

Check: What aspect of Jeopardy! did Watson use TD learning for?

• Answering trivia questions
• Wagering strategy: learning how much to bet based on game state to maximize win probability
• Pressing the buzzer at the right time

Chapter 4: Memory Controller

Not all RL applications involve games. Ipek et al. (2008) used RL to control a DRAM memory scheduler — the hardware component that decides the order in which memory access requests are served. This is a real-time control problem with microsecond-scale decisions and significant impact on computer performance.

The state includes the current queue of pending requests, the status of each memory bank, and recent access patterns. The actions are scheduling decisions: which request to serve next. The reward is a function of memory throughput and latency. The algorithm was Sarsa with tile coding for function approximation.

State

Memory bank status, request queue, access history

↓

Tile Coding

Convert continuous state to binary features (Ch 9)

↓

Sarsa

Learn Q(s,a) → greedy scheduling policy

↻ microsecond decisions

Why RL beats hand-tuned controllers: Traditional memory schedulers use fixed priority rules (e.g., "first-come-first-served" or "row-hit-first"). These rules work well on average but can't adapt to specific workload patterns. The RL controller learns a workload-specific policy that outperformed the fixed rules it was compared against, with 15-20% better throughput in some workloads. It was also designed to be simple enough to implement in hardware and to make decisions in real time.

This application demonstrated that RL can work in domains with extremely tight latency constraints. The tile coding representation (Chapter 9) was essential — it provided the speed needed for real-time inference while still capturing the relevant state features. This is a template for applying RL to any real-time control problem.
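A small sketch shows why this combination is cheap at decision time: a Q-value is just the sum of a handful of weights. The code below is a generic Sarsa learner with tile coding, not the scheduler from the paper. The state scaling to [0, 1), the number of tilings, and the class and function names are all assumptions.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def active_tiles(
    state: Sequence[float], n_tilings: int = 4, tiles_per_dim: int = 8
) -> List[Tuple[int, ...]]:
    """Return one active tile per tiling for a state with coordinates in [0, 1)."""
    tiles = []
    for t in range(n_tilings):
        # Shift each tiling by a different fraction of one tile width.
        offset = t / (n_tilings * tiles_per_dim)
        coords = tuple(int((x + offset) * tiles_per_dim) for x in state)
        tiles.append((t,) + coords)
    return tiles


class TileSarsa:
    """Linear Sarsa over binary tile features, with weights stored sparsely."""

    def __init__(self, alpha: float = 0.1, gamma: float = 0.95, n_tilings: int = 4):
        self.w = defaultdict(float)
        self.alpha = alpha / n_tilings  # split the step size across tilings
        self.gamma = gamma
        self.n_tilings = n_tilings

    def q(self, state: Sequence[float], action: int) -> float:
        # Q(s, a) is the sum of the weights of the active tiles.
        return sum(self.w[(tile, action)] for tile in active_tiles(state, self.n_tilings))

    def update(self, s, a: int, r: float, s_next, a_next: int) -> None:
        # On-policy target: uses the action actually chosen in s_next.
        delta = r + self.gamma * self.q(s_next, a_next) - self.q(s, a)
        for tile in active_tiles(s, self.n_tilings):
            self.w[(tile, a)] += self.alpha * delta


agent = TileSarsa()
agent.update((0.2, 0.7), 0, 1.0, (0.9, 0.1), 1)
```

After this single update, nearby states that share tiles with (0.2, 0.7) also see a raised value for action 0, which is the generalization tile coding provides.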

Check: Why was tile coding important for the memory controller application?

• It made the state space finite
• It allowed the controller to learn offline
• It provided fast inference for real-time microsecond decisions while capturing relevant state features

Chapter 5: DQN and Atari

In 2015, DeepMind published a paper that changed the field: a single RL algorithm, using only raw pixel inputs, learned to play 49 different Atari games, reaching human-comparable scores (at least 75% of a professional tester's score) on 29 of them. The algorithm was DQN (Deep Q-Network), and it marked the beginning of deep reinforcement learning.

DQN is Q-learning (Chapter 6) with a deep convolutional neural network as the function approximator. The input is a stack of 4 consecutive frames (84×84 grayscale). The network outputs Q(s,a) for all 18 possible Atari actions. The agent selects actions ε-greedily. But making this work required two critical innovations.

The two breakthroughs: (1) Experience replay: store transitions in a large buffer and sample random minibatches for training, breaking correlations in sequential data. (2) Target network: use a slowly-updated copy of the network to compute TD targets, preventing the "moving target" instability. These two ideas made deep RL stable for the first time.
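Experience replay is simple enough to sketch with the standard library. The class below is a minimal illustration, not DeepMind's implementation; the names `Transition` and `ReplayBuffer` and the `done` flag are assumptions.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple


class Transition(NamedTuple):
    state: object
    action: int
    reward: float
    next_state: object
    done: bool  # True if next_state ended the episode


class ReplayBuffer:
    """Fixed-capacity store that evicts the oldest transition when full."""

    def __init__(self, capacity: int):
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


buf = ReplayBuffer(capacity=3)
for step in range(5):
    buf.push(Transition(step, 0, 0.0, step + 1, False))
batch = buf.sample(2)
```

With a capacity of 3 and 5 pushes, only the last three transitions remain, so the sampled batch is drawn from those.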

Figure: DQN architecture. Raw pixels enter the CNN and Q-values come out, one per action. Experience replay stabilizes training, and the target network prevents oscillation.

The significance of DQN cannot be overstated. It showed that a single, general-purpose RL algorithm could learn complex behaviors from raw sensory inputs across many diverse domains. No game-specific engineering, no hand-crafted features. Just pixels and a score. This was the "ImageNet moment" for RL.

pseudocode
# DQN Training Loop
Initialize replay buffer D, Q-network θ, target network θ⁻ = θ

for each step:
  a ← ε-greedy from Q(s, ·; θ)
  observe r, s'
  store (s, a, r, s') in D
  s ← s'  (reset s when the episode ends)

  # Sample random minibatch from D
  (s_j, a_j, r_j, s'_j) ~ D

  # Compute target using frozen target network
  y_j = r_j                              if s'_j is terminal
  y_j = r_j + γ max_a' Q(s'_j, a'; θ⁻)   otherwise

  # Update Q-network by gradient descent on the squared TD error
  # (y_j is treated as a constant)
  θ ← θ − α ∇_θ(y_j − Q(s_j, a_j; θ))²

  # Every C steps: θ⁻ ← θ
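The target computation in the loop above can be written as runnable code. This sketch uses plain callables in place of neural networks, so `td_targets`, `squared_td_loss`, and the toy Q-functions are illustrative assumptions rather than part of DQN itself.

```python
from typing import Callable, List, Sequence, Tuple

QFunction = Callable[[object], Sequence[float]]  # state -> one value per action
Transition = Tuple[object, int, float, object, bool]  # (s, a, r, s', done)


def td_targets(
    batch: Sequence[Transition], target_q: QFunction, gamma: float = 0.99
) -> List[float]:
    """Compute y_j for each transition using the frozen target network."""
    targets = []
    for _s, _a, r, s_next, done in batch:
        # No bootstrapping from a terminal state.
        bootstrap = 0.0 if done else gamma * max(target_q(s_next))
        targets.append(r + bootstrap)
    return targets


def squared_td_loss(
    batch: Sequence[Transition], online_q: QFunction, targets: Sequence[float]
) -> float:
    """Mean squared difference between targets and the online Q(s_j, a_j)."""
    errors = [y - online_q(s)[a] for (s, a, *_), y in zip(batch, targets)]
    return sum(e * e for e in errors) / len(errors)


# Toy two-action Q-functions standing in for the two networks.
def frozen(_state):
    return [0.0, 2.0]


def online(_state):
    return [1.0, 1.0]


batch = [("s0", 0, 1.0, "s1", False), ("s1", 1, 0.5, None, True)]
ys = td_targets(batch, frozen, gamma=0.5)
loss = squared_td_loss(batch, online, ys)
```

Only `online` would receive gradient updates in a real implementation; `frozen` changes only when the online weights are copied into it.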

Check: What are the two key innovations that made DQN stable?

• Experience replay (break correlations) and target network (stabilize TD targets)
• Larger networks and more compute
• Game-specific reward shaping and curriculum learning

Chapter 6: AlphaGo

Go was long considered the "grand challenge" of board game AI. With 10¹⁷⁰ possible positions (vastly more than chess's 10⁴⁷), brute-force search was hopeless. Human intuition — the ability to look at a board and "feel" which areas are important — seemed essential. In 2016, AlphaGo defeated world champion Lee Sedol 4-1, using a remarkable fusion of deep learning, RL, and search.

AlphaGo's pipeline combined four components, trained in sequence:

1. SL Policy Network p_σ

Trained on 30M human expert moves. Predicts human actions.

↓

2. RL Policy Network p_ρ

Fine-tuned via self-play REINFORCE. Wins 80% vs SL policy.

↓

3. Value Network v_θ

Trained on RL self-play positions. Predicts P(win | position).

↓

4. APV-MCTS

Tree search guided by policy + value networks. Combines all components.

The role of RL: The SL policy learned to imitate human experts. The RL policy learned to win. These are different objectives — the most "human-like" move is not always the best move. RL self-play improved the policy beyond human expert level. The value network then distilled this improved policy into a position evaluator for search.

The tree search (APV-MCTS) evaluated positions by combining the value network's estimate with the average outcome of fast rollouts using a lightweight policy. This ensemble approach was more robust than either alone. The MCTS used the SL policy network as a prior for action selection, focusing search on promising moves.
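The two mechanisms in this paragraph, a prior-weighted exploration bonus and a blended leaf evaluation, can be sketched as below. This is a simplified single-node illustration, not AlphaGo's asynchronous search; the constant `c_puct`, the mixing weight `mix`, and the `Edge` structure are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Edge:
    prior: float              # P(s, a) from the policy network
    visits: int = 0           # N(s, a)
    total_value: float = 0.0  # sum of leaf evaluations backed up through this edge

    @property
    def q(self) -> float:
        """Mean backed-up value, or 0 for an unvisited edge."""
        return self.total_value / self.visits if self.visits else 0.0


def select_action(edges: Dict[str, Edge], c_puct: float = 1.0) -> str:
    """Pick the action maximizing Q plus a prior-weighted exploration bonus."""
    total_visits = sum(e.visits for e in edges.values())

    def score(action: str) -> float:
        e = edges[action]
        # The bonus is large for high-prior, rarely visited moves and
        # shrinks as the move accumulates visits.
        bonus = c_puct * e.prior * math.sqrt(total_visits) / (1 + e.visits)
        return e.q + bonus

    return max(edges, key=score)


def leaf_value(value_net_estimate: float, rollout_outcome: float, mix: float = 0.5) -> float:
    """Blend the value network's estimate with a fast-rollout outcome."""
    return (1 - mix) * value_net_estimate + mix * rollout_outcome


edges = {
    "a": Edge(prior=0.6, visits=10, total_value=5.0),
    "b": Edge(prior=0.3, visits=1, total_value=0.4),
    "c": Edge(prior=0.1),
}
chosen = select_action(edges)
```

In this toy position the search picks "b": it has a lower mean value than "a" but has barely been explored, so its bonus dominates.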

AlphaGo's Move 37 in Game 2 against Lee Sedol has become legendary — a move that no human would play, that expert commentators initially called a mistake, but that turned out to be brilliant. It was a product of RL self-play discovering strategies beyond human knowledge.

Check: Why did AlphaGo need both a supervised learning policy and an RL policy?

• The SL policy was too slow for real-time play
• The RL policy couldn't learn from scratch
• SL learns to imitate humans, but RL learns to win; the best move is not always the most human-like move

Chapter 7: AlphaGo Zero

Just one year after AlphaGo's triumph, DeepMind released AlphaGo Zero (2017). It was simpler, stronger, and used no human data at all. Starting from random play, it learned entirely through self-play RL, defeating the original AlphaGo 100-0.

The architecture was dramatically streamlined. Instead of four separate networks and a complex training pipeline, AlphaGo Zero used a single two-headed network:

AlphaGo (2016)

• SL policy + RL policy + value net + rollout policy
• Trained on 30M positions from human expert games
• RL fine-tuning as second stage
• Rollouts for position evaluation
• Multiple networks, complex pipeline

AlphaGo Zero (2017)

• One two-headed network: policy + value
• Zero human data
• Self-play from random initialization
• No rollouts — value head is sufficient
• Simpler, stronger, 100-0 vs AlphaGo

The training loop: AlphaGo Zero uses MCTS guided by the current network to generate training data. The network is then trained to predict (1) the MCTS-refined action probabilities (policy target) and (2) the eventual game outcome (value target). The MCTS acts as a policy improvement operator — it produces better action probabilities than the raw network, which the network then learns to imitate. This is a beautiful self-improving loop.
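The two training targets can be made concrete with a small sketch. It assumes the policy target is formed by normalizing MCTS visit counts and that the loss is a squared value error plus a policy cross-entropy; weight regularization is omitted, and the function names and the `temperature` parameter are illustrative.

```python
import math
from typing import Dict


def policy_target(visit_counts: Dict[str, int], temperature: float = 1.0) -> Dict[str, float]:
    """Turn MCTS visit counts into a probability distribution over moves."""
    powered = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(powered.values())
    return {a: p / total for a, p in powered.items()}


def two_headed_loss(
    pi: Dict[str, float],  # MCTS-refined move probabilities (policy target)
    p: Dict[str, float],   # policy head output
    z: float,              # game outcome from the current player's view
    v: float,              # value head output
) -> float:
    """Squared value error plus cross-entropy between pi and p."""
    value_loss = (z - v) ** 2
    # Skip zero-probability targets to avoid evaluating log(0).
    policy_loss = -sum(pi[a] * math.log(p[a]) for a in pi if pi[a] > 0)
    return value_loss + policy_loss


pi = policy_target({"a": 3, "b": 1})
loss = two_headed_loss(pi={"a": 1.0, "b": 0.0}, p={"a": 0.5, "b": 0.5}, z=1.0, v=0.0)
```

The loss is zero only when the policy head matches the search probabilities and the value head predicts the actual outcome, which is what drives the network toward the MCTS-improved policy.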

Figure: AlphaGo Zero self-play loop. MCTS generates better policies than the raw network, the network learns from MCTS, and the better network in turn makes MCTS stronger. Elo rating climbs across iterations, starting from random play at iteration 0.

The successor, AlphaZero (2018), applied the same algorithm to chess, shogi, and Go, achieving superhuman play in all three from scratch. The universality of the approach is its greatest achievement: no human game data and no game-specific knowledge beyond the rules.

Check: What makes AlphaGo Zero's training loop self-improving?

• MCTS produces better policies than the raw network, the network learns to match MCTS output, which then improves the next round of MCTS
• It uses a larger network each iteration
• It gradually adds more human games to the training set

Chapter 8: Other Applications

Games get the headlines, but RL's impact extends far beyond entertainment. Here are several application domains where RL has made practical contributions:

Domain	Method	Key Insight
Data center cooling (Google)	Deep RL	Up to 40% reduction in cooling energy
Web service optimization	Contextual bandits	Personalize content in real time
Thermal soaring (gliders)	Sarsa	Autonomous soaring in updrafts
Robotics locomotion	Policy gradient + sim	Sim-to-real transfer for walking
Chip design (Google)	Graph RL	Superhuman floor planning
Recommendation systems	Off-policy learning	Optimize long-term engagement

Thermal soaring: An RL-controlled glider learned to exploit rising columns of warm air (thermals) to stay aloft indefinitely — just as real birds do. The agent learned a policy mapping wind measurements to banking angles. It had no explicit model of atmospheric physics; it discovered the circular soaring pattern purely through trial and reward. Nature and RL converge on the same solution.

Web service optimization is perhaps the largest commercial application of RL. Every time a website decides which ad, article, or product to show you, it's solving a contextual bandit problem (a one-step RL problem). The state is your profile; the action is the content; the reward is whether you click. Companies like Netflix, Amazon, and Google use RL-inspired methods at enormous scale.
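A contextual bandit of this kind can be sketched with a table of running averages. This is a toy ε-greedy learner with discrete contexts, not how any named company does it; the class name, context labels, and reward values are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence


class EpsilonGreedyBandit:
    """Tabular contextual bandit: one running-average reward per (context, action)."""

    def __init__(self, actions: Sequence[str], epsilon: float = 0.1, seed: int = 0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)

    def choose(self, context: Hashable) -> str:
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[(context, a)])

    def update(self, context: Hashable, action: str, reward: float) -> None:
        # Incremental sample average of observed rewards (e.g. clicks).
        key = (context, action)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


bandit = EpsilonGreedyBandit(["ad", "article"], epsilon=0.0)
bandit.update("sports_fan", "article", 1.0)  # the user clicked
bandit.update("sports_fan", "ad", 0.0)       # the user ignored it
shown = bandit.choose("sports_fan")
```

There is no next state in the update, which is what makes this a one-step problem. Optimizing long-term engagement would bring the bootstrapped targets of full RL back in.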

The common theme: RL shines whenever the problem involves sequential decisions, delayed consequences, and an environment too complex for hand-crafted rules. When you can't write down the optimal solution but you can define a reward signal, RL is the tool to reach for.

Check: What makes web recommendation systems an RL problem?

• They use neural networks
• The system must choose actions (content to display) based on state (user profile) to maximize reward (engagement), often with delayed consequences
• They require GPU clusters

Chapter 9: Summary

This chapter surveyed RL's greatest hits — from Samuel's 1959 checkers program to AlphaGo Zero's superhuman Go from scratch. Each application taught the field something new about what works, and each relied on ideas from earlier chapters of this book.

Application	Year	Key Technique	Chapter
Samuel's Checkers	1959	TD + linear approx + self-play	6, 9
TD-Gammon	1992	TD(λ) + neural net + afterstates	6, 9, 12
Watson wagering	2011	TD value estimation	6
Memory controller	2008	Sarsa + tile coding	6, 9
DQN / Atari	2015	Q-learning + CNN + replay + target net	6, 9, 10
AlphaGo	2016	SL + RL policy + value net + MCTS	8, 13
AlphaGo Zero	2017	Self-play + MCTS + two-headed net	8, 13

The trajectory: Look at the progression: from linear features (1959) to shallow neural nets (1992) to deep CNNs (2015) to ResNets (2017). The RL algorithms are largely the same — TD learning, Q-learning, policy gradients, MCTS. What changed was the function approximation. Better representations enabled bigger problems. This is the central lesson of Part II of the book.

What comes next: Chapter 15 showed how RL operates in the brain. This chapter showed how it operates in engineering. Chapter 17 looks ahead — at the open problems, the unexplored directions, and the future of RL. The algorithms we have are powerful, but the hardest challenges remain unsolved.

"The most exciting phrase to hear in science, the one that heralds new discoveries,
is not 'Eureka!' but 'That's funny...'"
— Isaac Asimov

Check: What is the common factor in the progression from Samuel's Checkers to AlphaGo Zero?

• Each used a fundamentally different RL algorithm
• The core RL algorithms (TD, Q-learning, policy gradient) stayed largely the same; the breakthroughs came from better function approximation
• Each required more human knowledge than the last
