From backgammon to Atari to Go — RL conquers games, hardware, and the real world.
For most of this book, RL has been a collection of algorithms applied to small, clean problems: gridworlds, random walks, mountain cars. This chapter shows what happens when those algorithms meet the real world — or at least very challenging simulated worlds. The results have been spectacular.
RL has achieved superhuman performance in backgammon (1992), Atari games (2015), the game of Go (2016), and StarCraft II (2019). But it's not just games. RL controls data center cooling at Google, schedules DRAM memory controllers, manages web service recommendations, and enables robots to walk and manipulate objects.
We'll trace this progress through a series of landmark applications, from the earliest (Samuel's checkers, 1959) to AlphaGo Zero (2017). Each taught the RL community something new about what works and why.
In 1992, Gerald Tesauro created TD-Gammon, a backgammon program that learned entirely through self-play. It played millions of games against itself, using TD(λ) to learn a value function approximated by a neural network. The result: near-expert-level play that surprised even Tesauro.
Backgammon has roughly 10²⁰ states, far too many for a lookup table. Tesauro used a neural network with about 80 hidden units to approximate V(s). The key insight was using afterstate values: evaluating the board position after the player moves but before the dice roll, which removes one source of randomness.
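To make the afterstate idea concrete, here is a minimal sketch of move selection by afterstate value. The game below is a made-up one-dimensional race, not backgammon, and `toy_legal_moves`, `toy_apply_move`, and `toy_value` are hypothetical stand-ins for a real move generator and a learned value function.

```python
def choose_move(state, roll, legal_moves, apply_move, value):
    """Return the move whose afterstate has the highest estimated value.

    The afterstate is the position after our move but before the next
    dice roll, so `value` never has to reason about our own roll.
    """
    best_move, best_value = None, float("-inf")
    for move in legal_moves(state, roll):
        afterstate = apply_move(state, move)
        v = value(afterstate)
        if v > best_value:
            best_move, best_value = move, v
    return best_move, best_value


# --- Toy game (an assumption for illustration, not backgammon) ---
def toy_legal_moves(state, roll):
    # Either advance by the full roll or by half of it (rounded down).
    return [roll, roll // 2]


def toy_apply_move(state, move):
    # The track ends at square 10.
    return min(state + move, 10)


# Hypothetical value table: closer to 10 is better, but square 8 is a trap.
toy_value = {pos: pos / 10 for pos in range(11)}
toy_value[8] = 0.1

best_move, best_value = choose_move(
    2, 6, toy_legal_moves, toy_apply_move, toy_value.get
)
print(best_move, best_value)  # the half-roll move (3) avoids the trap on 8
```

In TD-Gammon the role of `toy_value.get` would be played by the neural network's forward pass; the selection loop itself stays this simple.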
TD-Gammon went through several versions. Version 0 used only raw board features. Later versions added hand-crafted features and larger networks, eventually reaching a level where expert players said it had changed how humans play backgammon. It discovered novel strategies that humans hadn't considered in 5,000 years of play.
Long before TD-Gammon, Arthur Samuel at IBM created a checkers-playing program (1959) that is widely regarded as one of the earliest demonstrations of machine learning. It learned from self-play, using what we now recognize as temporal-difference methods combined with a linear value function.
Samuel's program used a weighted sum of board features (number of pieces, advancement, center control) as its value function. It updated these weights using something remarkably close to TD(0) — adjusting predictions based on the difference between the current evaluation and the next position's evaluation.
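The update described above can be sketched as a TD(0) step on a linear evaluation function. This is a modern reading of the idea, not a reconstruction of Samuel's actual code; the three features, the step size `alpha`, and the explicit `reward` term are assumptions made for illustration.

```python
def evaluate(weights, features):
    """Linear evaluation: a weighted sum of board features."""
    return sum(w * x for w, x in zip(weights, features))


def td0_update(weights, features, next_features, reward=0.0,
               alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) step: nudge the current evaluation toward the next one.

    Returns the new weight list and the TD error that drove the change.
    """
    v = evaluate(weights, features)
    v_next = 0.0 if terminal else evaluate(weights, next_features)
    td_error = reward + gamma * v_next - v
    # For a linear evaluator the gradient w.r.t. each weight is its feature.
    new_weights = [w + alpha * td_error * x for w, x in zip(weights, features)]
    return new_weights, td_error


# Hypothetical features: (piece advantage, advancement, center control).
weights = [1.0, 0.0, 0.0]
position = [1, 2, 0]
next_position = [2, 2, 1]

weights, td_error = td0_update(weights, position, next_position)
print(td_error)  # positive: the next position looked better than predicted
print(weights)
```

Only the weights of features that were active in the current position move, and they move in proportion to how surprised the evaluator was.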
The program had two key innovations. First, it learned from self-play by maintaining two copies of the value function: a "stable" version and a learning version. Second, it used a form of feature selection, periodically replacing the worst-performing features. While it never reached the very top level of play, it demonstrated the fundamental viability of learning from experience in complex domains.
Looking back with modern eyes, Samuel's system was a linear function approximation + TD + self-play system. All three ideas became cornerstones of RL. The main limitation was the linear approximation — it took neural networks (TD-Gammon) to break through to truly high-level play.
IBM's Watson famously won the Jeopardy! game show in 2011, defeating two of the greatest human champions. While Watson's question-answering system was primarily NLP, it used RL for a critical component: wagering strategy.
In Jeopardy!, players must decide how much to wager on Daily Doubles and in Final Jeopardy. The optimal wager depends on the current score, the opponent's score, the number of clues remaining, and the player's confidence in answering correctly. This is a sequential decision problem with a clear reward (winning the game) — a natural fit for RL.
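A wagering decision of this kind can be sketched as a one-step lookahead over a win-probability estimate. The logistic `make_win_prob` below is a hypothetical stand-in for a learned value function, and it looks only at the score gap to a single opponent; a real system would also condition on the other inputs listed above, such as clues remaining.

```python
import math


def make_win_prob(opponent_score, scale=2000.0):
    """Hypothetical value function: win probability from the score gap."""
    def win_prob(my_score):
        return 1.0 / (1.0 + math.exp(-(my_score - opponent_score) / scale))
    return win_prob


def best_wager(my_score, p_correct, win_prob, step=100):
    """Pick the wager that maximizes expected win probability.

    Expected value of wagering w:
        p_correct * V(score + w) + (1 - p_correct) * V(score - w)
    """
    best_w, best_value = 0, float("-inf")
    for w in range(0, my_score + 1, step):
        value = (p_correct * win_prob(my_score + w)
                 + (1.0 - p_correct) * win_prob(my_score - w))
        if value > best_value:  # strict: ties keep the smaller wager
            best_w, best_value = w, value
    return best_w, best_value


# Trailing and fairly confident: betting big is the only way back.
trailing = best_wager(5000, 0.8, make_win_prob(opponent_score=8000))
# Leading with a coin-flip clue: protect the lead.
leading = best_wager(10000, 0.5, make_win_prob(opponent_score=6000))
print(trailing[0], leading[0])
```

Under these toy assumptions the same rule produces aggressive bets when behind and conservative bets when ahead, which is the qualitative behavior the text describes.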
The wagering strategy was surprisingly important. In tournament play, the right wager at the right time can turn a losing game into a winning one. Watson's TD-learned strategy was near-optimal in many scenarios and outperformed the heuristic strategies that human players typically use.
This is a nice example of RL complementing other AI techniques. Watson's NLP answered the questions; RL made the strategic decisions. The lesson: even when the core task is not an RL problem, the meta-decisions (what to risk, when to be aggressive) often are.
Not all RL applications involve games. Ipek et al. (2008) used RL to control a DRAM memory scheduler — the hardware component that decides the order in which memory access requests are served. This is a real-time control problem with microsecond-scale decisions and significant impact on computer performance.
The state includes the current queue of pending requests, the status of each memory bank, and recent access patterns. The actions are scheduling decisions: which request to serve next. The reward is a function of memory throughput and latency. The algorithm was Sarsa with tile coding for function approximation.
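Here is a minimal sketch of Sarsa with tile coding, under heavy simplification. The state is a single scalar in [0, 1) (think of it as a hypothetical normalized queue occupancy), whereas the real scheduler state has several components; the tiling sizes and step size are also assumptions.

```python
class TileCoder:
    """Uniformly offset tilings over one scalar feature in [0, 1)."""

    def __init__(self, n_tilings=4, tiles_per_tiling=8):
        self.n_tilings = n_tilings
        self.tiles_per_tiling = tiles_per_tiling
        # Each shifted tiling needs one extra tile to cover the overhang.
        self.n_features = n_tilings * (tiles_per_tiling + 1)

    def active_tiles(self, x):
        """Return one active tile index per tiling for input x."""
        tiles = []
        for t in range(self.n_tilings):
            offset = t / (self.n_tilings * self.tiles_per_tiling)
            index = int((x + offset) * self.tiles_per_tiling)
            tiles.append(t * (self.tiles_per_tiling + 1) + index)
        return tiles


def q_value(weights, tiles, action):
    """Q(s, a) is just the sum of a handful of weights: cheap to evaluate."""
    return sum(weights[action][i] for i in tiles)


def sarsa_update(weights, tiles, action, reward,
                 next_tiles, next_action, alpha=0.5, gamma=0.9):
    """On-policy TD update using the action actually chosen in the next state."""
    delta = (reward
             + gamma * q_value(weights, next_tiles, next_action)
             - q_value(weights, tiles, action))
    step = alpha / len(tiles)  # spread the step size across the active tiles
    for i in tiles:
        weights[action][i] += step * delta
    return delta


coder = TileCoder()
n_actions = 2  # hypothetical: e.g. "serve oldest request" vs "serve row hit"
weights = [[0.0] * coder.n_features for _ in range(n_actions)]

s, s_next = coder.active_tiles(0.3), coder.active_tiles(0.9)
delta = sarsa_update(weights, s, 0, reward=1.0, next_tiles=s_next, next_action=1)
print(delta, q_value(weights, s, 0))
```

Evaluating Q(s, a) touches only as many weights as there are tilings, which is the property that makes this representation attractive when decisions have to be made very quickly.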
This application demonstrated that RL can work in domains with extremely tight latency constraints. The tile coding representation (Chapter 9) was essential — it provided the speed needed for real-time inference while still capturing the relevant state features. This is a template for applying RL to any real-time control problem.
In 2015, DeepMind published a paper that changed the field: a single RL algorithm, using only raw pixel inputs, learned to play 49 different Atari games, reaching human-level performance or better on 29 of them. The algorithm was DQN (Deep Q-Network), and it marked the beginning of deep reinforcement learning.
DQN is Q-learning (Chapter 6) with a deep convolutional neural network as the function approximator. The input is a stack of 4 consecutive frames (84×84 grayscale). The network outputs Q(s,a) for all 18 possible Atari actions. The agent selects actions ε-greedily. But making this work required two critical innovations: experience replay, which trains on randomly sampled past transitions, and a separate target network, which is held fixed between periodic updates.
Figure: the DQN pipeline. Raw pixels enter the CNN and Q-values come out. Experience replay stabilizes training. The target network prevents oscillation.
The significance of DQN cannot be overstated. It showed that a single, general-purpose RL algorithm could learn complex behaviors from raw sensory inputs across many diverse domains. No game-specific engineering, no hand-crafted features. Just pixels and a score. This was the "ImageNet moment" for RL.
```pseudocode
# DQN Training Loop
Initialize replay buffer D, Q-network θ, target network θ⁻ = θ
for each step:
    a ← ε-greedy from Q(s, ·; θ)
    observe r, s'
    store (s, a, r, s') in D
    # Sample random minibatch from D
    (s_j, a_j, r_j, s'_j) ~ D
    # Compute target using frozen target network
    y_j = r_j + γ max_a' Q(s'_j, a'; θ⁻)
    # Update Q-network
    θ ← θ − α ∇_θ (y_j − Q(s_j, a_j; θ))²
    # Periodically: θ⁻ ← θ
```
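The two stabilizing pieces of the loop, the replay buffer and the frozen-target computation, can be sketched in plain Python without any deep learning library. The `done` flag (no bootstrapping from terminal states) is an addition not shown in the pseudocode, and `target_q` is a hypothetical callable standing in for the target network's forward pass.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive frames.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma=0.99):
    """y_j = r_j + gamma * max_a' Q(s'_j, a'; theta_minus), or r_j if terminal.

    `target_q(state)` must return a list of action values computed with the
    frozen target parameters, not the parameters being trained.
    """
    targets = []
    for _s, _a, r, s_next, done in batch:
        if done:
            targets.append(r)
        else:
            targets.append(r + gamma * max(target_q(s_next)))
    return targets


buffer = ReplayBuffer(capacity=2)
buffer.store("s0", 0, 1.0, "s1", False)
buffer.store("s1", 1, 0.0, "s2", False)
buffer.store("s2", 0, 2.0, None, True)  # evicts the oldest transition

frozen_q = lambda state: [0.5, 2.0]  # hypothetical target-network output
print(td_targets(buffer.sample(2), frozen_q, gamma=0.9))
```

The gradient step on θ is left out here because it depends on the network library; everything it needs (the sampled minibatch and the targets) is produced by these two pieces.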
Go was long considered the "grand challenge" of board game AI. With 10¹⁷⁰ possible positions (vastly more than chess's 10⁴⁷), brute-force search was hopeless. Human intuition, the ability to look at a board and "feel" which areas are important, seemed essential. In 2016, AlphaGo defeated world champion Lee Sedol 4-1, using a remarkable fusion of deep learning, RL, and search.
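AlphaGo's pipeline combined four components, trained in sequence: a supervised-learning (SL) policy network, an RL policy network, a value network, and a fast rollout policy.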
The tree search (APV-MCTS) evaluated positions by combining the value network's estimate with the average outcome of fast rollouts using a lightweight policy. This ensemble approach was more robust than either alone. The MCTS used the SL policy network as a prior for action selection, focusing search on promising moves.
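The two ideas in this paragraph, mixing the value estimate with a rollout outcome and using the policy network as a search prior, can be sketched as follows. The equal mixing weight, the exploration constant `c_puct`, and the exact form of the exploration bonus are assumptions made for this sketch rather than details stated above.

```python
import math


def leaf_value(value_net_estimate, rollout_outcome, mix=0.5):
    """Blend the value network's estimate with a fast-rollout result.

    mix=0 trusts only the value network; mix=1 trusts only the rollout.
    """
    return (1.0 - mix) * value_net_estimate + mix * rollout_outcome


def select_action(stats, c_puct=1.0):
    """Pick the child maximizing Q + U, where U is a prior-weighted bonus.

    `stats` maps action -> (prior, visit_count, mean_value). The bonus is
    large for moves the policy network likes and shrinks as visits grow.
    """
    total_visits = sum(n for _p, n, _q in stats.values())

    def score(action):
        prior, visits, mean_value = stats[action]
        bonus = c_puct * prior * math.sqrt(total_visits) / (1 + visits)
        return mean_value + bonus

    return max(stats, key=score)


# Early in search the prior dominates: the favored move gets tried first.
early = {"a": (0.7, 0, 0.0), "b": (0.3, 10, 0.1)}
# After many visits the measured values take over.
late = {"a": (0.7, 100, -0.2), "b": (0.3, 10, 0.4)}
print(select_action(early), select_action(late))  # a b
print(leaf_value(0.6, 1.0))
```

The prior focuses early simulations on plausible moves, but it cannot keep a move alive once repeated evaluations say it is bad.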
AlphaGo's Move 37 in Game 2 against Lee Sedol has become legendary — a move that no human would play, that expert commentators initially called a mistake, but that turned out to be brilliant. It was a product of RL self-play discovering strategies beyond human knowledge.
Just one year after AlphaGo's triumph, DeepMind released AlphaGo Zero (2017). It was simpler, stronger, and used no human data at all. Starting from random play, it learned entirely through self-play RL, defeating the original AlphaGo 100-0.
The architecture was dramatically streamlined. Instead of four separate networks and a complex training pipeline, AlphaGo Zero used a single two-headed network:
AlphaGo (2016)
• SL policy + RL policy + value net + rollout policy
• Trained on 30M positions from human expert games
• RL fine-tuning as second stage
• Rollouts for position evaluation
• Multiple networks, complex pipeline
AlphaGo Zero (2017)
• One two-headed network: policy + value
• Zero human data
• Self-play from random initialization
• No rollouts — value head is sufficient
• Simpler, stronger, 100-0 vs AlphaGo
Figure: the self-improvement cycle. MCTS generates better policies than the raw network. The network learns from MCTS. A better network makes better MCTS, and the Elo rating climbs.
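One turn of this cycle can be sketched as two small functions: turning MCTS visit counts into a training target for the policy head, and scoring the two-headed network against that target and the game outcome. The temperature parameter and the unweighted sum of the two loss terms are assumptions for this sketch, and weight regularization is left out.

```python
import math


def visit_counts_to_policy(visit_counts, temperature=1.0):
    """Convert MCTS visit counts into a probability distribution over moves.

    Assumes at least one visit in total. Lower temperatures sharpen the
    distribution toward the most-visited move.
    """
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [x / total for x in powered]


def two_headed_loss(policy_probs, value, search_policy, outcome):
    """Squared value error plus cross-entropy between search and network policy.

    policy_probs:  network policy head output (assumed strictly positive)
    value:         network value head output for the same position
    search_policy: target distribution from MCTS visit counts
    outcome:       final game result from this player's view (+1 or -1)
    """
    value_loss = (outcome - value) ** 2
    policy_loss = -sum(
        pi * math.log(p)
        for pi, p in zip(search_policy, policy_probs)
        if pi > 0.0
    )
    return value_loss + policy_loss


target = visit_counts_to_policy([10, 30, 60])
print(target)
# A network that already matches the search and predicts the win pays less.
print(two_headed_loss([0.1, 0.3, 0.6], 0.9, target, outcome=1.0))
print(two_headed_loss([0.6, 0.3, 0.1], 0.0, target, outcome=1.0))
```

Because the search policy is usually stronger than the raw network output, minimizing this loss pulls the network toward its own search-improved play, which is what drives the cycle.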
The successor, AlphaZero (2018), applied the same algorithm to chess, shogi, and Go, achieving superhuman play in all three from scratch. The universality of the approach (no human game data and no game-specific tuning beyond the rules) is its greatest achievement.
Games get the headlines, but RL's impact extends far beyond entertainment. Here are several application domains where RL has made practical contributions:
| Domain | Method | Key Insight |
|---|---|---|
| Data center cooling (Google) | Deep RL | 40% reduction in cooling energy |
| Web service optimization | Contextual bandits | Personalize content in real time |
| Thermal soaring (gliders) | Policy gradient | Autonomous soaring in updrafts |
| Robotics locomotion | Policy gradient + sim | Sim-to-real transfer for walking |
| Chip design (Google) | Graph RL | Superhuman floor planning |
| Recommendation systems | Off-policy learning | Optimize long-term engagement |
Web service optimization is perhaps the largest commercial application of RL. Every time a website decides which ad, article, or product to show you, it's solving a contextual bandit problem (a one-step RL problem). The state is your profile; the action is the content; the reward is whether you click. Companies like Netflix, Amazon, and Google use RL-inspired methods at enormous scale.
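A contextual bandit of this kind can be sketched with an ε-greedy learner that keeps a running average click rate per (context, action) pair. The context labels and article names are hypothetical, and production systems generalize across contexts with a model rather than a lookup table.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Tabular contextual bandit: one value estimate per (context, action)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)  # running mean reward

    def choose(self, context):
        """Explore with probability epsilon, otherwise show the best-known item."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[(context, a)])

    def update(self, context, action, reward):
        """Incremental sample average: no next state, so no bootstrapping."""
        key = (context, action)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


bandit = EpsilonGreedyContextualBandit(["article_a", "article_b"], epsilon=0.0)
bandit.update("sports_fan", "article_a", reward=1.0)  # clicked
bandit.update("sports_fan", "article_b", reward=0.0)  # ignored
print(bandit.choose("sports_fan"))  # article_a
```

The update has no next-state term, which is what makes this a one-step problem: each recommendation is judged only by its immediate click.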
The common theme: RL shines whenever the problem involves sequential decisions, delayed consequences, and an environment too complex for hand-crafted rules. When you can't write down the optimal solution but you can define a reward signal, RL is the tool to reach for.
This chapter surveyed RL's greatest hits — from Samuel's 1959 checkers program to AlphaGo Zero's superhuman Go from scratch. Each application taught the field something new about what works, and each relied on ideas from earlier chapters of this book.
| Application | Year | Key Technique | Chapter |
|---|---|---|---|
| Samuel's Checkers | 1959 | TD + linear approx + self-play | 6, 9 |
| TD-Gammon | 1992 | TD(λ) + neural net + afterstates | 6, 9, 12 |
| Watson wagering | 2011 | TD value estimation | 6 |
| Memory controller | 2008 | Sarsa + tile coding | 6, 9 |
| DQN / Atari | 2015 | Q-learning + CNN + replay + target net | 6, 9, 10 |
| AlphaGo | 2016 | SL + RL policy + value net + MCTS | 8, 13 |
| AlphaGo Zero | 2017 | Self-play + MCTS + two-headed net | 8, 13 |
What comes next: Chapter 15 showed how RL operates in the brain. This chapter showed how it operates in engineering. Chapter 17 looks ahead — at the open problems, the unexplored directions, and the future of RL. The algorithms we have are powerful, but the hardest challenges remain unsolved.