Sutton & Barto, Chapter 17

RL Frontiers

The open problems, future directions, and the road from here to general intelligence.

Prerequisites: All prior chapters (especially 6, 8, 13, 16). This is the capstone.
10 chapters · 3 simulations · 10 quizzes

Chapter 0: Beyond Standard RL

Throughout this book, we've studied a clean, well-defined framework: an agent interacts with an environment, observes states, takes actions, receives rewards, and learns to maximize cumulative reward. This framework is powerful — it gave us everything from TD-Gammon to AlphaGo Zero. But the real world is messier.

In the real world, the state isn't fully observable. The reward function isn't given — it must be designed or learned. Actions take variable amounts of time. Agents must learn many things simultaneously, not just one task. These are the frontier problems of RL, and solving them is essential for moving from game-playing to general intelligence.

What's missing: The standard RL framework assumes (1) full state observability, (2) a fixed, given reward function, (3) primitive actions at fixed time steps, and (4) a single learning task. Each of these assumptions breaks in realistic applications. This chapter explores what happens when we relax them.

These aren't just academic concerns. Every robotics application faces partial observability. Every real deployment requires reward design. Every complex task benefits from temporal abstraction. The ideas in this chapter are where the field is heading.

Check: Which of these is NOT an assumption of the standard RL framework studied in this book?

Chapter 1: General Value Functions

Throughout this book, value functions have predicted one thing: future cumulative reward. But there's nothing special about reward. An agent could learn to predict any signal that changes over time — temperature, distance to a wall, time since last event, or the output of any sensor. These are General Value Functions (GVFs).

A GVF is defined by three components: a policy (what behavior to predict under), a cumulant (the signal being predicted, replacing reward), and a termination function (how far into the future to look, generalizing the discount). The standard value function is just one GVF where the cumulant is reward, the policy is the target policy, and the termination function is the constant discount γ.

$$V_{c,\pi,\gamma}(s) = \mathbb{E}_\pi\left[ C_{t+1} + \gamma C_{t+2} + \gamma^2 C_{t+3} + \cdots \mid S_t = s \right]$$

where C is the cumulant signal (any function of state), π is the prediction policy, and γ is the discount.

Knowledge as predictions: Sutton proposes that knowledge itself can be represented as a large collection of GVFs. "If I walk forward, how many steps until I hit a wall?" is a GVF (cumulant = 1 per step, policy = walk forward, termination = hitting a wall). "If I do nothing, will the temperature rise?" is another. Thousands of such GVFs together form a rich predictive model of the world.

GVFs can be learned with standard TD methods — the same algorithms work, just with a different cumulant instead of reward. This means an agent can learn hundreds of predictive questions about its environment simultaneously, using off-policy methods, even while pursuing a different behavioral policy.
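To make "the same algorithms work, just with a different cumulant" concrete, here is a minimal sketch of tabular TD(0) learning the wall question from above. The corridor layout, the function name `learn_gvf`, and all parameter values are assumptions made for illustration, not part of the original text.

```python
import random

def learn_gvf(n_states=5, gamma=0.9, alpha=0.1, episodes=2000, seed=0):
    """Tabular TD(0) for one GVF on a hypothetical corridor.

    Question: "walking right, how many (discounted) steps until the wall?"
    cumulant = 1 per step, policy = always move right, termination at the wall.
    """
    rng = random.Random(seed)
    v = [0.0] * (n_states + 1)        # v[n_states] is the wall; it stays 0
    for _ in range(episodes):
        s = rng.randrange(n_states)   # random start cell
        while s < n_states:
            s_next = s + 1            # prediction policy: always step right
            cumulant = 1.0            # the signal being predicted (not reward)
            # termination function: stop looking ahead once the wall is hit
            cont = 0.0 if s_next == n_states else gamma
            v[s] += alpha * (cumulant + cont * v[s_next] - v[s])
            s = s_next
    return v

values = learn_gvf()
print([round(x, 2) for x in values])
```

The update line is ordinary TD(0); only the quantity fed in as the "reward" has changed. Running many such learners side by side, each with its own cumulant and termination function, gives the collection of predictions described above.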

Check: What distinguishes a General Value Function from a standard value function?

Chapter 2: Auxiliary Tasks

In deep RL, representation is everything. The agent must learn useful features from raw sensory input before it can learn a good policy. But reward signals are sparse — in many environments, the agent receives reward rarely, providing weak gradient signal for feature learning. Auxiliary tasks supplement the main reward with additional learning signals that shape better representations.

The idea is simple: add extra prediction heads to the network. The main head outputs the policy and value for the RL objective. Other heads predict pixel changes, immediate rewards, next-frame features, or GVF cumulants. All heads share the same representation layers, so the auxiliary losses provide a dense gradient signal that shapes features even when reward is sparse.

Architecture:

• Shared representation: CNN / encoder processing raw observations
• Main head: policy + value (for the RL objective)
• Auxiliary head 1: pixel change prediction
• Auxiliary head 2: reward prediction / GVFs

UNREAL agent: DeepMind's UNREAL agent (2016) added auxiliary tasks to A3C and reported roughly a 10x mean speedup in learning on the 3D Labyrinth tasks, along with improved Atari scores. The auxiliary tasks included (1) pixel control, which learns to act so as to maximally change regions of the image (focusing on interesting parts of the scene), and (2) reward prediction, which predicts the immediate reward from short sequences sampled from replay. These add little computational cost but substantially improve representation quality.

Auxiliary tasks are closely related to GVFs — each auxiliary prediction head is learning a GVF. The deep learning perspective adds the crucial insight: these auxiliary tasks aren't just for their own sake. They exist to shape the shared representation, making the main task easier. This is representation learning via prediction.
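A toy calculation can show why the shared layers still learn when reward is absent. The sketch below assumes a deliberately tiny "network" with one shared weight and two linear heads; the function name `shared_grad` and all numbers are hypothetical, and a real agent would use a deep encoder and automatic differentiation.

```python
def shared_grad(w, a, b, x, y_main, y_aux, aux_weight=0.5):
    """Gradient of the total loss w.r.t. the shared weight w, split by source.

    Total loss = (a*h - y_main)^2 + aux_weight * (b*h - y_aux)^2, with h = w*x.
    """
    h = w * x                    # shared representation (a single feature)
    main_err = a * h - y_main    # main head error (e.g. value target)
    aux_err = b * h - y_aux      # auxiliary head error (e.g. pixel change)
    g_main = 2.0 * main_err * a * x
    g_aux = aux_weight * 2.0 * aux_err * b * x
    return g_main, g_aux

# Sparse-reward situation: the value target is 0 and the head already predicts 0,
# so the main loss contributes no gradient to the shared weight.
g_main, g_aux = shared_grad(w=0.0, a=1.0, b=1.0, x=2.0, y_main=0.0, y_aux=3.0)
print(g_main, g_aux)  # 0.0 -6.0
```

In this example the main loss contributes nothing, yet the auxiliary head still pushes the shared weight toward a feature that tracks the input.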

Check: Why do auxiliary tasks help even though they don't directly relate to the main reward?

Chapter 3: Options

When you decide to "go to the kitchen," you don't plan individual muscle contractions. You invoke a high-level action — an option — that handles the details automatically. You plan at the level of rooms and destinations, not individual steps. This is temporal abstraction, and it's one of the most important ideas in RL.

An option (Sutton, Precup & Singh, 1999) is a temporally extended action defined by three components: an initiation set I (which states can the option start in?), an internal policy π (how does it behave?), and a termination condition β (when does it stop?). Primitive actions are just options that last one step.
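The three components map directly onto a small data structure. The following is a sketch under stated assumptions: the `Option` class, the `run_option` helper, and the ten-cell corridor are illustrative names and choices, not an API from the original text.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[int]             # I: states where the option may start
    policy: Callable[[int], int]         # pi: state -> primitive action
    termination: Callable[[int], float]  # beta: state -> probability of stopping

def run_option(option, state, step, rng):
    """Execute an option until it terminates; return (final_state, steps_taken)."""
    assert state in option.initiation_set
    k = 0
    while True:
        state = step(state, option.policy(state))
        k += 1
        if rng.random() < option.termination(state):
            return state, k

# Hypothetical corridor with cells 0..9 and a door at cell 9.
step = lambda s, a: max(0, min(9, s + a))
go_to_door = Option(set(range(9)), lambda s: +1, lambda s: 1.0 if s == 9 else 0.0)
# A primitive action is an option that always terminates after one step.
move_right = Option(set(range(10)), lambda s: +1, lambda s: 1.0)

rng = random.Random(0)
print(run_option(go_to_door, 3, step, rng))  # (9, 6)
print(run_option(move_right, 3, step, rng))  # (4, 1)
```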

Why options matter: Planning with options can be dramatically faster than planning with primitive actions, because each planning step covers many time steps. To plan a path across a building, you might need 1000 primitive steps but only 5 options ("go to hallway, take elevator, walk to room 302..."). Options also enable transfer: a "navigate to door" option learned in one environment can often be reused in other environments with doors.

[Interactive simulation: Hierarchical Agent, Options vs Primitive Actions. Compares an agent using only primitive actions (left/right/up/down) with one using high-level options ("go to room") on a grid of adjustable size. Primitive planning explores every cell; option planning jumps between rooms.]

The options framework extends the standard MDP to a semi-MDP (SMDP), where actions can take variable numbers of time steps. The core RL algorithms generalize to SMDPs: Q-learning over options, policy gradient over options, and even model-based planning over options. The math is nearly identical, just with variable-length time steps.
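As one concrete instance, the SMDP form of the Q-learning update can be written as follows. The symbols k (number of steps the option ran), R (discounted reward accumulated while it ran), and α (step size) are notation introduced here for illustration.

```latex
Q(s, o) \leftarrow Q(s, o) + \alpha \left[ R + \gamma^{k} \max_{o'} Q(s', o') - Q(s, o) \right],
\qquad R = \sum_{i=1}^{k} \gamma^{i-1} r_i
```

With k = 1 this reduces to ordinary one-step Q-learning, which is the sense in which primitive actions are a special case of options.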

Check: What are the three components that define an option?

Chapter 4: Option Models

In Chapter 8, we learned that models let agents plan. A model says "if I take action a in state s, I'll transition to s' with reward r." For options, we need the same thing: "if I invoke option o in state s, where will I end up and how much reward will I accumulate?" These are option models.

An option model has two components: a reward model (the expected cumulative reward during the option) and a transition model (the probability distribution over termination states). Together, they describe what the option "does" without specifying how it does it internally.

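$$R(s, o) = \mathbb{E}\left[ r_1 + \gamma r_2 + \cdots + \gamma^{k-1} r_k \mid S_0 = s,\ \text{option } o \right]$$
$$P(s' \mid s, o) = \Pr\left( S_k = s' \mid S_0 = s,\ \text{option } o \right)$$
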
Planning speedup: With option models, the agent can plan at the level of options: "if I use the 'go-to-kitchen' option, I'll arrive at the kitchen in about 20 steps with zero reward, then I can use the 'get-food' option for reward." This is enormously faster than planning 20 primitive steps. Option models make high-level planning tractable.

Option models can be learned from experience, just like regular models (Chapter 8). The agent invokes an option, observes the cumulative reward and final state, and updates the model. Alternatively, they can be computed from the option's internal policy and the environment model. The Dyna architecture extends naturally: learn option models from experience, then plan using those models.
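Learning an option model from experience can be sketched as a Monte Carlo average over repeated invocations. The helper names `estimate_option_model` and `walk_to_door`, the slippery corridor, and the step cost of -1 are all assumptions made for this example.

```python
import random
from collections import Counter

def estimate_option_model(start, rollout, gamma=0.9, n=5000, seed=0):
    """Monte Carlo estimate of R(s, o) and P(s' | s, o) for one option."""
    rng = random.Random(seed)
    total_return = 0.0
    ends = Counter()
    for _ in range(n):
        rewards, final_state = rollout(start, rng)
        # r_1 + gamma * r_2 + ... + gamma^(k-1) * r_k
        total_return += sum(gamma ** i * r for i, r in enumerate(rewards))
        ends[final_state] += 1
    reward_model = total_return / n
    transition_model = {s: c / n for s, c in ends.items()}
    return reward_model, transition_model

def walk_to_door(state, rng, door=5, slip=0.2):
    """Hypothetical option: step right until the door; each step may slip."""
    rewards = []
    while state != door:
        if rng.random() >= slip:   # with probability `slip` the agent stays put
            state += 1
        rewards.append(-1.0)       # assumed cost of one time step
    return rewards, state

r_model, p_model = estimate_option_model(2, walk_to_door)
print(round(r_model, 2), p_model)
```

The planner only needs the two returned quantities. How the option moved through the corridor internally is hidden, which matches the description of a model that says what the option does without saying how.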

Check: What does an option model predict?

Chapter 5: Partial Observability

Almost every RL algorithm covered so far assumes the agent observes the full state. In a gridworld, the agent knows exactly which cell it occupies. In backgammon, the entire board is visible. But in the real world, you don't see the full state. A robot sees camera images, not the underlying physics. A trader sees prices, not the full economy. The state is partially observable.

The formal framework is the Partially Observable MDP (POMDP). There is a hidden state s, but the agent only sees an observation o that depends stochastically on s. Different states can produce the same observation (aliasing). The agent must maintain a belief state — a probability distribution over possible hidden states — and make decisions based on that belief.

Full Observability (MDP)

Agent sees state s directly. Policy maps states to actions: π(a|s). Value is a function of state: V(s). Standard RL works.

Partial Observability (POMDP)

Agent sees observation o, not state s. Must maintain belief b(s) = P(s|history). Policy maps beliefs to actions: π(a|b). Belief is the "state" of the POMDP.

The challenge: The belief state is a probability distribution over all possible states. For N states, the belief lives in an (N-1)-dimensional continuous space. Even a 100-state problem becomes a 99-dimensional continuous RL problem. POMDPs are fundamentally harder than MDPs. In practice, most deep RL systems sidestep this by using recurrent networks (LSTMs) to implicitly track belief from observation history.
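When the transition and observation probabilities are known, the belief b(s) can be tracked exactly with a Bayes filter. The sketch below assumes tables `T[a][s][s2]` and `O[s2][o]` (observation depends on the arrival state) and a two-state listening problem; these names and numbers are illustrative only.

```python
def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step: predict with T, correct with O, then normalize."""
    n = len(belief)
    # Prediction: probability of arriving in each state s2 after the action.
    predicted = [
        sum(T[action][s][s2] * belief[s] for s in range(n)) for s2 in range(n)
    ]
    # Correction: weight each state by how likely it is to emit the observation.
    unnorm = [O[s2][observation] * predicted[s2] for s2 in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical problem: two hidden states, one "listen" action that leaves the
# state unchanged, and an observation that is correct 85% of the time.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[0.85, 0.15], [0.15, 0.85]]

b = [0.5, 0.5]
b = belief_update(b, 0, 0, T, O)
print([round(p, 4) for p in b])  # [0.85, 0.15]
b = belief_update(b, 0, 0, T, O)
print([round(p, 4) for p in b])  # [0.9698, 0.0302]
```

Each update costs on the order of N² operations for N states, and the result is a point in the continuous belief space described above, which is why exact methods scale poorly.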

A common practical approach is to stack observations: use the last k frames as input, as DQN did with Atari. This helps with short-term partial observability (like knowing velocity from two position observations) but doesn't solve long-term dependencies. Recurrent networks offer a more principled solution, maintaining a learned hidden state that summarizes observation history.
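Observation stacking takes only a few lines. This is a minimal sketch with a hypothetical `FrameStack` class and scalar "frames" standing in for images.

```python
from collections import deque

class FrameStack:
    """Keep the last k observations and expose them as one stacked input."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the buffer with copies of the first observation.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_obs)
        return tuple(self.frames)

    def step(self, obs):
        self.frames.append(obs)   # the oldest frame is dropped automatically
        return tuple(self.frames)

# With two stacked position readings, velocity becomes recoverable.
stack = FrameStack(k=2)
stack.reset(0.0)
stack.step(0.5)
prev_pos, pos = stack.step(1.5)
print(pos - prev_pos)  # 1.0
```

Anything that happened more than k steps ago is invisible to the agent, which is the long-term dependency limitation noted above.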

Check: What makes POMDPs harder than MDPs?

Chapter 6: Predictive State Representations

POMDPs define the hidden state in terms of the underlying dynamics. But the agent never sees those dynamics directly — it only sees observations and actions. Is there a way to represent state entirely in terms of observable quantities? Predictive State Representations (PSRs) offer exactly this.

A PSR represents the current state as a vector of predictions about future observations. Instead of saying "I believe I'm in state 3 with probability 0.7," a PSR says "if I take action LEFT, I predict I'll see observation A with probability 0.8; if I take action RIGHT, I predict observation B with probability 0.3." The state is the set of predictions.
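To see what such a prediction vector looks like numerically, the sketch below computes one-step predictions for a toy problem. It computes them from a known model and belief purely for illustration; a PSR learner would estimate the same quantities from experience without reference to hidden states. The function names, the tables `T[a][s][s2]` and `O[s2][o]`, and the numbers are all assumptions.

```python
def one_step_prediction(belief, action, observation, T, O):
    """P(next observation = o | history so far, take action a)."""
    n = len(belief)
    return sum(
        belief[s] * T[action][s][s2] * O[s2][observation]
        for s in range(n)
        for s2 in range(n)
    )

def predictive_state(belief, tests, T, O):
    """State as a vector of predictions, one per (action, observation) test."""
    return [one_step_prediction(belief, a, o, T, O) for a, o in tests]

# Hypothetical two-state world: action 0 stays, action 1 swaps the two states.
T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
O = [[0.8, 0.2], [0.2, 0.8]]   # each state emits its own observation 80% of the time

tests = [(0, 0), (1, 0)]       # "stay, then see 0" and "swap, then see 0"
state = predictive_state([0.7, 0.3], tests, T, O)
print([round(p, 2) for p in state])  # [0.62, 0.38]
```

Every entry of the resulting vector is a statement the agent could check by taking the action and looking at what it observes.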

Grounding state in observables: In a POMDP, the "hidden state" is a theoretical construct that may not correspond to anything the agent can ever verify. A PSR defines state only in terms of testable predictions: "what will I observe if I do X?" This is epistemologically cleaner — every component of the state representation is, in principle, verifiable from experience.

PSRs have a beautiful theoretical property: for any finite POMDP, there exists a PSR of finite dimension that captures the complete state. The dimension of the PSR is at most the number of hidden states, and can be smaller. In some cases, a compact PSR exists even when the full belief state is high-dimensional.

In practice, PSRs haven't yet been as successful as recurrent neural networks for handling partial observability. But the idea — that state should be grounded in predictions about future experience — has influenced how we think about representation learning in deep RL.

Check: How does a PSR represent the current state?

Chapter 7: Reward Design

Every RL application requires a reward function. For games, reward is obvious: win = +1, lose = -1. But for real-world tasks, designing the reward is often the hardest part. A robot tasked with cleaning a room might learn to cover the room with a cloth (technically "clean" but useless), or learn to push all objects off the table (the table is clean!).

The reward design problem is: how do you specify a reward function that actually incentivizes the behavior you want? This is harder than it sounds, because agents are remarkably creative at finding unintended shortcuts.

Reward hacking: A famous example: a simulated boat trained to maximize race score discovered that by spinning in circles and hitting small bonus targets, it could score higher than by finishing the race. The reward said "maximize score"; the agent obeyed. The designer's intent and the formal reward were not aligned. This is the reward hacking problem.

[Interactive simulation: Reward Shaping Explorer. Compares agent behavior with sparse reward (only at the goal) against shaped reward (distance-based guidance), controlled by a shaping-strength slider where 0 means pure sparse reward. Shaped rewards speed learning but can introduce bias if designed carelessly.]

Three approaches to reward design have emerged:

• Reward shaping adds an extra term F(s,s') to the reward to guide the agent. If F is potential-based, that is F(s,s') = γΦ(s') − Φ(s) for some potential function Φ over states, the optimal policy is provably preserved.
• Inverse RL infers the reward function from demonstrations of desired behavior.
• Reinforcement learning from human feedback (RLHF) has humans directly evaluate agent behavior, learns a reward model from those evaluations, and then optimizes the policy against that model.
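The reason potential-based shaping leaves the optimal policy alone is that the shaping terms telescope: along any trajectory that ends where the potential is zero, the shaped return equals the original return minus Φ at the start state. The sketch below checks this numerically. The corridor, the potential Φ(s) = -|goal - s|, and the function names are assumptions for illustration.

```python
def shaping_term(phi, s, s_next, gamma):
    """Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)

def discounted_return(path, rewards, gamma, phi=None):
    """Return along a path, optionally with the shaping term added to each reward."""
    total = 0.0
    for t, r in enumerate(rewards):
        f = shaping_term(phi, path[t], path[t + 1], gamma) if phi else 0.0
        total += gamma ** t * (r + f)
    return total

goal, gamma = 5, 0.9
phi = lambda s: -abs(goal - s)          # hypothetical potential: minus distance to goal

direct = ([2, 3, 4, 5], [0.0, 0.0, 1.0])
detour = ([2, 1, 2, 3, 4, 5], [0.0, 0.0, 0.0, 0.0, 1.0])

for path, rewards in (direct, detour):
    plain = discounted_return(path, rewards, gamma)
    shaped = discounted_return(path, rewards, gamma, phi)
    print(round(plain, 4), round(shaped, 4), round(shaped - plain, 4))
```

Both paths gain the same amount, 3 = -Φ(2), so shaping changes every return from a given start state by the same constant and cannot change which path is better.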

Check: What is "reward hacking"?

Chapter 8: Open Problems

Despite its remarkable successes, RL faces fundamental open challenges. These are not incremental improvements — they are problems where current methods fundamentally struggle. Solving them would transform the field.

Problem | Why It's Hard | Current Status
Sample efficiency | Real-world data is expensive | Model-based RL helps but doesn't solve it
Online deep learning | Neural nets are unstable with streaming data | Replay buffers are a workaround, not a solution
Safety | Exploration can be dangerous in the real world | Constrained MDPs, conservative methods
Curiosity / intrinsic motivation | Agents need to explore even without reward | Prediction error, count-based bonuses
Continual learning | Learning new tasks without forgetting old ones | Catastrophic forgetting remains unsolved
Representation learning | What features should the agent build? | Auxiliary tasks, world models help

The deadly triad revisited: In Chapter 11, we met the deadly triad: function approximation + bootstrapping + off-policy learning. This combination can diverge. Despite many proposed solutions (gradient TD, emphatic TD, Retrace), there is no fully satisfactory resolution. Deep RL often works in practice, but lacks the theoretical guarantees of tabular methods. Understanding why deep RL works as well as it does remains an open question.

Curiosity and intrinsic motivation address a fundamental chicken-and-egg problem: how do you learn in environments with no reward signal? The idea is to create an internal "reward" for encountering novel or unpredictable states. This drives the agent to explore its environment systematically, building knowledge that will be useful when extrinsic reward does appear.
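One of the simplest internal rewards is a count-based novelty bonus. The sketch below assumes the common form β / sqrt(N(s)), where N(s) is the visit count; the class name `CountBonus` and the value of β are illustrative choices.

```python
import math
from collections import defaultdict

class CountBonus:
    """Intrinsic reward that shrinks as a state becomes familiar."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

bonus = CountBonus(beta=1.0)
print(bonus("room_a"))            # 1.0 on the first visit
print(round(bonus("room_a"), 3))  # 0.707 on the second visit
print(bonus("room_b"))            # 1.0 again for a new state
```

Adding this bonus to the extrinsic reward (which may be zero everywhere) gives the agent a reason to keep moving toward states it has rarely seen. Exact counts only work for small discrete state spaces, so large problems need approximations such as the prediction-error signals listed in the table.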

Safety is perhaps the most practically important open problem. A robot learning to walk might fall and break itself. A medical RL system making a bad decision could harm a patient. We need RL methods that explore safely, respect constraints, and degrade gracefully. Constrained MDPs and safe exploration are active areas of research, but far from solved.

Check: What is the "deadly triad" and why does it remain a problem?

Chapter 9: Summary — The Complete Journey

This is the end of the road. Across 17 chapters, we've traveled from the simplest possible RL setting, a multi-armed bandit, to the frontiers of AI research. Let's step back and see the full arc.

Part | Chapters | Core Idea
Foundations | 1-3 | Bandits, MDPs, the RL problem
Tabular Methods | 4-8 | DP, MC, TD, n-step, planning
Function Approximation | 9-13 | Scale to large problems with parameterized models
Connections | 14-15 | Psychology, neuroscience: RL in nature
Applications & Frontiers | 16-17 | Where RL has worked, where it's going

The recurring themes: Several ideas have appeared again and again throughout this book: (1) Value estimation — predicting future reward is the engine of RL. (2) Prediction errors drive all learning, from Rescorla-Wagner to TD to policy gradients. (3) The model-free/model-based spectrum spans from pure caching to full planning. (4) The exploration-exploitation tradeoff appears in every chapter, from bandits to options.

What RL has achieved:

• Superhuman game play (Go, Atari, Chess)
• Real-world control (data centers, robotics)
• Deep neuroscience insights (dopamine = TD error)
• A unified framework for decision-making
• Foundation for RLHF in language models

What remains open:

• Sample efficiency for real-world tasks
• Safe exploration and deployment
• Continual learning without forgetting
• Reward design and alignment
• Online deep learning stability

Reinforcement learning occupies a unique position in AI. It is the only framework that addresses the full problem of intelligence: an agent, embedded in an environment, learning from consequences, pursuing goals. Supervised learning assumes a teacher. Unsupervised learning has no goals. RL is the complete package — perception, action, learning, planning — all driven by interaction with the world.

"The hardest part of artificial intelligence is the same as the hardest part
of natural intelligence: figuring out what to do next."
— Richard S. Sutton
Check: What is the central recurring idea throughout this entire book?