The open problems, future directions, and the road from here to general intelligence.
Throughout this book, we've studied a clean, well-defined framework: an agent interacts with an environment, observes states, takes actions, receives rewards, and learns to maximize cumulative reward. This framework is powerful — it gave us everything from TD-Gammon to AlphaGo Zero. But the real world is messier.
In the real world, the state isn't fully observable. The reward function isn't given — it must be designed or learned. Actions take variable amounts of time. Agents must learn many things simultaneously, not just one task. These are the frontier problems of RL, and solving them is essential for moving from game-playing to general intelligence.
These aren't just academic concerns. Every robotics application faces partial observability. Every real deployment requires reward design. Every complex task benefits from temporal abstraction. The ideas in this chapter are where the field is heading.
Throughout this book, value functions have predicted one thing: future cumulative reward. But there's nothing special about reward. An agent could learn to predict any signal that changes over time — temperature, distance to a wall, time since last event, or the output of any sensor. These are General Value Functions (GVFs).
A GVF is defined by three components: a policy (what behavior to predict under), a cumulant (the signal being predicted, replacing reward), and a termination function (how long into the future to look). The standard value function is just one GVF where the cumulant is reward, the policy is the target policy, and the termination function is the constant discount γ. With a constant discount, the GVF is

$$v_{\pi,\gamma,C}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, C_{t+k+1} \;\middle|\; S_t = s\right]$$

where C is the cumulant signal (any function of state), π is the prediction policy, and γ is the discount.
GVFs can be learned with standard TD methods — the same algorithms work, just with a different cumulant instead of reward. This means an agent can learn hundreds of predictive questions about its environment simultaneously, using off-policy methods, even while pursuing a different behavioral policy.
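As a minimal sketch of that claim, the tabular TD(0) update below is the usual one with the reward replaced by a cumulant. The ring environment, the `wall_sensor` cumulant, and all function names are hypothetical, chosen only for illustration. The sketch is on-policy for simplicity; an off-policy version would also need importance-sampling corrections.

```python
import random


def td0_gvf(num_states, step, cumulant, gamma=0.9, alpha=0.1,
            num_steps=20000, seed=0):
    """Tabular TD(0) estimate of a GVF for the policy baked into `step`."""
    rng = random.Random(seed)
    v = [0.0] * num_states
    s = 0
    for _ in range(num_steps):
        s_next = step(s, rng)
        c = cumulant(s, s_next)  # any signal, not necessarily reward
        # Standard TD(0) update, with the cumulant in place of the reward.
        v[s] += alpha * (c + gamma * v[s_next] - v[s])
        s = s_next
    return v


# Hypothetical 5-state ring: the policy always moves clockwise.
def ring_step(s, rng):
    return (s + 1) % 5


# Hypothetical cumulant: a sensor that fires on arrival at state 0.
def wall_sensor(s, s_next):
    return 1.0 if s_next == 0 else 0.0


if __name__ == "__main__":
    predictions = td0_gvf(5, ring_step, wall_sensor)
    print([round(p, 3) for p in predictions])
```

Each entry of the result answers the question "how much discounted sensor signal will I accumulate from this state?" Running several such learners with different cumulants over the same stream of experience gives many predictions at once.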
In deep RL, representation is everything. The agent must learn useful features from raw sensory input before it can learn a good policy. But reward signals are sparse — in many environments, the agent receives reward rarely, providing weak gradient signal for feature learning. Auxiliary tasks supplement the main reward with additional learning signals that shape better representations.
The idea is simple: add extra prediction heads to the network. One head handles the main task (estimating the value function or policy from reward). Other heads predict pixel changes, immediate rewards, next-frame features, or GVF cumulants. All heads share the same representation layers. The auxiliary losses provide dense gradient signal that shapes features even when reward is sparse.
Auxiliary tasks are closely related to GVFs — each auxiliary prediction head is learning a GVF. The deep learning perspective adds the crucial insight: these auxiliary tasks aren't just for their own sake. They exist to shape the shared representation, making the main task easier. This is representation learning via prediction.
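To make the shared-representation point concrete, here is a deliberately tiny sketch: one linear shared feature feeding a main head and an auxiliary head. The model, the squared-error losses, and the names (`w_shared`, `aux_weight`, and so on) are illustrative assumptions, not the architecture of any particular system. It shows that the shared weights still receive gradient from the auxiliary loss when the main-task error is zero.

```python
def forward(x, w_shared, a_main, b_aux):
    """One shared linear feature h feeding two scalar heads."""
    h = sum(w * xi for w, xi in zip(w_shared, x))
    return h, a_main * h, b_aux * h


def shared_gradient(x, w_shared, a_main, b_aux,
                    target_main, target_aux, aux_weight):
    """Gradient w.r.t. w_shared of
    L = (main - target_main)^2 + aux_weight * (aux - target_aux)^2.
    """
    _, main_pred, aux_pred = forward(x, w_shared, a_main, b_aux)
    # Chain rule: both heads push gradient back through the shared feature h.
    dl_dh = (2.0 * (main_pred - target_main) * a_main
             + aux_weight * 2.0 * (aux_pred - target_aux) * b_aux)
    return [dl_dh * xi for xi in x]


if __name__ == "__main__":
    x = [1.0, 2.0]
    w = [0.5, 0.0]
    # The main head is already correct here, so it contributes no gradient.
    print(shared_gradient(x, w, 2.0, 1.0, 1.0, 1.5, aux_weight=0.5))
    print(shared_gradient(x, w, 2.0, 1.0, 1.0, 1.5, aux_weight=0.0))
```

In a real deep network the same thing happens through automatic differentiation: the total loss is a weighted sum of head losses, and the weighting (here `aux_weight`) is a tuning choice.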
When you decide to "go to the kitchen," you don't plan individual muscle contractions. You invoke a high-level action — an option — that handles the details automatically. You plan at the level of rooms and destinations, not individual steps. This is temporal abstraction, and it's one of the most important ideas in RL.
An option (Sutton, Precup & Singh, 1999) is a temporally extended action defined by three components: an initiation set I (which states can the option start in?), an internal policy π (how does it behave?), and a termination condition β (when does it stop?). Primitive actions are just options that last one step.
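The three components map directly onto a small data structure. The sketch below is one possible encoding, with hypothetical names (`Option`, `primitive_option`) and a deterministic internal policy assumed for brevity.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable, Iterable

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class Option:
    initiation_set: FrozenSet[State]       # I: states where the option may start
    policy: Callable[[State], Action]      # pi: action chosen while it runs
    termination: Callable[[State], float]  # beta(s): probability of stopping in s

    def can_start(self, state: State) -> bool:
        return state in self.initiation_set


def primitive_option(action: Action, all_states: Iterable[State]) -> Option:
    """Wrap a primitive action as an option that always stops after one step."""
    return Option(
        initiation_set=frozenset(all_states),
        policy=lambda state: action,
        termination=lambda state: 1.0,
    )


if __name__ == "__main__":
    left = primitive_option("left", range(4))
    print(left.can_start(2), left.policy(0), left.termination(3))
```

Because primitive actions fit the same interface, an agent can choose among primitive and extended options with one piece of code.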
Interactive demo: an agent using only primitive actions (left/right/up/down) versus one using high-level options ("go to room"). The comparison shows how temporal abstraction speeds up planning.
The options framework extends the standard MDP to a semi-MDP (SMDP), where actions can take variable numbers of time steps. The core RL algorithms generalize to SMDPs: Q-learning over options, policy gradient over options, and even model-based planning over options. The math is nearly identical; the main change is that an option lasting k steps is followed by a discount of γ^k rather than γ.
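As a sketch of how little changes, here is a tabular Q-learning update over options. It assumes the caller has already run the option to termination and recorded its discounted reward sum and its duration in steps. The function name, the dictionary-based Q-table, and the room and option labels are hypothetical.

```python
def smdp_q_update(q, state, option, discounted_reward, duration,
                  next_state, options, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning update after `option` ran for `duration` steps.

    `discounted_reward` is the sum of gamma**i * r_(i+1) collected while
    the option was running.
    """
    best_next = max(q.get((next_state, o), 0.0) for o in options)
    # The only change from one-step Q-learning: discount by gamma**duration.
    target = discounted_reward + gamma ** duration * best_next
    old = q.get((state, option), 0.0)
    q[(state, option)] = old + alpha * (target - old)
    return q[(state, option)]


if __name__ == "__main__":
    q_table = {("kitchen", "stay"): 10.0}
    new_value = smdp_q_update(
        q_table, "hall", "go_kitchen", discounted_reward=1.0, duration=3,
        next_state="kitchen", options=["go_kitchen", "stay"],
    )
    print(new_value)
```

With `duration=1` this reduces to the ordinary Q-learning update.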
In Chapter 8, we learned that models let agents plan. A model says "if I take action a in state s, I'll transition to s' with reward r." For options, we need the same thing: "if I invoke option o in state s, where will I end up and how much reward will I accumulate?" These are option models.
An option model has two components: a reward model (the expected cumulative reward during the option) and a transition model (the probability distribution over termination states). Together, they describe what the option "does" without specifying how it does it internally.
Option models can be learned from experience, just like regular models (Chapter 8). The agent invokes an option, observes the cumulative reward and final state, and updates the model. Alternatively, they can be computed from the option's internal policy and the environment model. The Dyna architecture extends naturally: learn option models from experience, then plan using those models.
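The learn-from-experience route can be sketched as simple sample averages. The class below is a simplified illustration with hypothetical names. It records an empirical distribution over termination states and does not fold the option's duration into the transition model, which a full discounted option model would do.

```python
from collections import defaultdict


class OptionModel:
    """Sample-average estimates of an option's reward and termination states."""

    def __init__(self):
        self.count = defaultdict(int)
        self.reward_sum = defaultdict(float)
        self.final_counts = defaultdict(lambda: defaultdict(int))

    def update(self, state, option, cumulative_reward, final_state):
        """Record one completed execution of `option` started in `state`."""
        key = (state, option)
        self.count[key] += 1
        self.reward_sum[key] += cumulative_reward
        self.final_counts[key][final_state] += 1

    def expected_reward(self, state, option):
        key = (state, option)
        n = self.count.get(key, 0)
        return self.reward_sum[key] / n if n else 0.0

    def termination_distribution(self, state, option):
        key = (state, option)
        n = self.count.get(key, 0)
        if n == 0:
            return {}
        return {s: c / n for s, c in self.final_counts[key].items()}


if __name__ == "__main__":
    model = OptionModel()
    model.update("hall", "go_kitchen", 2.0, "kitchen")
    model.update("hall", "go_kitchen", 4.0, "kitchen")
    model.update("hall", "go_kitchen", 3.0, "pantry")
    print(model.expected_reward("hall", "go_kitchen"))
    print(model.termination_distribution("hall", "go_kitchen"))
```

A Dyna-style planner could then sample `(state, option)` pairs and use these two estimates in place of real experience.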
Every RL algorithm in this book (except this section) assumes the agent observes the full state. In a gridworld, the agent knows exactly which cell it occupies. In backgammon, the entire board is visible. But in the real world, you don't see the full state. A robot sees camera images, not the underlying physics. A trader sees prices, not the full economy. The state is partially observable.
The formal framework is the Partially Observable MDP (POMDP). There is a hidden state s, but the agent only sees an observation o that depends stochastically on s. Different states can produce the same observation (aliasing). The agent must maintain a belief state — a probability distribution over possible hidden states — and make decisions based on that belief.
Full Observability (MDP)
Agent sees state s directly. Policy maps states to actions: π(a|s). Value is a function of state: V(s). Standard RL works.
Partial Observability (POMDP)
Agent sees observation o, not state s. Must maintain belief b(s) = P(s|history). Policy maps beliefs to actions: π(a|b). Belief is the "state" of the POMDP.
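For a small discrete POMDP with known dynamics, the belief b(s) can be maintained with a Bayes filter: predict forward through the transition model, then reweight by how well each state explains the observation. The sketch below assumes the transition and observation probabilities are available as functions. The two-state "listen" example and all names are hypothetical.

```python
def belief_update(belief, action, observation, transition, observation_prob):
    """Bayes filter: b'(s') is proportional to O(o | s') * sum_s T(s' | s, a) * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        predicted = sum(transition(s, action, s_next) * belief[s] for s in states)
        unnormalized[s_next] = observation_prob(observation, s_next) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-state problem: listening never changes the hidden state,
# and the sound heard points to the true side 85% of the time.
def listen_transition(s, action, s_next):
    return 1.0 if s == s_next else 0.0


def hear_prob(observation, s):
    return 0.85 if observation == "hear_" + s else 0.15


if __name__ == "__main__":
    b = {"left": 0.5, "right": 0.5}
    b = belief_update(b, "listen", "hear_left", listen_transition, hear_prob)
    print(b)
```

The returned dictionary is the new "state" that a belief-based policy π(a|b) would act on.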
A common practical approach is to stack observations: use the last k frames as input, as DQN did with Atari. This helps with short-term partial observability (like knowing velocity from two position observations) but doesn't solve long-term dependencies. Recurrent networks offer a more principled solution, maintaining a learned hidden state that summarizes observation history.
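Observation stacking needs only a fixed-length buffer. The sketch below uses scalar positions instead of image frames to keep it small. The `FrameStack` class is a hypothetical helper, not the API of any particular library.

```python
from collections import deque


class FrameStack:
    """Keep the last k observations and expose them as one stacked input."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_observation):
        # Fill the buffer with copies of the first frame at episode start.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_observation)
        return tuple(self.frames)

    def step(self, observation):
        # The deque drops the oldest frame automatically once it is full.
        self.frames.append(observation)
        return tuple(self.frames)


if __name__ == "__main__":
    stack = FrameStack(k=2)
    stack.reset(1.0)
    older, newer = stack.step(1.5)
    print("estimated velocity:", newer - older)
```

With two stacked positions the agent can recover velocity, but anything that happened more than k steps ago is still invisible to it.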
POMDPs define the hidden state in terms of the underlying dynamics. But the agent never sees those dynamics directly — it only sees observations and actions. Is there a way to represent state entirely in terms of observable quantities? Predictive State Representations (PSRs) offer exactly this.
A PSR represents the current state as a vector of predictions about future observations. Instead of saying "I believe I'm in state 3 with probability 0.7," a PSR says "if I take action LEFT, I predict I'll see observation A with probability 0.8; if I take action RIGHT, I predict observation B with probability 0.3." The state is the set of predictions.
PSRs have a beautiful theoretical property: for any finite POMDP, there exists a PSR of finite dimension that captures the complete state. The dimension of the PSR is at most the number of hidden states, and can be smaller. In some cases, a compact PSR exists even when the full belief state is high-dimensional.
In practice, PSRs haven't yet been as successful as recurrent neural networks for handling partial observability. But the idea — that state should be grounded in predictions about future experience — has influenced how we think about representation learning in deep RL.
Every RL application requires a reward function. For games, reward is obvious: win = +1, lose = -1. But for real-world tasks, designing the reward is often the hardest part. A robot tasked with cleaning a room might learn to cover the room with a cloth (technically "clean" but useless), or learn to push all objects off the table (the table is clean!).
The reward design problem is: how do you specify a reward function that actually incentivizes the behavior you want? This is harder than it sounds, because agents are remarkably creative at finding unintended shortcuts.
Interactive demo: agent behavior with sparse reward (only at the goal) versus shaped reward (distance-based guidance). Shaped rewards speed learning but can change which policy is optimal if designed carelessly.
Three approaches to reward design have emerged:
• Reward shaping adds an extra term F(s,s') to the reward to guide the agent. If F is potential-based, meaning F(s,s') = γΦ(s') − Φ(s) for some potential function Φ over states, the optimal policy is provably preserved.
• Inverse RL infers the reward function from demonstrations of desired behavior.
• Reinforcement learning from human feedback (RLHF) has humans evaluate or compare agent behavior, learns a reward model from those judgments, and then optimizes the policy against that model.
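A short sketch of potential-based shaping, using the standard form F(s,s') = γΦ(s') − Φ(s), shows why it is safe: along any trajectory the discounted shaping terms telescope to a quantity that depends only on the start and end states, so they cannot change which policy is best. The corridor, the goal position, and the distance-based potential are hypothetical choices for illustration.

```python
def shaping_term(state, next_state, potential, gamma):
    """Potential-based shaping: F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_state) - potential(state)


def shaped_reward(reward, state, next_state, potential, gamma):
    return reward + shaping_term(state, next_state, potential, gamma)


def discounted_shaping_sum(trajectory, potential, gamma):
    """Sum of gamma**t * F(s_t, s_(t+1)) along a list of visited states."""
    pairs = zip(trajectory, trajectory[1:])
    return sum(gamma ** t * shaping_term(s, s_next, potential, gamma)
               for t, (s, s_next) in enumerate(pairs))


GOAL = 5  # hypothetical goal cell in a one-dimensional corridor


def corridor_potential(state):
    # Higher potential closer to the goal.
    return -abs(GOAL - state)


if __name__ == "__main__":
    path = [0, 1, 2, 1, 2, 3]
    gamma = 0.9
    steps = len(path) - 1
    total = discounted_shaping_sum(path, corridor_potential, gamma)
    telescoped = (gamma ** steps * corridor_potential(path[-1])
                  - corridor_potential(path[0]))
    print(total, telescoped)  # equal up to floating-point rounding
```

A shaping term that is not of this form, such as a flat bonus for every step taken toward the goal, carries no such guarantee.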
Despite its remarkable successes, RL faces fundamental open challenges. These are not incremental improvements — they are problems where current methods fundamentally struggle. Solving them would transform the field.
| Problem | Why It's Hard | Current Status |
|---|---|---|
| Sample efficiency | Real-world data is expensive | Model-based RL helps but doesn't solve it |
| Online deep learning | Neural nets are unstable with streaming data | Replay buffers are a workaround, not a solution |
| Safety | Exploration can be dangerous in the real world | Constrained MDPs, conservative methods |
| Curiosity / intrinsic motivation | Agents need to explore even without reward | Prediction error, count-based bonuses |
| Continual learning | Learning new tasks without forgetting old ones | Catastrophic forgetting remains unsolved |
| Representation learning | What features should the agent build? | Auxiliary tasks, world models help |
Curiosity and intrinsic motivation address a fundamental chicken-and-egg problem: how do you learn in environments with little or no reward signal, when finding reward requires exploring and exploring has nothing to guide it? The idea is to create an internal "reward" for encountering novel or unpredictable states. This drives the agent to explore its environment systematically, building knowledge that will be useful when extrinsic reward does appear.
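One of the simplest intrinsic rewards is a count-based bonus that shrinks as a state becomes familiar. The sketch below uses a bonus proportional to 1/sqrt(N(s)), a common choice for tabular settings. The class name and the `scale` parameter are illustrative assumptions.

```python
import math
from collections import Counter


class CountBonus:
    """Intrinsic reward that decays with the number of visits to a state."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.counts = Counter()

    def __call__(self, state):
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])


if __name__ == "__main__":
    bonus = CountBonus()
    for state in ["a", "a", "a", "b"]:
        print(state, round(bonus(state), 3))
```

The agent would learn from the extrinsic reward plus this bonus. Novel states pay more, so exploration continues even when the environment itself pays nothing. Exact counts only work when states repeat; large or continuous state spaces need approximations such as prediction error.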
Safety is perhaps the most practically important open problem. A robot learning to walk might fall and break itself. A medical RL system making a bad decision could harm a patient. We need RL methods that explore safely, respect constraints, and degrade gracefully. Constrained MDPs and safe exploration are active areas of research, but far from solved.
This is the end of the road. Across 17 chapters, we've traveled from the simplest possible RL setting — a single bandit with two arms — to the frontiers of AI research. Let's step back and see the full arc.
| Part | Chapters | Core Idea |
|---|---|---|
| Foundations | 1-3 | Bandits, MDPs, the RL problem |
| Tabular Methods | 4-8 | DP, MC, TD, n-step, planning |
| Function Approximation | 9-13 | Scale to large problems with parameterized models |
| Connections | 14-15 | Psychology, neuroscience — RL in nature |
| Applications & Frontiers | 16-17 | Where RL has worked, where it's going |
What RL has achieved:
• Superhuman game play (Go, Atari, Chess)
• Real-world control (data centers, robotics)
• Deep neuroscience insights (phasic dopamine as a TD error signal)
• A unified framework for decision-making
• Foundation for RLHF in language models
What remains open:
• Sample efficiency for real-world tasks
• Safe exploration and deployment
• Continual learning without forgetting
• Reward design and alignment
• Online deep learning stability
Reinforcement learning occupies a unique position in AI. It is the only framework that addresses the full problem of intelligence: an agent, embedded in an environment, learning from consequences, pursuing goals. Supervised learning assumes a teacher. Unsupervised learning has no goals. RL is the complete package — perception, action, learning, planning — all driven by interaction with the world.