One policy, every task. Turn failures into free training data. After this, you build generalist agents.
You've trained a robot arm to reach position (0.5, 0.3, 0.2). It took 500,000 environment steps — about 3 days of wall-clock time on expensive hardware. Your collaborator says: "Great. Now make it reach (0.7, 0.1, 0.4). And (0.2, 0.8, 0.1). And 97 more positions."
One hundred reaching targets. Half a million steps each. That's 50 million total steps — roughly 300 days of robot time. But these tasks are nearly identical! The arm dynamics are the same. The physics is the same. The only thing that changes is the target location.
You're paying full price a hundred times for knowledge that overlaps 95%.
Single-task RL trains one policy per task. But most real-world task families share enormous structure — motor dynamics, physics, spatial reasoning. Multi-task RL trains one policy that takes a task identifier as input and performs any task on demand. Shared structure is learned once and amortized across all tasks.
Let's make this precise. Suppose you have K tasks, each requiring N steps as an independent specialist. The tasks share fraction f of their underlying knowledge (motor control, dynamics, spatial reasoning). A generalist policy learns the shared fraction once, paying approximately c · N for the shared component (where c > 1 accounts for multi-task interference), and then each task's unique component costs (1−f) · N:

$$N_{\text{specialists}} = K \cdot N, \qquad N_{\text{generalist}} \approx c \cdot N + K\,(1 - f) \cdot N$$

For robot manipulation with K=100 tasks and f=0.8 (80% shared structure), the ratio is K / (c + K(1−f)) = 100 / (20 + c), so the generalist needs roughly 5x fewer total steps than training 100 specialists as long as c stays small compared with 20.
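The cost model above is simple enough to evaluate directly. The sketch below is only an illustration of that back-of-the-envelope formula: the function names and the interference values tried for c are assumptions, not numbers from any experiment.

```python
def specialist_steps(num_tasks, steps_per_task):
    """Total steps when every task gets its own policy: K * N."""
    return num_tasks * steps_per_task


def generalist_steps(num_tasks, steps_per_task, shared_fraction, interference=1.0):
    """Shared part paid once (c * N) plus each task's unique part (K * (1 - f) * N)."""
    shared = interference * steps_per_task
    unique = num_tasks * (1.0 - shared_fraction) * steps_per_task
    return shared + unique


K, N, f = 100, 500_000, 0.8
for c in (1.0, 2.0, 5.0):  # assumed interference factors, chosen for illustration
    ratio = specialist_steps(K, N) / generalist_steps(K, N, f, c)
    print(f"c = {c}: generalist needs {ratio:.2f}x fewer steps")
```

For c = 1 the ratio is 100/21, about 4.8, and it shrinks slowly as c grows, which is why "roughly 5x" is a fair summary when interference is mild.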
Data efficiency is only half the story. Multi-task training also produces better features. When a network must solve many related tasks simultaneously, it can't memorize any single task's quirks — it must learn generalizable representations. This acts as a powerful regularizer.
Concrete evidence: In MT-Opt (Google, 2021), a single multi-task policy trained on 12 manipulation tasks outperformed 12 individual specialists on 10 of the 12 tasks. The generalist didn't just match — it beat the specialists, because the shared data gave it richer features.
Task A: "Push block to left." Task B: "Push block to right." Both require learning: (1) how to approach the block, (2) how to make contact, (3) how to maintain contact while moving. Only the direction vector differs. A generalist learns the shared approach-contact-push pipeline once. A specialist learns it from scratch each time.
Multi-task RL has failure modes. If tasks are genuinely unrelated (chess + protein folding), sharing a network forces competition for capacity with no benefit. Negative transfer occurs when optimizing one task actively hurts another. We'll cover mitigation strategies in Chapter 10.
Standard RL has one MDP. Multi-task RL has a family of MDPs that share some structure. Let's formalize this precisely.
A multi-task problem is a set of K MDPs {M_1, ..., M_K}, where each M_i = (S_i, A_i, T_i, r_i, γ). The MDPs may differ in any combination of: state space S, action space A, transition dynamics T, reward function r. They share a policy parameterization π_θ.
The task identifier z_i is a vector that uniquely specifies task i. It can be a one-hot index, a continuous embedding, a goal state, or a natural language string. The policy conditions on z: π_θ(a | s, z_i). Change z, change the behavior.
With K tasks, each with its own reward function r_i and trajectory distribution p_θ^(i)(τ), the multi-task objective maximizes the average (or weighted sum) of all task objectives:

$$J(\theta) = \sum_{i=1}^{K} w_i \, J_i(\theta), \qquad J_i(\theta) = \mathbb{E}_{\tau \sim p_\theta^{(i)}(\tau)} \Big[ \sum_{t} \gamma^{t} \, r_i(s_t, a_t) \Big]$$
Symbol by symbol:
• θ: shared policy parameters (one neural network for all tasks)
• J_i(θ): expected return on task i under the current policy
• w_i: task weight. Uniform by default. Can be tuned to prioritize harder tasks.
• r_i(s,a): reward function specific to task i
| Component | Shared? | Example |
|---|---|---|
| State space S | Usually yes | Same robot joints + sensors |
| Action space A | Usually yes | Same motor commands |
| Dynamics T(s' \| s, a) | Often yes | Same physics engine |
| Reward r(s,a) | NO — this differs | "Reach left" vs "reach right" |
| Initial state p(s_0) | Often yes | Same starting position |
The most common scenario: same state/action/dynamics, different reward. This is exactly the setup where goal-conditioned RL shines — the reward is determined entirely by the goal.
The standard approach: concatenate the task identifier z with the state s, and feed [s; z] into a shared neural network. The first few layers learn task-agnostic features (physics, spatial reasoning). Later layers specialize based on z.
By feeding z alongside s, the network can learn: "when z says 'reach left', use the left-reaching motor patterns." The shared trunk still processes s through generic physics-aware layers. Only the later layers interpret z to produce task-specific behavior. This is the same idea as conditioning in diffusion models or class-conditional image generation.
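The concatenation idea can be sketched in a few lines of NumPy. This is a minimal, untrained forward pass, assuming a one-hot task identifier and a two-layer network; the layer sizes and the names `policy_mean` and `one_hot` are illustrative choices, not part of any specific library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 6-D state, 4 tasks, 3-D action.
STATE_DIM, NUM_TASKS, HIDDEN, ACTION_DIM = 6, 4, 32, 3

# One set of weights shared by every task.
W1 = rng.normal(0.0, 0.1, size=(STATE_DIM + NUM_TASKS, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, size=(HIDDEN, ACTION_DIM))
b2 = np.zeros(ACTION_DIM)


def one_hot(task_index, num_tasks=NUM_TASKS):
    """Task identifier z as a one-hot vector."""
    z = np.zeros(num_tasks)
    z[task_index] = 1.0
    return z


def policy_mean(state, z):
    """Deterministic action for input [s; z] through the shared network."""
    x = np.concatenate([state, z])   # [s; z]
    h = np.tanh(x @ W1 + b1)         # shared hidden features
    return np.tanh(h @ W2 + b2)      # action in [-1, 1]^ACTION_DIM


state = rng.normal(size=STATE_DIM)
action_task0 = policy_mean(state, one_hot(0))
action_task1 = policy_mean(state, one_hot(1))
```

The same state yields different actions once z changes, even though every weight is shared; training then decides how much of the computation ends up task-specific.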
Multi-task RL with an abstract task identifier z is powerful but raises a question: what should z be? A one-hot vector over 100 tasks? That doesn't generalize to task 101. A random embedding? Meaningless without training data for that embedding.
The elegant answer: let z be the goal state itself.
A policy that takes the current state s and a goal g as input, and outputs an action that moves the agent toward g. The goal g can be a desired state (e.g., target position), a set of desired states, or a state predicate (e.g., "block is on shelf").
This is a specific instance of multi-task RL where:
• The task identifier z = goal g
• The reward is goal-dependent: r(s, a, g) = f(s', g)
• Every possible goal defines a different task
Goals have a crucial property that arbitrary task IDs lack: compositional structure. Goal (0.5, 0.3) is "close to" goal (0.5, 0.4) in a meaningful way. If the policy knows how to reach (0.5, 0.3), reaching (0.5, 0.4) should require only a small adjustment. This gives us:
A goal-conditioned policy trained on goals sampled from some distribution can generalize to unseen goals — goals it was never trained on — because the goal space has continuous structure. This is impossible with one-hot task IDs. You can't interpolate between one-hot vectors meaningfully.
| Goal Type | Form of g | Example | Reward |
|---|---|---|---|
| Target state | g ∈ S | End-effector at (0.3, 0.7, 0.2) | −‖s' − g‖ |
| State predicate | g: S → {0,1} | "Block is on shelf" | 1[g(s') = 1] |
| Image goal | g = image | Photo of desired scene | −‖embed(s') − embed(g)‖ |
| Language goal | g = text | "Pick up red cup" | Learned reward model |
A robot arm with 3D end-effector position. State s = current (x, y, z). Goal g = desired (x_g, y_g, z_g). The goal space is the full 3D workspace of the arm: an infinite set of tasks, all sharing the same dynamics. One policy handles all of them.
A goal-conditioned policy π(a|s,g) is called universal if it can achieve any reachable goal g in the goal space. "Reachable" means there exists some sequence of actions from s to g under the dynamics. The policy is a single function that implements all reachable tasks.
Standard Q-learning uses Q(s, a) — the expected return from taking action a in state s. Goal-conditioned Q-learning extends this to Q(s, a, g) — the expected return from taking a in s when trying to reach goal g.
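The goal-conditioned action-value function is the expected cumulative reward starting from state s, taking action a, and following policy π thereafter, with the goal of reaching g:

$$Q^{\pi}(s, a, g) = \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, a_t, g) \;\Big|\; s_0 = s,\ a_0 = a \Big]$$

The goal-conditioned state-value function is the expected return from state s when pursuing goal g under policy π:

$$V^{\pi}(s, g) = \mathbb{E}_{a \sim \pi(\cdot \mid s, g)} \big[ Q^{\pi}(s, a, g) \big]$$

Q(s, a, g) satisfies the standard Bellman recursion, but with goal-dependent reward:

$$Q(s, a, g) = \mathbb{E}_{s' \sim T(\cdot \mid s, a)} \big[ r(s, a, g) + \gamma \, V(s', g) \big]$$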
Symbol by symbol:
• Q(s, a, g) — value of taking action a in state s while pursuing goal g
• r(s, a, g) — immediate reward. Depends on the goal g.
• γ — discount factor (typically 0.98-0.99). How much we value future rewards.
• V(s', g) — value of the next state s' (still pursuing same goal g)
• T(s'|s,a) — environment dynamics (unknown, estimated by sampling)
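The most common choice: sparse binary reward.

$$r(s, a, g) = \mathbb{1}\big[\, \lVert s' - g \rVert < \varepsilon \,\big]$$

Alternative: dense negative distance.

$$r(s, a, g) = -\lVert s' - g \rVert$$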
Dense reward (distance) gives gradient signal at every step → easier optimization. But it requires knowing the right distance metric, and may incentivize suboptimal paths (going toward goal in Euclidean space might hit a wall). Sparse reward (binary success) is more general and doesn't impose metric assumptions, but provides almost no learning signal. This sparse reward problem is the central challenge of GCRL.
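Both reward choices are one-liners once the goal is a target state. The sketch below assumes Euclidean distance and a default tolerance of 2 cm; the function names are illustrative.

```python
import numpy as np


def sparse_reward(next_state, goal, tolerance=0.02):
    """1.0 if the next state lies within `tolerance` of the goal, else 0.0."""
    distance = np.linalg.norm(np.asarray(next_state) - np.asarray(goal))
    return float(distance < tolerance)


def dense_reward(next_state, goal):
    """Negative Euclidean distance from the next state to the goal."""
    return -float(np.linalg.norm(np.asarray(next_state) - np.asarray(goal)))
```

Because both functions take the goal as an argument, the same transition can be re-scored against any goal after the fact, which is the property hindsight relabeling relies on.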
Binary goal reward sounds clean: +1 when you reach the goal, 0 otherwise. But think about what this means for learning. The agent takes random actions. In a continuous 3D workspace, what's the probability that a random trajectory happens to land within ε of a specific goal?
Robot workspace: 1m × 1m × 1m cube. Goal tolerance ε = 2cm. A random trajectory's endpoint is roughly uniform in the workspace.
Volume of goal region: (4/3)π(0.02)³ ≈ 3.35 × 10⁻⁵ m³
Workspace volume: 1 m³
P(random success): 3.35 × 10⁻⁵ ≈ 0.003%
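With 100 random trajectories per training batch: expected successes per batch = 0.003. In expectation you need about 1 / (3.35 × 10⁻⁵) ≈ 30,000 trajectories (roughly 300 batches) before seeing a single success by chance. At 50 steps per trajectory (the episode length used later in this lecture), that is about 1.5 million environment steps with essentially zero learning signal.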
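The arithmetic can be checked directly. This sketch keeps the same simplifying assumption as the text (a random trajectory's endpoint is uniform over a 1 m³ workspace) and uses the fact that the expected number of independent trials before the first success is 1/p.

```python
import math


def goal_ball_volume(tolerance_m):
    """Volume of a sphere of radius `tolerance_m` around the goal, in m^3."""
    return (4.0 / 3.0) * math.pi * tolerance_m ** 3


workspace_volume = 1.0          # m^3, the 1 m x 1 m x 1 m cube
trajectories_per_batch = 100

p_success = goal_ball_volume(0.02) / workspace_volume
expected_successes_per_batch = trajectories_per_batch * p_success
expected_trajectories_to_first_success = 1.0 / p_success
expected_batches_to_first_success = (
    expected_trajectories_to_first_success / trajectories_per_batch
)

print(f"P(success)            = {p_success:.3e}")
print(f"successes per batch   = {expected_successes_per_batch:.4f}")
print(f"trajectories to first = {expected_trajectories_to_first_success:,.0f}")
print(f"batches to first      = {expected_batches_to_first_success:,.0f}")
```

Under these assumptions the first chance success takes on the order of 30,000 trajectories, or about 300 batches of 100.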
Q-learning updates Q(s, a, g) using the Bellman target:

$$y = r(s, a, g) + \gamma \max_{a'} Q(s', a', g)$$

With sparse reward, r(s,a,g) = 0 for almost all transitions. And Q is initialized to 0. So the TD target is:

$$y = 0 + \gamma \cdot 0 = 0$$
The update becomes Q ← Q + α[0 − Q] = Q(1 − α). Q stays at zero. No gradient signal propagates. The only way to get signal is if the agent accidentally reaches the goal — which we just showed happens with probability 0.003%.
Q-learning needs successful transitions to propagate value. But the policy needs non-zero Q-values to know which actions lead toward the goal. With sparse reward, you need success to learn, but you need learning to succeed. This circular dependency is why naive GCRL with sparse reward is essentially broken.
The probability of random success doesn't increase with training steps — it's a property of the goal size relative to the state space. In high dimensions (7-DOF robot arm = 14D state space), the ratio of goal volume to state volume is astronomically small. You could run for a billion steps and never see success. We need a fundamentally different approach.
Here's the breakthrough insight (Andrychowicz et al., 2017). You tried to reach goal g = (1, 1) but your trajectory ended at (0.5, 0.8). A failure! r = 0 at every step. Worthless transition? No.
Ask a different question: "If my goal had been (0.5, 0.8), would this trajectory be a success?" Yes! The agent did reach (0.5, 0.8). It just didn't happen to be the goal we asked for.
Every trajectory is a success — for some goal. Take a failed trajectory, replace the original goal with the state actually reached, and you get a free positive-reward training example. You're not changing the environment. You're changing the question: "What goal would have made this trajectory successful?"
This might feel like cheating. Are we corrupting the training signal? No. Here's why:
The transition (s, a, s') is physically real. It happened. The dynamics are the same regardless of what goal we assign. The only thing that changes is the reward label: r(s, a, g_new) instead of r(s, a, g_original). Because reward is a function of (s, a, g) that we define, we can evaluate it for any g after the fact.
Q-learning is off-policy: it can learn from any transition (s, a, r, s') regardless of what policy generated it or what goal was being pursued. The Bellman equation doesn't care why action a was taken — only what happened as a result.
For any transition (s_t, a_t, s_{t+1}) and any goal g:

$$Q(s_t, a_t, g) \leftarrow Q(s_t, a_t, g) + \alpha \Big[ r(s_t, a_t, g) + \gamma \max_{a'} Q(s_{t+1}, a', g) - Q(s_t, a_t, g) \Big]$$

This update is valid for any g, not just the goal that was being pursued when a_t was taken. This is the off-policy property that makes HER work.
The future strategy (g' = a state visited later in the trajectory) has a key property: the agent actually reached g' from s_t in some number of steps. This means the relabeled transition provides not just a success signal, but a temporally consistent one: there's a path from s_t to g' that the agent demonstrated.
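Episode length T = 50 steps. With k = 4 relabeled goals per transition using the "future" strategy, the original replay buffer gets 50 transitions (mostly r=0). After HER: 50 + 200 = 250 transitions, so relabeled data makes up k/(1+k) = 80% of the buffer. Every relabeled goal is one the agent demonstrably reached later in the same trajectory, and each relabeled transition whose goal is its own next state s_{t+1} (or a state within tolerance of it) carries r=1. The share of transitions tied to an achieved goal goes from ~0% to 80%, and Q-learning finally has reward signal to propagate.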
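The "future" relabeling step can be sketched as a function over one stored episode. This is a minimal illustration, assuming states are NumPy vectors, the goal is a target state, and the reward is the sparse within-tolerance indicator; the function and variable names are hypothetical, not taken from a particular HER implementation.

```python
import numpy as np


def reached(next_state, goal, tolerance=0.1):
    """Sparse reward: 1.0 if next_state is within tolerance of goal."""
    return float(np.linalg.norm(next_state - goal) < tolerance)


def her_future_relabel(states, actions, goal, k=4, rng=None):
    """Return original plus relabeled transitions (s, a, r, s_next, g).

    states: array of shape (T + 1, d); actions: sequence of length T.
    """
    if rng is None:
        rng = np.random.default_rng()
    T = len(actions)
    buffer = []
    for t in range(T):
        s, a, s_next = states[t], actions[t], states[t + 1]
        # Original transition, scored against the goal that was pursued.
        buffer.append((s, a, reached(s_next, goal), s_next, goal))
        # k hindsight copies, each scored against a later visited state.
        for _ in range(k):
            future_index = rng.integers(t + 1, T + 1)  # uniform in {t+1, ..., T}
            new_goal = states[future_index]
            buffer.append((s, a, reached(s_next, new_goal), s_next, new_goal))
    return buffer


# Toy episode (assumed): 50 steps to the right, 0.2 apart, far from the goal.
states = np.stack([np.array([0.2 * i, 0.0]) for i in range(51)])
actions = ["right"] * 50
buffer = her_future_relabel(
    states, actions, goal=np.array([100.0, 100.0]), k=4,
    rng=np.random.default_rng(0),
)
num_positive = sum(r for (_, _, r, _, _) in buffer)
```

The buffer grows from 50 to 250 transitions. How many of them carry r=1 depends on how far apart consecutive states are relative to the tolerance, so `num_positive` is worth inspecting for a given environment rather than assuming.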
Let's work through a complete example to make HER concrete. A 2D robot arm with end-effector position as state.
• State: s = (x, y) position of end-effector
• Goal: g = (1.0, 1.0) — reach this point
• Reward: r(s, a, g) = 1 if ||s' − g|| < 0.1, else 0
• Discount: γ = 0.98
• Learning rate: α = 0.1
• Q initialized to 0 everywhere
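For transition (s_2, a_2, s_3) with g = (1.0, 1.0): the next state s_3 = (0.5, 0.8) is about 0.54 away from g, so r = 0 and the update leaves Q unchanged:

$$Q(s_2, a_2, g) \leftarrow 0 + 0.1 \, [\, 0 + 0.98 \cdot 0 - 0 \,] = 0$$

Using final strategy: relabel with g' = s_3 = (0.5, 0.8).
Now re-evaluate transition (s_2, a_2, s_3) with g' = (0.5, 0.8): the next state is exactly g', so r = 1:

$$Q(s_2, a_2, g') \leftarrow 0 + 0.1 \, [\, 1 + 0.98 \cdot 0 - 0 \,] = 0.1$$

Non-zero Q-value! The agent has learned: "action 'move up' in state (0.5, 0.4) is good for reaching (0.5, 0.8)."
Now process transition (s_1, a_1, s_2) with g' = (0.5, 0.8): s_2 = (0.5, 0.4) is 0.4 away from g', so r = 0, but the bootstrap term is no longer zero:

$$Q(s_1, a_1, g') \leftarrow 0 + 0.1 \, [\, 0 + 0.98 \cdot 0.1 - 0 \,] = 0.0098$$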
Value is propagating backward from the goal. After many iterations of replay, the entire trajectory gets non-zero Q-values for goal (0.5, 0.8). The policy learns: "to reach (0.5, 0.8), move right, then up-right, then up."
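The updates in this example can be replayed with a tiny tabular Q-function. The sketch assumes states are referred to by name ("s1", "s2", "s3"), uses only the two positions the example actually states, (0.5, 0.4) for s2 and (0.5, 0.8) for s3, and treats the three moves mentioned above as the action set. It bootstraps from the next state with the standard Q-learning update and does no special terminal handling, which does not matter here because Q starts at zero.

```python
import math
from collections import defaultdict

GAMMA, ALPHA, TOLERANCE = 0.98, 0.1, 0.1
ACTIONS = ("right", "up-right", "up")   # assumed action set for illustration

Q = defaultdict(float)                  # Q[(state_name, action, goal)], all zeros


def reward(next_pos, goal):
    """Sparse reward: 1 if the next position is within tolerance of the goal."""
    return 1.0 if math.dist(next_pos, goal) < TOLERANCE else 0.0


def q_update(state, action, next_state, next_pos, goal):
    """One tabular Q-learning update for (state, action) under `goal`."""
    best_next = max(Q[(next_state, a, goal)] for a in ACTIONS)
    target = reward(next_pos, goal) + GAMMA * best_next
    Q[(state, action, goal)] += ALPHA * (target - Q[(state, action, goal)])
    return Q[(state, action, goal)]


s2_pos, s3_pos = (0.5, 0.4), (0.5, 0.8)
original_goal = (1.0, 1.0)
hindsight_goal = s3_pos                 # "final" strategy: g' = s3

q_original = q_update("s2", "up", "s3", s3_pos, original_goal)
q_hindsight_s2 = q_update("s2", "up", "s3", s3_pos, hindsight_goal)
q_hindsight_s1 = q_update("s1", "up-right", "s2", s2_pos, hindsight_goal)
```

The three results are 0 for the original goal, 0.1 for the relabeled last transition, and 0.1 × 0.98 × 0.1 = 0.0098 one step earlier, which is the backward propagation described above.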
Without HER: 0 successful transitions, 0 gradient signal, no learning. With HER: every trajectory provides positive reward for the states it actually visited. Q-values propagate. The policy improves. And as the policy improves, it reaches states closer to the real goal, generating genuinely useful signal for the original task.
Episode length T, k relabeled goals per transition using "future" strategy. Derive: (1) Total transitions in replay buffer. (2) Expected number with r=1. (3) The positive-reward ratio as a function of T and k.
Total transitions: T (original) + k·T (relabeled) = T(1 + k), so relabeled transitions make up k/(1+k) of the buffer. For k=4: 80%. For k=8: 89%.
Positive original: ~0 (episode failed)
Positive relabeled: index the transitions t = 0, ..., T−1. Transition (s_t, a_t, s_{t+1}) draws each of its k goals from the T−t future states {s_{t+1}, ..., s_T}. The relabeled copy has r=1 when g' = s_{t+1} (we immediately reached it); goals further ahead (g' = s_{t+2}, etc.) give r=0 for this transition unless they lie within tolerance of s_{t+1}, but they give r=1 for later transitions. With uniform sampling, each draw is positive with probability 1/(T−t), so the expected number of positive relabeled transitions is Σ_t k/(T−t) = k·H_T, where H_T = 1 + 1/2 + ... + 1/T ≈ ln T + 0.58.
For comparison, the "final" strategy (simplest) relabels every transition with g' = s_T, which gives r=1 only at the last step, so it contributes k positives if the last transition is relabeled k times.
Positive ratio ≈ k·H_T / (T(1+k)), which is a lower bound because nearby future states can also fall within tolerance. For T=50, k=4: H_50 ≈ 4.5, about 18 positives out of 250, or roughly 7%. That is small in absolute terms but far above the near-zero rate without relabeling, and every positive sits next to transitions that bootstrap from it.
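A small numeric check helps separate two quantities that are easy to conflate: the share of the buffer that is relabeled, k/(1+k), and the share that is immediately rewarded. The sketch below assumes a strict reading of the "future" strategy: goals are drawn uniformly from the T−t states after step t, and a relabeled transition earns r=1 only when its goal is its own next state (consecutive states are further apart than the tolerance). The function names are illustrative.

```python
def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def future_strategy_counts(T, k):
    """Return (total transitions, relabeled fraction, expected positive fraction)."""
    total = T * (1 + k)
    relabeled_fraction = k / (1 + k)
    # Transition t (t = 0..T-1) has T - t candidate goals, exactly one of which
    # is its own next state, so each draw is positive with probability 1/(T - t).
    expected_positive = k * harmonic(T)
    return total, relabeled_fraction, expected_positive / total


total, relabeled_fraction, positive_fraction = future_strategy_counts(T=50, k=4)
print(total, relabeled_fraction, round(positive_fraction, 3))
```

With T = 50 and k = 4 this gives 250 transitions, 80% of them relabeled, and an expected immediately-rewarded share of about 7% under the strict assumption. In environments where several consecutive states fall within tolerance of each other, the rewarded share is higher.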
So far we've discussed Q(s, a, g) conceptually. But how do we represent a function of three arguments that generalizes across all goals? A table? Impossible — the goal space is continuous. We need function approximation.
A neural network Qθ(s, a, g) that takes state, action, AND goal as input and outputs a scalar Q-value. "Universal" because one network handles all goals — it generalizes across the goal space via learned embeddings. (Schaul et al., 2015)
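The simplest UVFA architecture encodes the state with φ(s) and the goal with ψ(g), concatenates the two embeddings, and feeds them, along with the action, through a network that outputs a scalar Q-value.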
Why separate encoders? The state might be a 14D joint configuration while the goal might be a 3D target position. They live in different spaces. The encoders map both into a common latent space where their relationship (how far the state is from achieving the goal) can be computed.
This is the payoff: a UVFA trained on goals sampled from some distribution can evaluate Q(s, a, g) for goals never seen during training. If it was trained on goals uniformly in [0,1]² and you query g = (0.37, 0.82), a point it never specifically trained on, the network can interpolate smoothly because the goal encoder ψ learned a continuous mapping.
Train UVFA on 1000 random goals in a 2D workspace. At test time, give it goal (0.42, 0.73) — never seen in training. The goal encoder produces an embedding. The Q-network evaluates which action from the current state moves toward that embedding. No additional training needed. This is analogous to how a language model generalizes to sentences it's never seen — the embedding space has learned the structure of the domain.
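A concatenation-style UVFA forward pass might look like the sketch below. It is an untrained illustration under stated assumptions: single-layer encoders φ and ψ, the action appended to the concatenated embeddings, and made-up dimensions (14-D state, 3-D goal, 7-D action). None of these choices come from the original UVFA paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions).
STATE_DIM, GOAL_DIM, ACTION_DIM, EMBED, HIDDEN = 14, 3, 7, 16, 32


def init_linear(n_in, n_out):
    """Random weight matrix and zero bias for one linear layer."""
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)


params = {
    "phi": init_linear(STATE_DIM, EMBED),                  # state encoder
    "psi": init_linear(GOAL_DIM, EMBED),                   # goal encoder
    "hidden": init_linear(2 * EMBED + ACTION_DIM, HIDDEN),
    "out": init_linear(HIDDEN, 1),
}


def linear(x, layer):
    weights, bias = layer
    return x @ weights + bias


def q_value(state, action, goal):
    """Q(s, a, g): encode s and g separately, concatenate with a, map to a scalar."""
    state_embedding = np.tanh(linear(state, params["phi"]))
    goal_embedding = np.tanh(linear(goal, params["psi"]))
    joint = np.concatenate([state_embedding, goal_embedding, action])
    hidden = np.tanh(linear(joint, params["hidden"]))
    return float(linear(hidden, params["out"])[0])


state = rng.normal(size=STATE_DIM)
action = rng.normal(size=ACTION_DIM)
q_goal_a = q_value(state, action, np.array([0.42, 0.73, 0.10]))
q_goal_b = q_value(state, action, np.array([0.90, 0.10, 0.50]))
```

Any goal vector of the right dimension can be queried, including ones never used in training, because the goal enters only through the continuous encoder ψ. Whether the resulting values are accurate for unseen goals depends on training coverage.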
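Combining UVFA with HER gives a complete goal-conditioned RL system: collect trajectories with the goal-conditioned policy, relabel them with HER, and train the UVFA on the augmented replay buffer.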
HER provides abundant positive-reward training data. UVFA generalizes that data across the continuous goal space. Together: even if you've only successfully reached 100 specific goals during training, the UVFA can guide the agent toward any goal in the workspace. HER solves the data problem; UVFA solves the generalization problem.
Goal g = target state works for reaching tasks. But how do you specify "sort the blocks by color" as a state vector? You can't — it's a semantic concept that doesn't reduce to a single target configuration (there are many valid sorted arrangements). The solution: express goals in natural language.
π(a | s, l) where l is a natural language instruction. The policy conditions on text instead of a state vector. Examples: "Pick up the red cup", "Push the block left", "Stack the blue on the green."
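Same GCRL framework, but replace the goal encoder ψ(g) with a language encoder, written here as ψ_lang(l), that maps the instruction to an embedding. The policy becomes π_θ(a | s, ψ_lang(l)).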
The language model maps semantically similar instructions to nearby embeddings. "Pick up the red cup" and "Grab the red mug" produce similar vectors → similar policies. This gives compositional generalization: if the policy knows "pick up" and knows "red cup", it can handle "pick up the red cup" even without seeing that exact combination.
| Goal Space | Expressiveness | Generalization | HER-Compatible? |
|---|---|---|---|
| One-hot task ID | Low (fixed K tasks) | None (can't interpolate) | No |
| State vector | Medium (spatial goals) | Continuous (nearby goals) | Yes (relabel with s') |
| Image | High (visual goals) | Good (visual similarity) | Yes (relabel with current image) |
| Language | Highest (any concept) | Compositional (new combos) | Partial (need captioner) |
State vectors can express "reach position X." Images can express "make it look like this." But only language can express: "Put the heavy things on the bottom shelf and the fragile things on the top shelf, but keep the chemicals away from food." Language allows arbitrary compositional constraints that no fixed-dimensional goal vector can represent.
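Modern VLAs like RT-2, Octo, and OpenVLA are language-conditioned, goal-conditioned policies, typically trained by imitation rather than RL. The architecture follows the same pattern: a vision encoder processes the camera image (the state), a language model encodes the instruction (the goal), and the policy outputs actions conditioned on both.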
With state goals: relabel g' = s_T (trivial). With language goals: what language instruction does s_T correspond to? You need a captioner: a model that looks at the achieved state and generates a description. "The robot's arm moved to the middle of the table" → relabel with "move arm to center." This is called language HER (LHER) and requires a separate language grounding model.
Multi-task RL isn't a free lunch. Sharing one network across many tasks creates optimization conflicts. Here are the three core problems and their solutions.
When training on task B actively hurts performance on task A. This happens when the optimal features for A and B conflict — the network can't represent both well simultaneously. Example: "push left" and "push right" need opposite motor patterns in the last layer but the same features in early layers.
Even without negative transfer, tasks may have conflicting gradients. If task A's gradient points north and task B's points south with equal magnitude, the naive sum is zero: no progress on either task.
PCGrad (Yu et al., 2020) detects conflicting gradients and projects away the conflicting component. When g_A · g_B < 0:

$$g_A' = g_A - \frac{g_A \cdot g_B}{\lVert g_B \rVert^{2}} \, g_B$$

After projection, g_A' is orthogonal to g_B, so to first order it helps task A without hurting task B.
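The projection itself is a couple of lines. The sketch below shows only the pairwise step for two flattened gradient vectors; the full PCGrad procedure applies it to each task against the others in random order, which is omitted here. The function name is illustrative.

```python
import numpy as np


def project_conflicting(g_a, g_b):
    """Remove from g_a its component along g_b, but only when they conflict."""
    dot = float(g_a @ g_b)
    if dot >= 0.0:
        return g_a.copy()                      # no conflict: leave g_a unchanged
    return g_a - (dot / float(g_b @ g_b)) * g_b


g_a = np.array([1.0, 1.0])
g_b = np.array([-1.0, 0.5])                    # g_a . g_b = -0.5, so they conflict
g_a_projected = project_conflicting(g_a, g_b)  # [0.6, 1.2]
```

After the projection the dot product with g_b is zero, so a small step along the projected gradient does not change task B's loss to first order.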
Some tasks are harder than others. With uniform sampling, easy tasks that are already near-optimal contribute small, stable gradients, which is fine on its own. The real issue: hard tasks have large, noisy gradients that overwhelm the stable gradients from easy tasks.
Common task-sampling strategies:
• Uniform sampling: p(task_i) = 1/K. Simple, often works.
• Performance-based: sample harder tasks more often (lower current reward = higher sampling probability).
• Gradient-based: weight tasks inversely to gradient magnitude (GradNorm, Chen et al. 2018).
• Uncertainty-based: sample tasks where the policy is most uncertain (active learning).
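Performance-based sampling can be sketched as a softmax over task difficulty. The specific form here (difficulty = 1 − success rate, with a temperature) is an assumption chosen for illustration, not a scheme prescribed by the text or by a particular paper.

```python
import numpy as np


def task_sampling_probs(success_rates, temperature=0.5):
    """Sampling probabilities that favor tasks with lower current success."""
    difficulty = 1.0 - np.asarray(success_rates, dtype=float)
    logits = difficulty / temperature
    logits -= logits.max()            # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


success_rates = [0.9, 0.5, 0.1]       # assumed per-task success estimates
probs = task_sampling_probs(success_rates)

rng = np.random.default_rng(0)
next_task = rng.choice(len(success_rates), p=probs)
```

A high temperature pushes the distribution back toward uniform sampling, while a low one concentrates training on the hardest task, so the temperature controls how aggressively the curriculum reacts.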
When improving on a new task causes the network to "forget" how to do previously learned tasks. The new task's gradient updates overwrite weights that were important for old tasks. This is especially severe when tasks are trained sequentially rather than simultaneously.
Mitigations: (1) Experience replay — mix old task data into training. (2) EWC (Elastic Weight Consolidation) — penalize changes to weights important for old tasks. (3) Multi-head architectures — separate output heads per task, shared backbone.
| Problem | Symptom | Solution |
|---|---|---|
| Negative transfer | Multi-task worse than specialist | Task grouping, modular networks |
| Gradient conflict | Training stalls | PCGrad, gradient surgery |
| Catastrophic forgetting | Old tasks degrade over time | EWC, replay, multi-head |
| Task imbalance | Easy tasks converge, hard tasks don't | Performance-weighted sampling |
We started with a simple question: how do you scale RL from one task to many? The answer unfolded in three layers:
• Multi-task RL: Share one policy across multiple MDPs via a task identifier. Amortize learning shared structure.
• Goal-conditioned RL: Make the task identifier a goal state. Get continuous task spaces and zero-shot generalization for free.
• HER: Solve the sparse reward problem by relabeling failed trajectories as successes for a different goal. Turn every trajectory into useful training data.
HER is not just an engineering trick — it's a fundamental shift in how we think about RL data. Traditional RL: "this trajectory failed, throw it away." HER: "this trajectory succeeded at something — figure out what." It's the same transition viewed through a different lens. This principle (relabeling data post-hoc) appears throughout modern ML: contrastive learning, data augmentation, hindsight relabeling in language models.
| Aspect | Single-Task RL | Multi-Task RL | Goal-Conditioned RL |
|---|---|---|---|
| Policy | π(a \| s) | π(a \| s, z) | π(a \| s, g) |
| Task space | One task | Finite set (K tasks) | Continuous (any goal) |
| New task cost | Full retraining | Add to training set | Zero-shot (just change g) |
| Reward | r(s,a) | r_i(s,a) | r(s,a,g) = 1[s'∈G] |
| Data efficiency | Low (no sharing) | Medium (shared features) | High (HER augmentation) |
| Generalization | None | Within task set | To unseen goals |
| Limitation | One skill only | Fixed task catalog | Sparse reward problem |
RT-2, Octo, and OpenVLA are goal-conditioned policies where the "goal" is expressed in language and the "state" is a camera image. The same framework we built in this lecture — conditioning on a task specification, learning shared representations, generalizing to new tasks — is exactly what makes VLAs work. The scale changed (billion-parameter language models as goal encoders), but the principle is identical.
If you have demonstrations for many tasks, you can do multi-task imitation learning instead of multi-task RL. The policy π(a|s,z) is trained via supervised learning on (s, a, z) triples from demonstrations. This is exactly what large-scale robot learning (BC-Z, RT-1) does. Advantage: no reward engineering needed. Disadvantage: can't exceed demonstrator quality without RL fine-tuning.
Offline RL trains policies from fixed datasets without any environment interaction. Apply HER to an offline dataset: every trajectory (even failed ones) becomes training data for goals it actually reached. This makes offline goal-conditioned RL surprisingly effective: you can learn goal-reaching policies from random exploration data. (See WGCSL, Yang et al., 2022, which builds on hindsight relabeling, and GoFAR, Ma et al., 2022, a relabeling-free alternative.)
| Paradigm | Task Specification | Key Paper | Modern Instance |
|---|---|---|---|
| Single-task RL | Hard-coded reward | DQN (2015) | Game-playing agents |
| Multi-task RL | Task ID / one-hot | Distral (2017) | MT-Opt |
| Goal-conditioned RL | State vector | UVFA (2015), HER (2017) | Robotic reaching |
| Language-conditioned RL | Natural language | Shridhar et al. (2022) | RT-2, Octo, OpenVLA |
From fixed reward functions → to parameterized goals → to language instructions. Each step makes the task space richer and the policy more general. The endpoint is a policy that takes arbitrary language as its goal and can do anything expressible in natural language. We're not there yet — but the framework is in place.