Stanford CS 224R · Lecture 12 · Deep RL

Multi-Task & Goal-Conditioned RL

One policy, every task. Turn failures into free training data. After this, you build generalist agents.

Multi-task formulation · Goal-conditioned Bellman · HER derivation · Foundation for VLAs

Chapter 01

Why Multi-Task?

You've trained a robot arm to reach position (0.5, 0.3, 0.2). It took 500,000 environment steps — about 3 days of wall-clock time on expensive hardware. Your collaborator says: "Great. Now make it reach (0.7, 0.1, 0.4). And (0.2, 0.8, 0.1). And 97 more positions."

One hundred reaching targets. Half a million steps each. That's 50 million total steps — roughly 300 days of robot time. But these tasks are nearly identical! The arm dynamics are the same. The physics is the same. The only thing that changes is the target location.

You're paying full price a hundred times for knowledge that overlaps 95%.

The Core Insight

Single-task RL trains one policy per task. But most real-world task families share enormous structure — motor dynamics, physics, spatial reasoning. Multi-task RL trains one policy that takes a task identifier as input and performs any task on demand. Shared structure is learned once and amortized across all tasks.

The Specialist vs. Generalist Trade-off

Let's make this precise. Suppose you have K tasks, each requiring N steps as an independent specialist. The tasks share fraction f of their underlying knowledge (motor control, dynamics, spatial reasoning). A generalist policy learns the shared fraction once, paying approximately c · N for the shared component (where c > 1 accounts for multi-task interference), and then each task's unique component costs (1−f) · N:

Training Cost Comparison

Specialist total: $K \cdot N$
Generalist total: $c \cdot N + K \cdot (1-f) \cdot N$

The generalist wins when $c \cdot N + K(1-f)N < K \cdot N$, that is, when $f > c/K$. For K=100 and c=3, any f > 3% suffices!

For robot manipulation with K=100 tasks, c=3, and f=0.8 (80% shared structure), the generalist needs about 23N steps against 100N for the specialists, roughly 4x fewer total steps.
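The cost model is easy to check numerically. The sketch below only evaluates the two totals defined above; the function names are illustrative, N = 500,000 is taken from the opening example, and c = 3 is the interference factor used in the text.

```python
def specialist_cost(K, N):
    """Total steps for K independent specialists."""
    return K * N

def generalist_cost(K, N, f, c):
    """Shared component paid once (c * N) plus each task's unique part."""
    return c * N + K * (1 - f) * N

def break_even_f(K, c):
    """Smallest shared fraction at which the generalist is cheaper.

    Solving c*N + K*(1-f)*N < K*N for f gives f > c / K.
    """
    return c / K

K, N, c = 100, 500_000, 3.0
for f in (0.02, 0.05, 0.5, 0.8):
    spec = specialist_cost(K, N)
    gen = generalist_cost(K, N, f, c)
    print(f"f={f:.2f}  specialist={spec:.3g}  generalist={gen:.3g}  "
          f"speedup={spec / gen:.2f}x")
print("break-even f:", break_even_f(K, c))
```

The model is deliberately crude: it treats interference as a single constant c and assumes the shared and unique parts add linearly.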

Specialist (red) vs Generalist (gold) training cost as task count grows

Transfer Learning: The Deeper Benefit

Data efficiency is only half the story. Multi-task training also produces better features. When a network must solve many related tasks simultaneously, it can't memorize any single task's quirks — it must learn generalizable representations. This acts as a powerful regularizer.

Concrete evidence: In MT-Opt (Google, 2021), a single multi-task policy trained on 12 manipulation tasks outperformed 12 individual specialists on 10 of the 12 tasks. The generalist didn't just match — it beat the specialists, because the shared data gave it richer features.

Example — Feature Reuse

Task A: "Push block to left." Task B: "Push block to right." Both require learning: (1) how to approach the block, (2) how to make contact, (3) how to maintain contact while moving. Only the direction vector differs. A generalist learns the shared approach-contact-push pipeline once. A specialist learns it from scratch each time.

When Multi-Task Fails

Multi-task RL has failure modes. If tasks are genuinely unrelated (chess + protein folding), sharing a network forces competition for capacity with no benefit. Negative transfer occurs when optimizing one task actively hurts another. We'll cover mitigation strategies in Chapter 10.

Chapter 02

Multi-Task RL Formulation

Standard RL has one MDP. Multi-task RL has a family of MDPs that share some structure. Let's formalize this precisely.

Definition
Task Family — {Mi}i=1..K

A set of K MDPs, where each $M_i = (S_i, A_i, T_i, r_i, \gamma)$. The MDPs may differ in any combination of: state space S, action space A, transition dynamics T, reward function r. They share a policy parameterization $\pi_\theta$.

Definition
Task Identifier — zi

A vector that uniquely specifies task i. Can be a one-hot index, a continuous embedding, a goal state, or a natural language string. The policy conditions on z: $\pi_\theta(a \mid s, z_i)$. Change z, change the behavior.

The Multi-Task Objective

With K tasks, each with its own reward function $r_i$ and trajectory distribution $p_\theta^{(i)}(\tau)$, the multi-task objective maximizes the average (or weighted sum) of all task objectives:

Multi-Task RL Objective

$$\max_\theta J_{MT}(\theta) = \sum_{i=1}^{K} w_i \, J_i(\theta), \qquad J_i(\theta) = \mathbb{E}_{\tau \sim p_\theta^{(i)}(\tau)}\Big[ \sum_t r_i(s_t, a_t) \Big]$$

$w_i$ = task weight (uniform: $w_i = 1/K$). Controls which tasks matter more.

Symbol by symbol:

θ — shared policy parameters (one neural network for all tasks)

Ji(θ) — expected return on task i under the current policy

wi — task weight. Uniform by default. Can be tuned to prioritize harder tasks.

ri(s,a) — reward function specific to task i
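In practice the objective is estimated from sampled episode returns. A minimal sketch, assuming each task's return is estimated by a plain Monte Carlo average; the function name and the example numbers are illustrative only.

```python
def multitask_objective(task_returns, weights=None):
    """Weighted sum of per-task average returns: J_MT = sum_i w_i * J_i.

    task_returns: list of lists; task_returns[i] holds sampled episode
                  returns for task i (a Monte Carlo estimate of J_i).
    weights:      optional task weights w_i; defaults to uniform 1/K.
    """
    K = len(task_returns)
    if weights is None:
        weights = [1.0 / K] * K
    J = [sum(r) / len(r) for r in task_returns]  # per-task estimate of J_i
    return sum(w * j for w, j in zip(weights, J))

returns = [
    [1.0, 0.0, 1.0, 1.0],  # task 0: mostly solved
    [0.0, 0.0, 1.0, 0.0],  # task 1: harder
    [0.0, 0.0, 0.0, 0.0],  # task 2: unsolved
]
uniform = multitask_objective(returns)
reweighted = multitask_objective(returns, weights=[0.2, 0.3, 0.5])
print(uniform, reweighted)
```

Shifting weight toward the unsolved task lowers the reported objective, which is the point: the weights decide which tasks the optimizer is asked to care about.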

What Can Differ Across Tasks?

| Component | Shared? | Example |
|---|---|---|
| State space S | Usually yes | Same robot joints + sensors |
| Action space A | Usually yes | Same motor commands |
| Dynamics $T(s' \mid s, a)$ | Often yes | Same physics engine |
| Reward r(s,a) | NO (this differs) | "Reach left" vs "reach right" |
| Initial state $p(s_0)$ | Often yes | Same starting position |

The most common scenario: same state/action/dynamics, different reward. This is exactly the setup where goal-conditioned RL shines — the reward is determined entirely by the goal.

Network Architecture: How to Condition on z

Multi-task policy architecture: state + task ID → shared trunk → action

The standard approach: concatenate the task identifier z with the state s, and feed [s; z] into a shared neural network. The first few layers learn task-agnostic features (physics, spatial reasoning). Later layers specialize based on z.

Conditioning via Concatenation

$$\pi_\theta(a \mid s, z) = \text{softmax}\big( f_\theta([s; z]) \big)$$

[s; z] = concatenation of state vector and task identifier

Why Concatenation Works

By feeding z alongside s, the network can learn: "when z says 'reach left', use the left-reaching motor patterns." The shared trunk still processes s through generic physics-aware layers. Only the later layers interpret z to produce task-specific behavior. This is the same idea as conditioning in diffusion models or class-conditional image generation.
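A forward pass through such a policy can be sketched in a few lines. This is a toy with random, untrained weights and a discrete action set; the class name, layer sizes, and one-hot task IDs are assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, x) for x in v]

def init_layer(n_out, n_in, rng):
    W = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

class ConcatPolicy:
    """pi(a | s, z) = softmax(f([s; z])) with one hidden layer."""

    def __init__(self, state_dim, task_dim, hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.W1, self.b1 = init_layer(hidden, state_dim + task_dim, rng)
        self.W2, self.b2 = init_layer(n_actions, hidden, rng)

    def action_probs(self, s, z):
        x = list(s) + list(z)                  # [s; z]: plain concatenation
        h = relu(linear(self.W1, self.b1, x))  # shared trunk
        return softmax(linear(self.W2, self.b2, h))

policy = ConcatPolicy(state_dim=2, task_dim=3, hidden=8, n_actions=4)
s = [0.5, 0.3]
p_task0 = policy.action_probs(s, [1, 0, 0])  # same state, task 0
p_task1 = policy.action_probs(s, [0, 1, 0])  # same state, task 1
print(p_task0)
print(p_task1)
```

The only place the task enters is the input vector; everything downstream is shared, which is what lets the trunk reuse features across tasks.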

Chapter 03

Goal-Conditioned RL

Multi-task RL with an abstract task identifier z is powerful but raises a question: what should z be? A one-hot vector over 100 tasks? That doesn't generalize to task 101. A random embedding? Meaningless without training data for that embedding.

The elegant answer: let z be the goal state itself.

Definition
Goal-Conditioned Policy — π(a | s, g)

A policy that takes the current state s and a goal g as input, and outputs an action that moves the agent toward g. The goal g can be a desired state (e.g., target position), a set of desired states, or a state predicate (e.g., "block is on shelf").

This is a specific instance of multi-task RL where:

• The task identifier z = goal g

• The reward is goal-dependent: r(s, a, g) = f(s', g)

• Every possible goal defines a different task

Why Goals Are the Natural Task Space

Goals have a crucial property that arbitrary task IDs lack: metric structure. Goal (0.5, 0.3) is "close to" goal (0.5, 0.4) in a meaningful way. If the policy knows how to reach (0.5, 0.3), reaching (0.5, 0.4) should require only a small adjustment. This gives us:

Zero-Shot Generalization

A goal-conditioned policy trained on goals sampled from some distribution can generalize to unseen goals — goals it was never trained on — because the goal space has continuous structure. This is impossible with one-hot task IDs. You can't interpolate between one-hot vectors meaningfully.

Types of Goal Specifications

| Goal Type | Form of g | Example | Reward |
|---|---|---|---|
| Target state | g ∈ S | End-effector at (0.3, 0.7, 0.2) | $-\lVert s' - g \rVert$ |
| State predicate | g: S → {0,1} | "Block is on shelf" | 1[g(s') = 1] |
| Image goal | g = image | Photo of desired scene | $-\lVert \text{embed}(s') - \text{embed}(g) \rVert$ |
| Language goal | g = text | "Pick up red cup" | Learned reward model |

Goal-conditioned policy: same state, different goals → different actions

Continuous Goal Space

A robot arm with 3D end-effector position. State s = current (x, y, z). Goal g = desired (xg, yg, zg). The goal space is the full 3D workspace of the arm — an infinite set of tasks, all sharing the same dynamics. One policy handles all of them.

Definition
Universal Policy

A goal-conditioned policy π(a|s,g) is called universal if it can achieve any reachable goal g in the goal space. "Reachable" means there exists some sequence of actions from s to g under the dynamics. The policy is a single function that implements all reachable tasks.

Chapter 04

Goal-Conditioned Bellman Equations

Standard Q-learning uses Q(s, a) — the expected return from taking action a in state s. Goal-conditioned Q-learning extends this to Q(s, a, g) — the expected return from taking a in s when trying to reach goal g.

Definition
Goal-Conditioned Q-Function — Q(s, a, g)

The expected cumulative reward starting from state s, taking action a, and following policy π thereafter, with the goal of reaching g:

$$Q^\pi(s, a, g) = \mathbb{E}_\pi\Big[ \sum_{k=0}^{\infty} \gamma^k \, r(s_{t+k}, a_{t+k}, g) \;\Big|\; s_t = s,\ a_t = a \Big]$$

Definition
Goal-Conditioned Value Function — V(s, g)

Expected return from state s when pursuing goal g under policy π:

$$V^\pi(s, g) = \mathbb{E}_{a \sim \pi(\cdot \mid s, g)}\big[ Q^\pi(s, a, g) \big]$$

The Goal-Conditioned Bellman Equation

Q(s, a, g) satisfies the standard Bellman recursion, but with goal-dependent reward:

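Goal-Conditioned Bellman Equation

$$Q(s, a, g) = r(s, a, g) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\big[ V(s', g) \big]$$

$$Q(s, a, g) = r(s, a, g) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\big[ \max_{a'} Q(s', a', g) \big]$$

The second line is the Bellman optimality form, in which V(s', g) is replaced by the max over next actions (used for Q-learning / DQN).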

Symbol by symbol:

Q(s, a, g) — value of taking action a in state s while pursuing goal g

r(s, a, g) — immediate reward. Depends on the goal g.

γ — discount factor (typically 0.98-0.99). How much we value future rewards.

V(s', g) — value of the next state s' (still pursuing same goal g)

T(s'|s,a) — environment dynamics (unknown, estimated by sampling)
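To see the recursion at work, the sketch below runs Bellman optimality backups on a toy problem. The 6-state chain, deterministic dynamics, and the choice to stop bootstrapping once the goal is reached are all assumptions made for illustration; with deterministic dynamics the expectation over s' collapses to a single next state.

```python
N_STATES = 6          # states 0..5 on a line (hypothetical toy environment)
ACTIONS = (-1, +1)    # step left or right
GAMMA = 0.98

def step(s, a):
    """Deterministic dynamics T: move and clip to the ends of the chain."""
    return min(max(s + a, 0), N_STATES - 1)

def reward(s, a, g):
    """Sparse goal reward: 1 if the next state is the goal, else 0."""
    return 1.0 if step(s, a) == g else 0.0

def bellman_sweep(Q):
    """Apply Q(s,a,g) = r(s,a,g) + gamma * max_a' Q(s',a',g) to every entry."""
    new_Q = {}
    for g in range(N_STATES):
        for s in range(N_STATES):
            for a in ACTIONS:
                s_next = step(s, a)
                # Assumption: reaching the goal ends the episode (no bootstrap).
                if s_next == g:
                    future = 0.0
                else:
                    future = max(Q[(s_next, a2, g)] for a2 in ACTIONS)
                new_Q[(s, a, g)] = reward(s, a, g) + GAMMA * future
    return new_Q

Q = {(s, a, g): 0.0
     for s in range(N_STATES) for a in ACTIONS for g in range(N_STATES)}
for _ in range(N_STATES):  # enough sweeps for value to cross the chain
    Q = bellman_sweep(Q)

# Value of stepping right from each state when the goal is the last state.
values_toward_goal = [Q[(s, +1, N_STATES - 1)] for s in range(N_STATES - 1)]
print(values_toward_goal)
```

Each sweep pushes value one step further from the goal, so the value of heading toward the goal decays as a power of γ with distance. The same table holds a separate value landscape for every goal g.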

Reward Design for Goal-Conditioned RL

The most common choice: sparse binary reward.

Sparse Goal Reward

$$r(s, a, g) = \mathbb{1}\big[ \lVert s' - g \rVert < \epsilon \big]$$ (1 if goal reached, 0 otherwise)

ε = tolerance threshold. s' = next state after taking a in s.

Alternative: dense negative distance.

Dense Goal Reward

$$r(s, a, g) = -\lVert s' - g \rVert_2$$ (negative L2 distance to goal)

Dense vs Sparse: A Fundamental Trade-off

Dense reward (distance) gives gradient signal at every step → easier optimization. But it requires knowing the right distance metric, and may incentivize suboptimal paths (going toward goal in Euclidean space might hit a wall). Sparse reward (binary success) is more general and doesn't impose metric assumptions, but provides almost no learning signal. This sparse reward problem is the central challenge of GCRL.
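Both reward choices are one-liners once states and goals are coordinate tuples. A minimal sketch, assuming Euclidean distance and a default tolerance of 0.02 (the function names are illustrative):

```python
import math

def sparse_reward(next_state, goal, eps=0.02):
    """1 if the next state lies within eps of the goal, else 0."""
    return 1.0 if math.dist(next_state, goal) < eps else 0.0

def dense_reward(next_state, goal):
    """Negative L2 distance from the next state to the goal."""
    return -math.dist(next_state, goal)

s_next, g = (0.5, 0.4), (0.5, 0.8)
print(sparse_reward(s_next, g))  # no signal: the state is 0.4 away
print(dense_reward(s_next, g))   # graded signal: about -0.4
```

Because both functions take the goal as an argument, either reward can be re-evaluated for any goal after the fact.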

Q-value propagation: Bellman backup from goal outward

Chapter 05

The Sparse Reward Problem

Binary goal reward sounds clean: +1 when you reach the goal, 0 otherwise. But think about what this means for learning. The agent takes random actions. In a continuous 3D workspace, what's the probability that a random trajectory happens to land within ε of a specific goal?

Hand Calculation — How Rare is Success?

Robot workspace: 1m × 1m × 1m cube. Goal tolerance ε = 2cm. A random trajectory's endpoint is roughly uniform in the workspace.

Volume of goal region: $\tfrac{4}{3}\pi (0.02)^3 \approx 3.35 \times 10^{-5}\ \text{m}^3$

Workspace volume: $1\ \text{m}^3$

P(random success): $3.35 \times 10^{-5} \approx 0.003\%$

With 100 random trajectories per training batch, the expected number of successes per batch is about 0.003. You need ~300 batches (about 30,000 trajectories) before seeing a single success by chance. At, say, 50 steps per trajectory, that is roughly 1.5 million environment steps with essentially no learning signal.
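The arithmetic can be reproduced directly. The sketch below keeps the same simplifying assumption as the hand calculation, namely that a random trajectory's endpoint is uniform over the workspace; the variable names are illustrative.

```python
import math

eps = 0.02                # goal tolerance in metres
workspace_volume = 1.0    # 1 m x 1 m x 1 m cube
batch_size = 100          # random trajectories per training batch

goal_volume = (4.0 / 3.0) * math.pi * eps ** 3
p_success = goal_volume / workspace_volume

expected_per_batch = batch_size * p_success
trajectories_per_success = 1.0 / p_success
batches_per_success = trajectories_per_success / batch_size

print(f"P(success)               = {p_success:.3e}")
print(f"expected successes/batch = {expected_per_batch:.4f}")
print(f"trajectories per success = {trajectories_per_success:,.0f}")
print(f"batches per success      = {batches_per_success:,.0f}")
```

Halving ε divides the success probability by eight, so tighter tolerances make the problem worse very quickly.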

Why Q-Learning Fails with Sparse Reward

Q-learning updates Q(s, a, g) using the Bellman target:

Q-Learning Update

$$Q(s, a, g) \leftarrow Q(s, a, g) + \alpha \big[ r(s, a, g) + \gamma \max_{a'} Q(s', a', g) - Q(s, a, g) \big]$$

The TD target is $r + \gamma \max_{a'} Q(s', a', g)$.

With sparse reward, r(s,a,g) = 0 for almost all transitions. And Q is initialized to 0. So the TD target is:

$$\text{TD target} = 0 + \gamma \cdot \max(0, 0, \dots, 0) = 0$$

The update becomes Q ← Q + α[0 − Q] = Q(1 − α). Q stays at zero. No gradient signal propagates. The only way to get signal is if the agent accidentally reaches the goal — which we just showed happens with probability 0.003%.

The Chicken-and-Egg Problem

Q-learning needs successful transitions to propagate value. But the policy needs non-zero Q-values to know which actions lead toward the goal. With sparse reward, you need success to learn, but you need learning to succeed. This circular dependency is why naive GCRL with sparse reward is essentially broken.

Random exploration in 2D: red = goal, blue = trajectories (almost never reach goal)
This Is Not Fixable by Training Longer

The probability of random success doesn't increase with training steps — it's a property of the goal size relative to the state space. In high dimensions (7-DOF robot arm = 14D state space), the ratio of goal volume to state volume is astronomically small. You could run for a billion steps and never see success. We need a fundamentally different approach.

Chapter 06

Hindsight Experience Replay

Here's the breakthrough insight (Andrychowicz et al., 2017). You tried to reach goal g = (1, 1) but your trajectory ended at (0.5, 0.8). A failure! r = 0 at every step. Worthless transition? No.

Ask a different question: "If my goal had been (0.5, 0.8), would this trajectory be a success?" Yes! The agent did reach (0.5, 0.8). It just didn't happen to be the goal we asked for.

The HER Insight: Relabel the Goal

Every trajectory is a success — for some goal. Take a failed trajectory, replace the original goal with the state actually reached, and you get a free positive-reward training example. You're not changing the environment. You're changing the question: "What goal would have made this trajectory successful?"

Why Relabeling Is Valid

This might feel like cheating. Are we corrupting the training signal? No. Here's why:

The transition (s, a, s') is physically real. It happened. The dynamics are the same regardless of what goal we assign. The only thing that changes is the reward label: r(s, a, gnew) instead of r(s, a, goriginal). Because reward is a function of (s, a, g) that we define, we can evaluate it for any g after the fact.

Q-learning is off-policy: it can learn from any transition (s, a, r, s') regardless of what policy generated it or what goal was being pursued. The Bellman equation doesn't care why action a was taken — only what happened as a result.

Definition
Off-Policy Validity of HER

For any transition $(s_t, a_t, s_{t+1})$ and any goal g:

$$Q(s_t, a_t, g) \leftarrow Q(s_t, a_t, g) + \alpha \big[ r(s_t, a_t, g) + \gamma \max_{a'} Q(s_{t+1}, a', g) - Q(s_t, a_t, g) \big]$$

This update is valid for any g, not just the goal that was being pursued when $a_t$ was taken. This is the off-policy property that makes HER work.

The HER Algorithm

Algorithm: Hindsight Experience Replay (HER)
  1. For each episode:
    1. Sample goal g from goal distribution
    2. Collect trajectory τ = (s0, a0, s1, ..., sT) pursuing g
    3. Store original transitions in replay buffer: (st, at, r(st,at,g), st+1, g) for all t
    4. Hindsight relabeling: For each transition t, select k additional goals g' using a strategy:
      • future: g' = st' for random t' > t (a state visited later in same episode)
      • final: g' = sT (the final state reached)
      • episode: g' = st' for random t' in episode
    5. Store relabeled transitions: (st, at, r(st,at,g'), st+1, g'), with the reward recomputed for g' (r = 1 whenever st+1 lies within ε of g', otherwise 0)
  2. Train Q(s,a,g) using transitions sampled from the replay buffer (mix of original + relabeled)
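The relabeling step of the algorithm above can be sketched as a single function. This is an illustrative version, not the reference implementation: the function names, the tuple layout, the tolerance of 0.1, and the short example trajectory are assumptions, and goals are drawn with replacement.

```python
import math
import random

def sparse_reward(next_state, goal, eps=0.1):
    return 1.0 if math.dist(next_state, goal) < eps else 0.0

def her_relabel(states, actions, goal, k=4, strategy="future", rng=random):
    """Return original plus hindsight-relabeled transitions for one episode.

    states:  [s_0, ..., s_T]      (T + 1 states)
    actions: [a_0, ..., a_{T-1}]
    Each stored transition is (s_t, a_t, r, s_{t+1}, g).
    """
    T = len(actions)
    buffer = []
    for t in range(T):
        s, a, s_next = states[t], actions[t], states[t + 1]
        # Original transition, rewarded against the goal that was pursued.
        buffer.append((s, a, sparse_reward(s_next, goal), s_next, goal))
        if strategy == "final":      # single goal: the last state reached
            new_goals = [states[T]]
        elif strategy == "future":   # states visited after step t
            new_goals = [states[rng.randint(t + 1, T)] for _ in range(k)]
        elif strategy == "episode":  # any state from the episode
            new_goals = [states[rng.randint(0, T)] for _ in range(k)]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        for g_new in new_goals:
            # The reward is recomputed for the new goal; it is 1 only when
            # s_{t+1} actually lies within eps of g_new.
            buffer.append((s, a, sparse_reward(s_next, g_new), s_next, g_new))
    return buffer

states = [(0.0, 0.0), (0.3, 0.1), (0.5, 0.4), (0.5, 0.8)]
actions = ["right", "up-right", "up"]
buf = her_relabel(states, actions, goal=(1.0, 1.0), strategy="final")
for transition in buf:
    print(transition)
```

With the "final" strategy only the last transition earns a reward of 1 in this example. The earlier relabeled transitions still carry r = 0 and rely on Bellman backups to receive value from that final success.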

Why "Future" Strategy Works Best

The future strategy (g' = a state visited later in the trajectory) has a key property: the agent actually reached g' from st in some number of steps. This means the relabeled transition provides not just a success signal, but a temporally consistent one — there's a path from st to g' that the agent demonstrated.

HER relabeling: original trajectory (failed) → relabeled goals (all successes)
Concrete Numbers

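Episode length T = 50 steps. With k = 4 relabeled goals per transition using the "future" strategy, the original replay buffer holds 50 transitions (all r=0 for a failed episode). After HER it holds 50 + 200 = 250 transitions, so 80% of the buffer is relabeled data. A relabeled transition earns r=1 only when its new goal is (within ε of) the very next state. In expectation that happens for about k · H_50 ≈ 4 × 4.5 = 18 of them, where H_50 = 1 + 1/2 + ... + 1/50 is the 50th harmonic number, so the positive-reward fraction rises from ~0% to roughly 7%. Just as important, every relabeled goal is a state the trajectory actually reached, so each reward sits at the end of a demonstrated path and Bellman backups can carry its value back to the earlier transitions.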

Chapter 07

HER Hand Calculation

Let's work through a complete example to make HER concrete. A 2D robot arm with end-effector position as state.

Setup

State: s = (x, y) position of end-effector

Goal: g = (1.0, 1.0) — reach this point

Reward: r(s, a, g) = 1 if ||s' − g|| < 0.1, else 0

Discount: γ = 0.98

Learning rate: α = 0.1

Q initialized to 0 everywhere

The Failed Trajectory

Trajectory (pursuing g = (1.0, 1.0))

s0 = (0.0, 0.0), a0 = "move right" → s1 = (0.3, 0.1)
s1 = (0.3, 0.1), a1 = "move up-right" → s2 = (0.5, 0.4)
s2 = (0.5, 0.4), a2 = "move up" → s3 = (0.5, 0.8)

Final state (0.5, 0.8). Distance to goal: ||(0.5,0.8)-(1.0,1.0)|| = 0.54 > 0.1. FAILURE.

Step 1: Original Q-updates (all zeros)

For transition (s2, a2, s3) with g = (1.0, 1.0):

r(s2, a2, g) = 1[||(0.5,0.8) − (1.0,1.0)|| < 0.1] = 1[0.54 < 0.1] = 0

TD target = 0 + 0.98 · maxa' Q(s3, a', g) = 0 + 0.98 · 0 = 0

Q(s2, a2, g) ← 0 + 0.1 · [0 − 0] = 0

No learning. Same for all other transitions. All Q-values stay at 0.

Step 2: HER Relabeling (free successes!)

Using final strategy: relabel with g' = s3 = (0.5, 0.8).

Now re-evaluate transition (s2, a2, s3) with g' = (0.5, 0.8):

HER Relabeled Update

r(s2, a2, g') = 1[||(0.5,0.8) − (0.5,0.8)|| < 0.1] = 1[0 < 0.1] = 1

TD target = 1 + 0.98 · maxa' Q(s3, a', g') = 1 + 0 = 1

Q(s2, a2, g') ← 0 + 0.1 · [1 − 0] = 0.1

Non-zero Q-value! The agent has learned: "action 'move up' in state (0.5, 0.4) is good for reaching (0.5, 0.8)."

Step 3: Bellman Propagation Backward

Now process transition (s1, a1, s2) with g' = (0.5, 0.8):

r(s1, a1, g') = 1[||(0.5,0.4) − (0.5,0.8)|| < 0.1] = 1[0.4 < 0.1] = 0

TD target = 0 + 0.98 · maxa' Q(s2, a', g') = 0.98 · 0.1 = 0.098

Q(s1, a1, g') ← 0 + 0.1 · [0.098 − 0] = 0.0098

Value is propagating backward from the goal. After many iterations of replay, the entire trajectory gets non-zero Q-values for goal (0.5, 0.8). The policy learns: "to reach (0.5, 0.8), move right, then up-right, then up."
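The three updates in this hand calculation can be replayed with a dictionary-backed Q-table. A minimal sketch using the setup's constants (α = 0.1, γ = 0.98, ε = 0.1); the helper names are illustrative, and unseen entries default to 0 to match the zero initialization.

```python
import math

ALPHA, GAMMA, EPS = 0.1, 0.98, 0.1

def reward(s_next, g):
    return 1.0 if math.dist(s_next, g) < EPS else 0.0

def q_update(Q, s, a, s_next, g, actions):
    """One tabular Q-learning step on transition (s, a, s_next) for goal g."""
    max_next = max(Q.get((s_next, a2, g), 0.0) for a2 in actions)
    target = reward(s_next, g) + GAMMA * max_next
    old = Q.get((s, a, g), 0.0)
    Q[(s, a, g)] = old + ALPHA * (target - old)
    return Q[(s, a, g)]

actions = ["right", "up-right", "up"]
s0, s1, s2, s3 = (0.0, 0.0), (0.3, 0.1), (0.5, 0.4), (0.5, 0.8)
g, g_her = (1.0, 1.0), s3   # original goal and hindsight goal

Q = {}
q_orig = q_update(Q, s2, "up", s3, g, actions)            # Step 1: stays 0.0
q_last = q_update(Q, s2, "up", s3, g_her, actions)        # Step 2: 0.1
q_prev = q_update(Q, s1, "up-right", s2, g_her, actions)  # Step 3: about 0.0098
print(q_orig, q_last, q_prev)
```

The order of the last two updates matters: the value for (s1, a1) is non-zero only because (s2, a2) was updated first, which is why replaying transitions many times fills in the whole path.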

The Magic of HER

Without HER: 0 successful transitions, 0 gradient signal, no learning. With HER: every trajectory provides positive reward for the states it actually visited. Q-values propagate. The policy improves. And as the policy improves, it reaches states closer to the real goal, generating genuinely useful signal for the original task.

Q-value propagation after HER relabeling (gold = high Q, dark = low Q)
🔨 Derivation: Derive the Positive-Reward Ratio in HER

Episode length T, k relabeled goals per transition using "future" strategy. Derive: (1) Total transitions in replay buffer. (2) Expected number with r=1. (3) The positive-reward ratio as a function of T and k.

Assumptions: the episode failed, so all T original transitions have r=0; each relabeled goal is drawn uniformly (with replacement) from the states visited after step t; and successive states are more than ε apart, so a relabeled transition earns r=1 only when its goal is the immediate next state.

(1) Total transitions: T (original) + k·T (relabeled) = T(1 + k).

(2) Expected positives: the transition at time t (t = 0, ..., T−1) draws its goal from {s_{t+1}, ..., s_T}, which has T−t candidates. Only g' = s_{t+1} gives immediate reward, with probability 1/(T−t). Other choices (g' = s_{t+2}, etc.) give r=0 for this transition; their success appears at a later transition of the same trajectory. Summing over t and the k draws:

$$\mathbb{E}[\text{number with } r=1] = k \sum_{t=0}^{T-1} \frac{1}{T-t} = k \, H_T \approx k(\ln T + 0.577)$$

where $H_T$ is the T-th harmonic number.

(3) Positive ratio:

$$\frac{k \, H_T}{T(1+k)}$$

For T=50 ($H_{50} \approx 4.50$): k=4 gives 18/250 ≈ 7.2%; k=8 gives 36/450 ≈ 8.0%. Note that k/(1+k) (80% for k=4, 89% for k=8) is the fraction of the buffer that is relabeled, not the fraction with r=1.

For comparison, the "final" strategy relabels every transition with g' = s_T, so only the last transition earns r=1: one positive per episode under the same assumptions. When successive states lie within ε of each other, more relabeled transitions succeed, so these figures are lower bounds.
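A quick simulation helps check how many relabeled transitions actually carry r=1 under the "future" strategy. The sketch assumes a failed episode, goals drawn uniformly with replacement from the states after step t, and a reward of 1 only when the drawn goal is exactly the next state; under those assumptions the expected count is k times the harmonic number H_T. Function names are illustrative.

```python
import random

def harmonic(n):
    return sum(1.0 / i for i in range(1, n + 1))

def expected_positive_ratio(T, k):
    """Expected fraction of buffer transitions with r = 1 (future strategy)."""
    return k * harmonic(T) / (T * (1 + k))

def simulated_positive_ratio(T, k, episodes=2000, seed=0):
    rng = random.Random(seed)
    positives = 0
    for _ in range(episodes):
        for t in range(T):
            for _draw in range(k):
                # Index of the future state chosen as the relabeled goal;
                # only the immediate next state yields a reward here.
                if rng.randint(t + 1, T) == t + 1:
                    positives += 1
    return positives / (episodes * T * (1 + k))

analytic = expected_positive_ratio(50, 4)
simulated = simulated_positive_ratio(50, 4)
relabeled_share = 4 / (1 + 4)  # share of the buffer that is relabeled data
print(f"analytic  positive ratio: {analytic:.4f}")
print(f"simulated positive ratio: {simulated:.4f}")
print(f"relabeled share of buffer: {relabeled_share:.2f}")
```

The two quantities printed at the end are easy to confuse: k/(1+k) counts relabeled transitions, while the positive ratio counts only those whose reward is 1.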

Chapter 08

Universal Value Function Approximation

So far we've discussed Q(s, a, g) conceptually. But how do we represent a function of three arguments that generalizes across all goals? A table? Impossible — the goal space is continuous. We need function approximation.

Definition
Universal Value Function Approximator (UVFA)

A neural network Qθ(s, a, g) that takes state, action, AND goal as input and outputs a scalar Q-value. "Universal" because one network handles all goals — it generalizes across the goal space via learned embeddings. (Schaul et al., 2015)

Architecture

The simplest UVFA architecture concatenates state and goal embeddings:

UVFA Architecture

$$Q_\theta(s, a, g) = \text{MLP}\big( [\phi(s);\ \psi(g);\ a] \big)$$

where $\phi: S \to \mathbb{R}^d$ (state encoder)
$\psi: G \to \mathbb{R}^d$ (goal encoder)

Both encoders trained end-to-end with Q-learning loss

Why separate encoders? The state might be a 14D joint configuration while the goal might be a 3D target position. They live in different spaces. The encoders map both into a common latent space where their relationship (how far the state is from achieving the goal) can be computed.

Zero-Shot Transfer to New Goals

This is the payoff: a UVFA trained on goals sampled from some distribution can evaluate Q(s, a, g) for goals never seen during training. If it was trained on goals uniformly in $[0,1]^2$ and you query g = (0.37, 0.82), a point it never specifically trained on, the network interpolates smoothly because the goal encoder ψ learned a continuous mapping.

Zero-Shot Example

Train UVFA on 1000 random goals in a 2D workspace. At test time, give it goal (0.42, 0.73) — never seen in training. The goal encoder produces an embedding. The Q-network evaluates which action from the current state moves toward that embedding. No additional training needed. This is analogous to how a language model generalizes to sentences it's never seen — the embedding space has learned the structure of the domain.

UVFA generalization: trained goals (gold dots) vs test goals (green dots) — smooth Q-value field

UVFA + HER: The Full Pipeline

Combining UVFA with HER gives a complete goal-conditioned RL system:

Algorithm: UVFA + HER (Complete Goal-Conditioned Q-Learning)
  1. Initialize Qθ(s, a, g) network and replay buffer B
  2. For each episode:
    1. Sample goal g ~ p(g)
    2. Collect trajectory {(st, at, st+1)} using ε-greedy on Qθ(s, a, g)
    3. Store in B: original transitions + HER-relabeled transitions
  3. For each training step:
    1. Sample minibatch from B: {(s, a, r, s', g)}
    2. Compute target: y = r + γ max_{a'} Q_{θ'}(s', a', g), where θ' are the parameters of a target network
    3. Update: θ ← θ − α ∇_θ (Q_θ(s,a,g) − y)²
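The data flow of this loop can be sketched end to end on a toy problem. Loud assumptions: a lookup table stands in for the network Q_θ, so the sketch shows collection, relabeling, and replay but none of the generalization a UVFA provides; there is no target network; the 6-state chain, hyperparameters, and function names are all hypothetical.

```python
import random

N, ACTIONS = 6, (-1, 1)           # toy chain with states 0..5
GAMMA, ALPHA, EPSILON = 0.98, 0.5, 0.2

def step(s, a):
    return min(max(s + a, 0), N - 1)

def reward(s_next, g):
    return 1.0 if s_next == g else 0.0

def greedy(Q, s, g, rng):
    best = max(Q.get((s, a, g), 0.0) for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q.get((s, a, g), 0.0) == best])

def run_episode(Q, g, rng, horizon=12):
    """Epsilon-greedy rollout from state 0 toward goal g."""
    s, transitions = 0, []
    for _ in range(horizon):
        explore = rng.random() < EPSILON
        a = rng.choice(ACTIONS) if explore else greedy(Q, s, g, rng)
        s_next = step(s, a)
        transitions.append((s, a, s_next))
        s = s_next
        if s == g:
            break
    return transitions

def train(episodes=300, k=4, updates_per_episode=40, seed=0):
    rng = random.Random(seed)
    Q, buffer = {}, []
    for _ in range(episodes):
        g = rng.randrange(N)                       # sample goal g ~ p(g)
        traj = run_episode(Q, g, rng)
        for t, (s, a, s_next) in enumerate(traj):
            buffer.append((s, a, reward(s_next, g), s_next, g))
            for _draw in range(k):                 # "future" relabeling
                g_new = traj[rng.randrange(t, len(traj))][2]
                buffer.append((s, a, reward(s_next, g_new), s_next, g_new))
        for _update in range(updates_per_episode): # replay from the buffer
            s, a, r, s_next, g_b = rng.choice(buffer)
            done = s_next == g_b                   # assumption: goal is terminal
            boot = 0.0 if done else max(
                Q.get((s_next, a2, g_b), 0.0) for a2 in ACTIONS)
            old = Q.get((s, a, g_b), 0.0)
            Q[(s, a, g_b)] = old + ALPHA * (r + GAMMA * boot - old)
    return Q

Q = train()
print(len(Q), "table entries; largest value", max(Q.values()))
```

Replacing the dictionary with a network that takes (s, a, g) as input, and computing targets from a slowly updated copy, turns this sketch into the loop listed in the algorithm.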
Why UVFA + HER is so Powerful

HER provides abundant positive-reward training data. UVFA generalizes that data across the continuous goal space. Together: even if you've only successfully reached 100 specific goals during training, the UVFA can guide the agent toward any goal in the workspace. HER solves the data problem; UVFA solves the generalization problem.

Chapter 09

Language-Conditioned Policies

Goal g = target state works for reaching tasks. But how do you specify "sort the blocks by color" as a state vector? You can't — it's a semantic concept that doesn't reduce to a single target configuration (there are many valid sorted arrangements). The solution: express goals in natural language.

Definition
Language-Conditioned Policy

π(a | s, l) where l is a natural language instruction. The policy conditions on text instead of a state vector. Examples: "Pick up the red cup", "Push the block left", "Stack the blue on the green."

Architecture: Goal Encoder = Language Model

Same GCRL framework, but replace the goal encoder ψ(g) with a language encoder:

Language-Conditioned Architecture

$$\pi_\theta(a \mid s, l) = \text{MLP}\big( [\phi(s);\ \text{LM}(l)] \big)$$

where LM: text → $\mathbb{R}^d$ (language encoder, e.g., CLIP, BERT, T5)
$\phi: S \to \mathbb{R}^d$ (state encoder)

LM is typically frozen (pretrained). Only φ and MLP are trained with RL.

The language model maps semantically similar instructions to nearby embeddings. "Pick up the red cup" and "Grab the red mug" produce similar vectors → similar policies. This gives compositional generalization: if the policy knows "pick up" and knows "red cup", it can handle "pick up the red cup" even without seeing that exact combination.

Language as Universal Goal Space

| Goal Space | Expressiveness | Generalization | HER-Compatible? |
|---|---|---|---|
| One-hot task ID | Low (fixed K tasks) | None (can't interpolate) | No |
| State vector | Medium (spatial goals) | Continuous (nearby goals) | Yes (relabel with s') |
| Image | High (visual goals) | Good (visual similarity) | Yes (relabel with current image) |
| Language | Highest (any concept) | Compositional (new combos) | Partial (need captioner) |

Language = The Ultimate Task Interface

State vectors can express "reach position X." Images can express "make it look like this." But only language can express: "Put the heavy things on the bottom shelf and the fragile things on the top shelf, but keep the chemicals away from food." Language allows arbitrary compositional constraints that no fixed-dimensional goal vector can represent.

Connection to Vision-Language-Action Models (VLAs)

Modern VLAs like RT-2, Octo, and OpenVLA can be read as language-conditioned, goal-conditioned policies (in practice they are trained largely by imitation on demonstrations rather than by RL). The architecture is:

VLA Architecture (= Language-GCRL)

$$a_t = \text{VLA}(\text{image}_t, \text{language\_instruction}) = \pi_\theta(a \mid s_t, l)$$ where $s_t$ = image, $l$ = instruction

Same GCRL framework. "Goal" = language instruction. "State" = camera image.

Language embedding space: similar instructions cluster together

HER with Language Goals is Hard

With state goals: relabel g' = sT (trivial). With language goals: what language instruction does sT correspond to? You need a captioner — a model that looks at the achieved state and generates a description. "The robot's arm moved to the middle of the table" → relabel with "move arm to center." This is called language HER (LHER) and requires a separate language grounding model.

Chapter 10

Multi-Task Challenges

Multi-task RL isn't a free lunch. Sharing one network across many tasks creates optimization conflicts. Here are the four core problems and their solutions.

Problem 1: Negative Transfer

Definition
Negative Transfer

When training on task B actively hurts performance on task A. This happens when the optimal features for A and B conflict — the network can't represent both well simultaneously. Example: "push left" and "push right" need opposite motor patterns in the last layer but the same features in early layers.

Problem 2: Task Interference (Gradient Conflict)

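Even without negative transfer, tasks may have conflicting gradients. If task A's gradient points north and task B's points south with equal magnitude, the naive sum is zero: no progress on either task.

Gradient Conflict

$$g_A = \nabla_\theta J_A, \qquad g_B = \nabla_\theta J_B$$

Conflict: $g_A \cdot g_B < 0$ (negative cosine similarity)

Naive update: $\theta \leftarrow \theta + \alpha (g_A + g_B)$. In the extreme case $g_B \approx -g_A$ the sum vanishes and θ does not move (no learning!). With partial conflict the opposing components still cancel, which slows both tasks.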

Solution: PCGrad (Projecting Conflicting Gradients)

PCGrad (Yu et al., 2020) detects conflicting gradients and projects away the conflicting component:

PCGrad Algorithm

If $g_A \cdot g_B < 0$:

$$g_A' = g_A - \frac{g_A \cdot g_B}{\lVert g_B \rVert^2} \, g_B$$

Remove the component of $g_A$ that points against $g_B$

After projection, gA' is orthogonal to gB — it helps task A without hurting task B.
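The projection is a few lines of vector arithmetic. A minimal two-task sketch on plain Python lists; the function names and the example gradients are illustrative, and each gradient is projected against the other task's original gradient.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcgrad_project(g_a, g_b):
    """Remove from g_a its component along g_b, but only if they conflict."""
    d = dot(g_a, g_b)
    if d >= 0:                     # no conflict: leave g_a untouched
        return list(g_a)
    scale = d / dot(g_b, g_b)      # (g_a . g_b) / ||g_b||^2
    return [a - scale * b for a, b in zip(g_a, g_b)]

g_a = [1.0, 0.5]
g_b = [-1.0, 0.5]                  # g_a . g_b = -0.75, so the tasks conflict

g_a_proj = pcgrad_project(g_a, g_b)
g_b_proj = pcgrad_project(g_b, g_a)
naive = [x + y for x, y in zip(g_a, g_b)]
combined = [x + y for x, y in zip(g_a_proj, g_b_proj)]

print("projected g_A:", g_a_proj, " dot with g_B:", dot(g_a_proj, g_b))
print("naive sum:    ", naive)
print("PCGrad sum:   ", combined)
```

In this example the opposing horizontal components cancel either way, but the projected gradients keep a larger step along the direction both tasks agree on.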

Gradient conflict and PCGrad projection

Problem 3: Task Balancing

Some tasks are harder than others. Under uniform sampling, easy tasks converge quickly and then contribute only small gradients, which is harmless. The real issue is the hard tasks: their large, noisy gradients overwhelm the stable gradients from easy tasks.

Task Balancing Strategies

Uniform sampling: p(task_i) = 1/K. Simple, often works.
Performance-based: Sample harder tasks more often (lower current reward = higher sampling probability).
Gradient-based: Weight tasks inversely to gradient magnitude (GradNorm, Chen et al. 2018).
Uncertainty-based: Sample tasks where the policy is most uncertain (active learning).
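Performance-based sampling can be sketched as a softmax over how far each task is from being solved. The softmax form, the temperature, and the use of success rate as the performance measure are assumptions made for illustration; other monotone mappings would serve equally well.

```python
import math
import random

def performance_based_probs(success_rates, temperature=0.5):
    """Softmax over (1 - success rate): lower success means sampled more."""
    scores = [(1.0 - s) / temperature for s in success_rates]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

success = [0.95, 0.60, 0.10]   # current per-task success rates (made up)
probs = performance_based_probs(success)

rng = random.Random(0)
task = rng.choices(range(len(success)), weights=probs, k=1)[0]
print("sampling probabilities:", [round(p, 3) for p in probs])
print("next task to train on:", task)
```

A high temperature recovers near-uniform sampling, while a low one concentrates almost all training on the hardest task, which risks letting the easy tasks degrade.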

Problem 4: Catastrophic Forgetting

Definition
Catastrophic Forgetting

When improving on a new task causes the network to "forget" how to do previously learned tasks. The new task's gradient updates overwrite weights that were important for old tasks. This is especially severe when tasks are trained sequentially rather than simultaneously.

Mitigations: (1) Experience replay — mix old task data into training. (2) EWC (Elastic Weight Consolidation) — penalize changes to weights important for old tasks. (3) Multi-head architectures — separate output heads per task, shared backbone.

| Problem | Symptom | Solution |
|---|---|---|
| Negative transfer | Multi-task worse than specialist | Task grouping, modular networks |
| Gradient conflict | Training stalls | PCGrad, gradient surgery |
| Catastrophic forgetting | Old tasks degrade over time | EWC, replay, multi-head |
| Task imbalance | Easy tasks converge, hard tasks don't | Performance-weighted sampling |

Chapter 11

Summary & Cheat Sheet

The Big Picture

We started with a simple question: how do you scale RL from one task to many? The answer unfolded in three layers:

Multi-task RL: Share one policy across multiple MDPs via a task identifier. Amortize learning shared structure.

Goal-conditioned RL: Make the task identifier a goal state. Get continuous task spaces and zero-shot generalization for free.

HER: Solve the sparse reward problem by relabeling failed trajectories as successes for a different goal. Turn every trajectory into useful training data.

The Key Breakthrough: HER

HER is not just an engineering trick — it's a fundamental shift in how we think about RL data. Traditional RL: "this trajectory failed, throw it away." HER: "this trajectory succeeded at something — figure out what." It's the same transition viewed through a different lens. This principle (relabeling data post-hoc) appears throughout modern ML: contrastive learning, data augmentation, hindsight relabeling in language models.

Comparison Table

| Aspect | Single-Task RL | Multi-Task RL | Goal-Conditioned RL |
|---|---|---|---|
| Policy | $\pi(a \mid s)$ | $\pi(a \mid s, z)$ | $\pi(a \mid s, g)$ |
| Task space | One task | Finite set (K tasks) | Continuous (any goal) |
| New task cost | Full retraining | Add to training set | Zero-shot (just change g) |
| Reward | $r(s,a)$ | $r_i(s,a)$ | $r(s,a,g) = \mathbb{1}[s' \in G]$ |
| Data efficiency | Low (no sharing) | Medium (shared features) | High (HER augmentation) |
| Generalization | None | Within task set | To unseen goals |
| Limitation | One skill only | Fixed task catalog | Sparse reward problem |

Key Equations Cheat Sheet

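Multi-Task Objective: $J_{MT}(\theta) = \sum_i w_i \, \mathbb{E}_{\tau \sim p_\theta^{(i)}(\tau)}\big[ \sum_t r_i(s_t, a_t) \big]$
Goal-Conditioned Bellman: $Q(s,a,g) = r(s,a,g) + \gamma \, \mathbb{E}_{s'}\big[ \max_{a'} Q(s',a',g) \big]$
HER Relabeling: $(s_t, a_t, r(s_t,a_t,g), s_{t+1}, g) \rightarrow (s_t, a_t, r(s_t,a_t,g'), s_{t+1}, g')$ with $g'$ a state reached later in the episode; $r = 1$ when $\lVert s_{t+1} - g' \rVert < \epsilon$
PCGrad Projection: $g_A' = g_A - \frac{g_A \cdot g_B}{\lVert g_B \rVert^2} \, g_B$ if $g_A \cdot g_B < 0$
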
Chapter 12

Connections & What's Next

🔗 Connection
VLAs = Language-Conditioned GCRL at Scale

| | Goal-Conditioned RL | Vision-Language-Action Model |
|---|---|---|
| Policy | $\pi(a \mid s, g)$ | $\pi(a \mid \text{image}, \text{language})$ |
| State | robot proprioception | camera image |
| Goal | target state vector | language instruction |
| Reward | sparse binary | success classifier |

RT-2, Octo, and OpenVLA are goal-conditioned policies where the "goal" is expressed in language and the "state" is a camera image. The same framework we built in this lecture — conditioning on a task specification, learning shared representations, generalizing to new tasks — is exactly what makes VLAs work. The scale changed (billion-parameter language models as goal encoders), but the principle is identical.

🔗 Connection
Multi-Task Behavioral Cloning

If you have demonstrations for many tasks, you can do multi-task imitation learning instead of multi-task RL. The policy π(a|s,z) is trained via supervised learning on (s, a, z) triples from demonstrations. This is exactly what large-scale robot learning (BC-Z, RT-1) does. Advantage: no reward engineering needed. Disadvantage: can't exceed demonstrator quality without RL fine-tuning.

🔗 Connection
HER + Offline RL = Free Goal-Conditioned Data

Offline RL trains policies from fixed datasets without any environment interaction. Apply HER to an offline dataset: every trajectory (even failed ones) becomes training data for goals it actually reached. This makes offline goal-conditioned RL surprisingly effective: you can learn goal-reaching policies from random exploration data. (See, e.g., WGCSL, Yang et al., 2022; GoFAR, Ma et al., 2022, is a relabeling-free alternative for the same setting.)

The Progression

| Paradigm | Task Specification | Key Paper | Modern Instance |
|---|---|---|---|
| Single-task RL | Hard-coded reward | DQN (2015) | Game-playing agents |
| Multi-task RL | Task ID / one-hot | Distral (2017) | MT-Opt |
| Goal-conditioned RL | State vector | UVFA (2015), HER (2017) | Robotic reaching |
| Language-conditioned RL | Natural language | Shridhar et al. (2022) | RT-2, Octo, OpenVLA |

The Arc of Progress

From fixed reward functions → to parameterized goals → to language instructions. Each step makes the task space richer and the policy more general. The endpoint is a policy that takes arbitrary language as its goal and can do anything expressible in natural language. We're not there yet — but the framework is in place.