Capuano et al., Chapter 3

Robot Reinforcement Learning

From MDPs to real-world policy learning — how robots learn from trial and error, and why it's harder than it sounds.

Prerequisites: Chapter 2 (Sensors & Actuators). You know what a robot arm is — now we make it learn.
10 chapters · 4+ simulations · 10 quizzes

Chapter 0: Why RL for Robotics?

Imagine you want a robot arm to pick up a mug. You could hard-code the trajectory: move to (x, y, z), close gripper, lift. But what happens when the mug is 3 cm to the left? Or wet? Or a different shape? Hard-coding shatters. You need the robot to learn.

Reinforcement learning gives us a framework where the robot discovers behavior through trial and error. It tries an action, observes what happens, collects a reward (or punishment), and adjusts. Over thousands of trials, it converges on a policy — a mapping from observations to actions — that solves the task.

The core loop: The robot (agent) lives inside a world (environment). At each time step it sees a state, picks an action, gets a reward, and lands in a new state. This loop is the heartbeat of all RL.

For a robotic arm, the state might be the seven joint angles plus the positions of objects on the table. The actions are the torques applied to each joint motor. The reward is +1 when the mug is lifted, 0 otherwise. The agent's job: figure out which torques, applied in which sequence of states, lead to that +1.
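The loop described above can be sketched in a few lines. This is a toy illustration, not a robot simulator: `ReachEnv1D` is a hypothetical one-dimensional stand-in for the arm (the state is a cell index, the action is a step of -1, 0, or +1, and the reward is +1 on reaching the target), and the policy here is purely random.

```python
import random

class ReachEnv1D:
    """Hypothetical toy environment: move along a line until the target cell is reached."""

    def __init__(self, size=10, target=7, max_steps=50):
        self.size, self.target, self.max_steps = size, target, max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos                      # initial state s_0

    def step(self, action):
        # action is -1, 0, or +1; the position is clipped to the track
        self.pos = min(max(self.pos + action, 0), self.size - 1)
        self.t += 1
        reached = self.pos == self.target
        reward = 1.0 if reached else 0.0     # sparse reward, as in the mug example
        done = reached or self.t >= self.max_steps
        return self.pos, reward, done

def run_episode(env, policy):
    """One pass through the loop: observe state, act, receive reward, move on."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
    return total

random.seed(0)
random_policy = lambda s: random.choice([-1, 0, 1])
returns = [run_episode(ReachEnv1D(), random_policy) for _ in range(100)]
print(sum(returns) / len(returns))   # fraction of episodes that reached the target
```

Swapping `random_policy` for a learned policy is the whole point of the rest of the chapter; the loop itself stays the same.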

This is fundamentally different from supervised learning. There is no dataset of (state, correct_action) pairs. The robot must explore the space of possibilities and learn from consequences. That's what makes RL both powerful and painfully sample-inefficient.

[Interactive demo: Grid World. A robot arm (blue) must reach the target (green). The demo contrasts random actions with a policy trained by Q-learning.]

[Figure: the RL loop. State $s_t$ (joint angles, object positions) → Action $a_t$ (joint torques) → Environment (physics simulation or real world) → Reward $r_t$ (+1 for mug lifted, 0 otherwise) → repeat.]

Why is reinforcement learning preferred over hard-coded trajectories for robotic manipulation?

Chapter 1: The MDP Formulation

The informal "loop" from Chapter 0 needs mathematical teeth. We formalize it as a Markov Decision Process (MDP). The tutorial by Capuano et al. defines it with seven components:

M = ⟨ S, A, D, r, γ, ρ, T ⟩

Let's unpack each element in a robotics context. Suppose we're controlling a 7-DOF robot arm reaching for a target.

S: State space. The set of all possible states. For our arm: $s = (q_1, \dots, q_7, \dot{q}_1, \dots, \dot{q}_7, x_{\text{obj}}, y_{\text{obj}}, z_{\text{obj}})$. That's 7 joint angles + 7 joint velocities + a 3D object position, giving a 17-dimensional continuous space.
A: Action space. All possible actions. For torque control: $a = (\tau_1, \dots, \tau_7) \in \mathbb{R}^7$, each torque bounded by the motor's physical limits. For position control: $a = (\Delta q_1, \dots, \Delta q_7)$, the desired joint position changes.
D: Dynamics. $D(s' \mid s, a)$ is the probability of transitioning to state s' given state s and action a. In simulation this is the physics engine. In the real world, it's nature. The Markov property says: the future state depends only on the current state and action, not on history.
r: Reward function. $r(s, a, s') \in \mathbb{R}$ is a scalar signal telling the agent how good the transition was. Example: $r = -\lVert p_{\text{ee}} - p_{\text{target}} \rVert_2$ (negative Euclidean distance from end-effector to target).
γ: Discount factor. $\gamma \in [0, 1)$. Determines how much future rewards are worth compared to immediate ones. $\gamma = 0.99$ means a reward 100 steps from now is worth $0.99^{100} \approx 0.37$ of a reward right now.
ρ: Initial state distribution. $\rho(s_0)$ specifies where episodes start. For our arm, this might be "arm at home position, object placed randomly on the table."
T: Horizon. How long each episode lasts. T = 100 means the robot gets 100 time steps per attempt. Could be infinite for continuing tasks.

The agent's goal: find a policy π(a|s) that maximizes the expected sum of discounted rewards over the horizon. That sum is called the return:

$$G(\tau) = \sum_{t=0}^{T} \gamma^t \, r(s_t, a_t, s_{t+1})$$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory. The objective is:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ G(\tau) \right]$$

Worked example. A 3-joint arm has states $s = (q_1, q_2, q_3)$ and actions $a = (\tau_1, \tau_2, \tau_3)$. With γ = 0.9 and T = 3 steps, if the rewards are $r_0 = -2$, $r_1 = -1$, $r_2 = +5$:

$$G = \gamma^0(-2) + \gamma^1(-1) + \gamma^2(+5) = -2 + 0.9(-1) + 0.81(5) = -2 - 0.9 + 4.05 = 1.15$$

Even though early rewards were negative, the large final reward (reaching the target) made the total return positive.
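The arithmetic in the worked example can be checked with a short helper. This is a minimal sketch; `discounted_return` is an illustrative name, not a function from the tutorial.

```python
def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * r_t for a finite list of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

G = discounted_return([-2.0, -1.0, 5.0], gamma=0.9)
print(round(G, 2))              # 1.15, matching the worked example
print(round(0.99 ** 100, 3))    # 0.366: weight of a reward 100 steps ahead when gamma = 0.99
```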
In the MDP tuple M = ⟨S, A, D, r, γ, ρ, T⟩, what does D represent?

Chapter 2: Policies & Value Functions

A policy π(a|s) maps states to actions (or distributions over actions). It's the robot's "brain" — the rule it follows. A deterministic policy picks one action per state: a = π(s). A stochastic policy samples: a ~ π(·|s).

But how do we know if a policy is good? We need to measure the expected total reward from any given state. That's the value function.

State-Value Function $V^\pi(s)$

The expected return starting from state s and following policy π thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k \geq 0} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$$

Action-Value Function $Q^\pi(s, a)$

The expected return starting from state s, taking action a first, then following π:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$

V tells you how good a state is. Q tells you how good a state-action pair is. If you know the optimal action-value function $Q^*$ for every (s, a), you can extract an optimal policy by always picking the action with the highest value: $\pi^*(s) = \arg\max_a Q^*(s, a)$.

Bellman Equations

Value functions satisfy recursive relationships. The value of a state equals the immediate reward plus the discounted value of the next state:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} D(s' \mid s, a) \left[ r(s, a, s') + \gamma V^\pi(s') \right]$$

This is the Bellman equation. It says: the value of being in state s under policy π is the expected immediate reward plus the discounted value of wherever you end up. It's the fundamental building block of almost all RL algorithms.

Worked example: 3-state robot. States: {Start, Middle, Goal}. Policy: always go right. Transitions are deterministic. γ = 0.9.

r(Start, right, Middle) = -1,   r(Middle, right, Goal) = +10,   V(Goal) = 0

Working backward:
V(Middle) = r(Middle, right, Goal) + γ · V(Goal) = 10 + 0.9(0) = 10
V(Start) = r(Start, right, Middle) + γ · V(Middle) = -1 + 0.9(10) = 8

Even though the first step has negative reward, starting at "Start" is still valuable (V = 8) because the goal ahead is lucrative.
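The backward calculation above is an instance of repeated Bellman backups. A minimal sketch for the same three-state chain, assuming the deterministic "always go right" policy from the example (the dictionary layout is just one convenient encoding):

```python
GAMMA = 0.9

# Deterministic chain under the "always go right" policy:
# state -> (next_state, reward). Goal is terminal, so it has no entry.
chain = {"Start": ("Middle", -1.0), "Middle": ("Goal", 10.0)}

V = {"Start": 0.0, "Middle": 0.0, "Goal": 0.0}
for _ in range(10):  # repeated Bellman backups; two sweeps already suffice on this chain
    for s, (s_next, r) in chain.items():
        V[s] = r + GAMMA * V[s_next]

print(V)  # V(Middle) = 10 and V(Start) = 8, as in the worked example
```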

For the optimal policy π*, the Bellman optimality equation replaces the sum over actions with a max:

$$V^*(s) = \max_{a} \sum_{s'} D(s' \mid s, a) \left[ r(s, a, s') + \gamma V^*(s') \right]$$
$$Q^*(s, a) = \sum_{s'} D(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$$
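When the dynamics are known, the optimality equation can be solved by value iteration: apply the max-backup repeatedly until the values stop changing. The sketch below extends the three-state chain with two extra actions ("stay" and "left"); these actions and their rewards are assumptions added purely for illustration.

```python
GAMMA = 0.9
STATES = ["Start", "Middle", "Goal"]

# Hypothetical deterministic model: (state, action) -> (next_state, reward)
MODEL = {
    ("Start", "right"): ("Middle", -1.0),
    ("Start", "stay"): ("Start", 0.0),
    ("Middle", "right"): ("Goal", 10.0),
    ("Middle", "left"): ("Start", -1.0),
}

def actions(s):
    """Actions available in state s (empty for the terminal Goal state)."""
    return [a for (s0, a) in MODEL if s0 == s]

def q_value(V, s, a):
    """One-step lookahead: r + gamma * V(s') for a deterministic transition."""
    s_next, r = MODEL[(s, a)]
    return r + GAMMA * V[s_next]

V = {s: 0.0 for s in STATES}
for _ in range(100):
    for s in STATES:
        if actions(s):                       # Goal is terminal: V stays 0
            V[s] = max(q_value(V, s, a) for a in actions(s))

# Greedy policy with respect to the converged values
policy = {s: max(actions(s), key=lambda a: q_value(V, s, a)) for s in STATES if actions(s)}
print(V, policy)
```

Even with the free "stay" action available, going right is optimal from Start: staying yields 0.9 · V(Start), which can never exceed the value of moving toward the goal.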
In the worked example, why is V(Start) = 8 even though the immediate reward is -1?

Chapter 3: Q-Learning & DQN

We now know what value functions are. But how do we learn them without knowing the dynamics D? Enter Q-learning — one of the foundational model-free RL algorithms.

The idea: maintain a table Q(s, a) for every state-action pair. After each transition (s, a, r, s'), update the table toward the Bellman optimality target:

Q(s, a) ← Q(s, a) + α · δ

where the TD error (temporal-difference error) is:

$$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$$

Read it this way: δ is the surprise. The target $r + \gamma \max_{a'} Q(s', a')$ is what we should have expected. Q(s, a) is what we did expect. The difference, scaled by learning rate α, nudges Q toward the target.

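TD error = surprise. When δ > 0, the transition was better than expected, so we increase Q(s, a). When δ < 0, it was worse, so we decrease Q(s, a). In the tabular setting, if every state-action pair keeps being visited and the learning rate α decays appropriately, Q converges to Q*, and the greedy policy with respect to Q* is optimal.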
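A compact tabular Q-learning sketch on a 5×5 grid (start top-left, goal bottom-right) shows the update in context. The small step cost of -0.01, the fixed ε = 0.1, and the episode cap of 200 steps are assumptions chosen for this illustration.

```python
import random

N = 5                                        # 5x5 grid, start top-left, goal bottom-right
GOAL = (N - 1, N - 1)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, a):
    """Deterministic grid dynamics; moves into a wall leave the state unchanged."""
    r, c = state
    dr, dc = MOVES[a]
    nxt = (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))
    done = nxt == GOAL
    return nxt, (1.0 if done else -0.01), done   # small step cost is an assumption

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {((r, c), a): 0.0 for r in range(N) for c in range(N) for a in range(4)}
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(200):                 # cap episode length
            if rng.random() < epsilon:
                a = rng.randrange(4)         # explore
            else:
                a = max(range(4), key=lambda b: Q[(s, b)])   # exploit
            s2, reward, done = step(s, a)
            # Bellman optimality target; no bootstrapping from the terminal state
            target = reward if done else reward + gamma * max(Q[(s2, b)] for b in range(4))
            Q[(s, a)] += alpha * (target - Q[(s, a)])        # Q <- Q + alpha * delta
            s = s2
            if done:
                break
    return Q

def greedy_path_length(Q, limit=50):
    """Follow the greedy policy from the start and count steps (capped at `limit`)."""
    s, steps = (0, 0), 0
    while s != GOAL and steps < limit:
        a = max(range(4), key=lambda b: Q[(s, b)])
        s, _, _ = step(s, a)
        steps += 1
    return steps

Q = train()
print(greedy_path_length(Q))   # greedy steps taken (8 is the shortest path; 50 means the cap was hit)
```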
Worked example. State s = "joint at 45°", action a = "apply +2 Nm torque". Current Q(s, a) = 3.0. After taking the action: reward r = -1, next state s' where $\max_{a'} Q(s', a') = 6.0$. γ = 0.9, α = 0.1.

δ = (-1) + 0.9(6.0) - 3.0 = -1 + 5.4 - 3.0 = 1.4
Q(s, a) ← 3.0 + 0.1(1.4) = 3.14

The transition was better than expected (δ > 0), so Q increased slightly.
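The same numbers fall out of a one-line update function. A minimal sketch (`td_update` is an illustrative name):

```python
def td_update(q_sa, reward, max_q_next, gamma=0.9, alpha=0.1):
    """One tabular Q-learning update. Returns (td_error, new_q)."""
    delta = reward + gamma * max_q_next - q_sa
    return delta, q_sa + alpha * delta

delta, new_q = td_update(q_sa=3.0, reward=-1.0, max_q_next=6.0)
print(round(delta, 2), round(new_q, 2))   # 1.4 3.14
```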

From Tables to Neural Networks: DQN

Tabular Q-learning stores one value per (s, a) pair. For a 7-DOF arm with continuous states, the table would be infinite. Deep Q-Network (DQN) replaces the table with a neural network $Q_\theta(s, a)$ that generalizes across states.

Two key innovations make DQN stable:

Experience replay. Store transitions (s, a, r, s') in a buffer. Sample mini-batches randomly for training. This breaks temporal correlations that destabilize SGD.
Target network. Keep a separate, slowly updated copy $Q_{\theta'}$ for computing targets. Update rule: minimize $\left( r + \gamma \max_{a'} Q_{\theta'}(s', a') - Q_\theta(s, a) \right)^2$. Periodically copy θ → θ'. This mitigates the "moving target" problem.

DQN Training Algorithm

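# DQN training loop (PyTorch sketch).
# Assumes `env` is a Gymnasium-style environment with a discrete action space
# and a flat observation vector; it is not defined here.
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

def make_q_network():
    # Small MLP: [B, state_dim] -> [B, action_dim]
    return nn.Sequential(
        nn.Linear(state_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, action_dim),
    )

# Initialize Q-network and target network with identical weights
Q = make_q_network()
Q_target = make_q_network()
Q_target.load_state_dict(Q.state_dict())

replay_buffer = deque(maxlen=100_000)
gamma, lr, epsilon, batch_size = 0.99, 1e-3, 1.0, 64
optimizer = torch.optim.Adam(Q.parameters(), lr=lr)

for episode in range(10_000):
    state, _ = env.reset()                # state: [state_dim]
    for step in range(200):
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                s_t = torch.as_tensor(state, dtype=torch.float32)
                action = Q(s_t).argmax().item()

        next_state, reward, terminated, truncated, _ = env.step(action)
        # Store `terminated` only: time-limit truncation should still bootstrap
        replay_buffer.append((state, action, reward, next_state, terminated))

        # Sample a mini-batch and take one gradient step
        if len(replay_buffer) >= batch_size:
            batch = random.sample(replay_buffer, batch_size)
            s, a, r, s2, d = zip(*batch)
            s = torch.as_tensor(np.stack(s), dtype=torch.float32)    # [B, state_dim]
            a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)   # [B, 1]
            r = torch.as_tensor(r, dtype=torch.float32)              # [B]
            s2 = torch.as_tensor(np.stack(s2), dtype=torch.float32)  # [B, state_dim]
            d = torch.as_tensor(d, dtype=torch.float32)              # 1.0 if terminal

            with torch.no_grad():  # no gradient flows through the target
                target = r + gamma * Q_target(s2).max(dim=1).values * (1.0 - d)
            pred = Q(s).gather(1, a).squeeze(1)   # Q(s, a), shape [B]
            loss = ((pred - target) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        state = next_state
        if terminated or truncated:
            break

    # Hard-update the target network every 10 episodes
    if episode % 10 == 0:
        Q_target.load_state_dict(Q.state_dict())
    epsilon = max(0.01, epsilon * 0.995)  # decay exploration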
The epsilon schedule matters. Start with ε = 1.0 (pure exploration) and decay to 0.01 (mostly exploitation). Decay too fast and the agent never explores enough. Decay too slowly and it wastes time on random actions. With a per-episode decay factor of 0.995, ε halves roughly every 138 episodes and reaches the 0.01 floor after about 920 episodes.
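The schedule is easy to inspect numerically. A minimal sketch, assuming the same multiplicative decay and floor as the training loop (`epsilon_at` is an illustrative helper):

```python
import math

def epsilon_at(episode, start=1.0, floor=0.01, decay=0.995):
    """Exploration rate after `episode` multiplicative decays, clipped at the floor."""
    return max(floor, start * decay ** episode)

# Number of episodes until the floor is hit: solve decay**n = floor / start
episodes_to_floor = math.ceil(math.log(0.01 / 1.0) / math.log(0.995))
print(episodes_to_floor)              # 919
print(round(epsilon_at(138), 2))      # 0.5
```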
[Interactive demo: Q-Value Heatmap. Q-values, initialized to 0, update as the agent learns in a 5×5 grid over 200 episodes of Q-learning. Brighter cells have higher value; the goal is the bottom-right cell.]

What does the TD error δ represent?

Chapter 4: Continuous Control — DDPG & SAC

DQN works beautifully for discrete actions (go left, go right, fire). But robot joints don't work that way. A torque is a continuous number. You can't enumerate all possible values of τ ∈ [-10, +10] Nm and take a max.

Two families of algorithms solve this: deterministic policy gradients and maximum entropy RL.

DDPG: Deterministic Policy Gradient

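Instead of learning Q and then maximizing over actions, learn both a Q-network (critic) $Q_\phi(s, a)$ and a deterministic policy network (actor) $\mu_\theta(s)$ that directly outputs the action.

The actor is trained by ascending the gradient of Q with respect to the policy parameters:

$$\nabla_\theta J \approx \mathbb{E}_{s \sim \mathcal{D}}\left[ \nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \cdot \nabla_\theta \mu_\theta(s) \right]$$

Here $\mathcal{D}$ denotes the replay buffer of stored transitions, not the dynamics D from the MDP tuple. Read the gradient as: "adjust the actor's parameters to make it output actions that increase the critic's Q-value." The critic is trained with the standard TD error, using a target actor $\mu_{\theta'}$ and target critic $Q_{\phi'}$.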

DDPG = DQN for continuous actions. It inherits experience replay and target networks from DQN, adds an actor network that replaces the argmax, and uses Ornstein-Uhlenbeck noise for exploration.
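The Ornstein-Uhlenbeck exploration noise mentioned above produces temporally correlated perturbations, which suit torque-controlled joints better than independent noise at every step. A minimal sketch: the parameter values θ = 0.15 and σ = 0.2 are commonly used defaults, taken here as assumptions, and the ±10 Nm clipping range follows the torque example earlier in this section.

```python
import random

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process:
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = [mu] * dim

    def sample(self):
        # Mean-reverting drift plus Gaussian diffusion, applied per dimension
        self.x = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)

def noisy_action(mu_s, noise, low=-10.0, high=10.0):
    """Add exploration noise to the actor output mu_theta(s) and clip to torque limits."""
    return [min(max(m + n, low), high) for m, n in zip(mu_s, noise.sample())]

noise = OUNoise(dim=7)
action = noisy_action([0.0] * 7, noise)   # 7 joint torques in Nm
print(action)
```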

SAC: Soft Actor-Critic

DDPG is brittle. Small hyperparameter changes can tank performance. Soft Actor-Critic (SAC) fixes this with a maximum entropy objective:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \pi}\left[ r(s_t, a_t) + \alpha \, H(\pi(\cdot \mid s_t)) \right]$$

where H(π) = -E[log π(a|s)] is the entropy of the policy. The temperature α controls the tradeoff: high α encourages exploration (high entropy), low α favors exploitation (greedy).
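To make the trade-off concrete, consider a one-dimensional Gaussian policy, whose entropy has the closed form ½ log(2πe σ²). The sketch below (helper names are illustrative) shows how a wider policy earns a larger entropy bonus and how α scales that bonus.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian N(mu, sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def soft_reward(reward, sigma, alpha):
    """Per-step term of the maximum entropy objective: r + alpha * H(pi(.|s))."""
    return reward + alpha * gaussian_entropy(sigma)

for sigma in (0.1, 0.5, 1.0):
    print(sigma, round(gaussian_entropy(sigma), 3), round(soft_reward(1.0, sigma, alpha=0.2), 3))
```

Note that differential entropy can be negative for narrow distributions, so with α > 0 a near-deterministic policy is actively penalized.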

The soft Bellman equation incorporates entropy:

$$Q^\pi(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s'}\left[ V^\pi(s') \right]$$
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s, a) - \alpha \log \pi(a \mid s) \right]$$
Why entropy matters for robots. A robot exploring with DDPG might commit to one strategy early and never discover better ones. SAC's entropy bonus keeps the policy "spread out" during training — trying diverse approaches — while still converging to near-optimal behavior. This makes SAC far more robust to hyperparameter choices and random seeds.
SAC = "be as random as possible while still being good." Among all policies that achieve the same expected reward, SAC picks the one with maximum entropy. This isn't just an exploration trick — it leads to more robust policies that transfer better to new situations.

SAC Training Details

SAC maintains five neural networks: two Q-networks (to reduce overestimation), two target Q-networks, and one stochastic policy $\pi_\theta(a \mid s)$. The policy outputs a distribution over actions, typically a squashed Gaussian: sample $a = \tanh(u)$ with $u \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$.
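Sampling from a squashed Gaussian also requires correcting the log-probability for the tanh transform, since log π(a|s) appears in the SAC targets. A one-dimensional sketch using the change-of-variables formula log π(a) = log N(u; μ, σ²) − log(1 − tanh²(u)); the small constant inside the logarithm is a numerical-stability assumption.

```python
import math
import random

def sample_squashed_gaussian(mu, sigma, rng):
    """Sample a = tanh(u), u ~ N(mu, sigma^2), and return (a, log pi(a))."""
    u = rng.gauss(mu, sigma)
    a = math.tanh(u)
    # Log-density of the pre-squash Gaussian sample
    log_prob_u = -0.5 * ((u - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)
    # Change of variables: subtract log |da/du| = log(1 - tanh(u)^2)
    log_prob_a = log_prob_u - math.log(1.0 - a * a + 1e-6)
    return a, log_prob_a

rng = random.Random(0)
a, log_prob = sample_squashed_gaussian(mu=0.0, sigma=1.0, rng=rng)
print(a, log_prob)   # a always lies strictly inside (-1, 1)
```

In practice the action in (-1, 1) is then rescaled to the actuator's torque or position limits.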

The soft Q-update uses the minimum of the two Q-networks (clipped double-Q trick):

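$$y = r + \gamma \left( \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}) - \alpha \log \pi_\theta(\tilde{a} \mid s') \right), \qquad \tilde{a} \sim \pi_\theta(\cdot \mid s')$$

The temperature α is auto-tuned to maintain a target entropy $\bar{H}$ (typically $-\dim(A)$). In practice this means the entropy weight no longer has to be tuned by hand: it adjusts itself during training, and the target entropy heuristic is usually left at its default.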

Worked example: soft Q-target.

State s, action a. Reward r = 2.0. Next state s'. γ = 0.99. α = 0.2.
Sample next action ã ~ π(·|s'). log π(ã|s') = -1.5. Q1(s', ã) = 8.0. Q2(s', ã) = 8.3.

y = 2.0 + 0.99 · (min(8.0, 8.3) - 0.2 · (-1.5))
  = 2.0 + 0.99 · (8.0 + 0.3)
  = 2.0 + 0.99 · 8.3
  = 2.0 + 8.217 = 10.217

The entropy bonus (+0.3) adds value for being in a state where diverse actions are possible. This encourages the policy to keep its options open.
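The worked example maps directly onto a small function. A minimal sketch: the `done` flag, which disables bootstrapping at terminal states, is an addition not shown in the formula above.

```python
def soft_q_target(reward, q1_next, q2_next, log_prob_next, gamma=0.99, alpha=0.2, done=False):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')); no bootstrap if terminal."""
    if done:
        return reward
    return reward + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)

y = soft_q_target(2.0, 8.0, 8.3, log_prob_next=-1.5)
print(round(y, 3))   # 10.217
```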
What problem does SAC's maximum entropy objective solve compared to DDPG?

Chapter 5: Real-World RL Challenges

RL algorithms look elegant on paper. In simulation, they achieve superhuman Atari scores. But putting them on a real robot reveals brutal problems that no amount of algorithmic cleverness fully solves.

Challenge 1: Safety

RL learns by exploring — and exploration means trying bad actions. A simulated robot can crash into a wall 10,000 times at zero cost. A real robot arm flailing random torques can destroy itself, its surroundings, or hurt a human. Every exploratory action on a real robot carries physical risk.

The exploration-safety tension. RL needs to try new things to learn. But on a real robot, "new things" might mean slamming the gripper into the table at full speed. Constraining exploration too much prevents learning; too little risks damage. There is no clean solution.

Challenge 2: Sample Efficiency

SAC on a simulated reaching task might need 1 million environment steps to converge. At 10 Hz control, that's 100,000 seconds ≈ 28 hours of continuous operation. Factor in resets, failures, and human supervision, and a single training run can take weeks.
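A back-of-the-envelope calculator makes the cost of resets visible. The episode length of 100 steps and the 45-second reset are assumptions for illustration (the reset figure sits in the middle of the 30-60 second range quoted later in this section).

```python
def robot_hours(env_steps, control_hz, episode_len=None, reset_seconds=0.0):
    """Wall-clock hours to collect env_steps on one robot, optionally including resets."""
    seconds = env_steps / control_hz
    if episode_len:
        seconds += (env_steps / episode_len) * reset_seconds
    return seconds / 3600.0

print(round(robot_hours(1_000_000, 10), 1))                                     # 27.8
print(round(robot_hours(1_000_000, 10, episode_len=100, reset_seconds=45), 1))  # 152.8
```

Under these assumptions, resets alone account for more than four times the pure interaction time, before counting failures or supervision breaks.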

Compare this to human learning: a child learns to grasp in a few hundred attempts. The gap is enormous, and it stems from the lack of priors, structure, and world models in vanilla RL.

Challenge 3: Environment Resets

RL training assumes episodes: the task resets after each attempt. In simulation, this is a function call. On a real robot, it means a human walking over, picking up the dropped mug, placing it back, and pressing "go." Resets are the silent killer of real-world RL — each one takes 30-60 seconds and requires human labor.

Challenge 4: The Reality Gap

Train in simulation, deploy on real hardware. Sounds ideal. But simulated physics is always wrong: friction coefficients are approximate, contact dynamics are simplified, and sensor noise is undermodeled. A policy that works perfectly in MuJoCo often fails completely on the real robot. This is the sim-to-real gap.

Simulation is a lie, but a useful one. The goal isn't perfect simulation — it's making the policy robust enough that it works despite the sim-to-real mismatch. Domain randomization (next chapter) is the key technique for bridging this gap.

The Numbers Are Sobering

| Challenge | In simulation | On a real robot |
| --- | --- | --- |
| Cost per failure | 0 (free reset) | Potential hardware damage ($$$) |
| Steps per second | 10,000+ (accelerated) | 10-50 (real-time) |
| Reset time | < 1 ms | 30-60 seconds (human labor) |
| Parallelism | 1000+ envs on one GPU | 1 robot = 1 env |
| Time for 1M steps | Minutes | Weeks |

This table explains why virtually all robot RL research starts in simulation. The question is how to transfer what's learned there to the real world.

Which of these is NOT a major challenge for deploying RL on real robots?

Chapter 6: Domain Randomization

If simulation is always wrong, why not make it wrong in many different ways? That's domain randomization (DR). During training in simulation, we randomly vary physical and visual parameters: friction coefficients, object masses, lighting conditions, camera angles, actuator delays. The policy must handle all of them, so it learns to be robust.

The DR philosophy: Instead of trying to make simulation match reality (impossible), make simulation cover a range that includes reality. If the policy works under friction ∈ [0.2, 1.5], and the real friction is 0.8, we're covered.

What Gets Randomized?

Physics parameters: friction, damping, mass, center of mass, actuator gains, joint limits, control delay. Visual parameters: lighting direction, color, texture, camera pose, distractor objects.

Worked example: How DR widens the training distribution.

Without DR: The policy trains with friction μ = 0.5 (simulation default). It learns a grasping force calibrated exactly for μ = 0.5. On the real robot where μ = 0.7, the grasp is too forceful and crushes soft objects.

With DR: μ ~ Uniform(0.2, 1.2) during training. The policy sees μ = 0.3 (slippery, needs strong grip) and μ = 1.1 (sticky, gentle grip suffices). It learns an adaptive strategy that adjusts grip force based on tactile feedback. On the real robot with μ = 0.7, it generalizes.

The training distribution $P_{\text{DR}}(\mu)$ = Uniform(0.2, 1.2) has support [0.2, 1.2], which contains the real value 0.7. The non-DR distribution is a point mass at 0.5, with no margin for error.
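In code, domain randomization amounts to resampling simulator parameters at the start of every episode. A minimal sketch with friction as the only randomized parameter; `sample_sim_params` is a hypothetical helper, and a real setup would pass the result to the simulator's own configuration interface.

```python
import random

def sample_sim_params(rng, randomize=True):
    """Draw physics parameters for one training episode."""
    if not randomize:
        return {"friction": 0.5}                     # fixed simulator default (point mass)
    return {"friction": rng.uniform(0.2, 1.2)}       # domain-randomized

rng = random.Random(0)
frictions = [sample_sim_params(rng)["friction"] for _ in range(1000)]

real_friction = 0.7
covered = min(frictions) <= real_friction <= max(frictions)
print(round(min(frictions), 2), round(max(frictions), 2), covered)
```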

AutoDR and DORAEMON

AutoDR (Automatic Domain Randomization) starts with narrow parameter ranges and progressively widens them as the policy improves. If the policy achieves high reward with friction ∈ [0.4, 0.6], expand to [0.3, 0.7]. This curriculum prevents the policy from being overwhelmed by too much variation early on.
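The widening rule can be sketched as a simple curriculum step. This is a deliberately simplified illustration, not the published algorithm: the success threshold, step size, and hard limits are assumptions, and both bounds are widened together here for brevity.

```python
def widen_range(low, high, success_rate, threshold=0.8, step=0.1, hard_low=0.05, hard_high=2.0):
    """Expand the randomization range when the policy performs well enough."""
    if success_rate >= threshold:
        low = max(hard_low, low - step)
        high = min(hard_high, high + step)
    return low, high

lo, hi = 0.4, 0.6
for success in [0.9, 0.5, 0.85]:        # hypothetical evaluation results
    lo, hi = widen_range(lo, hi, success)
print(round(lo, 2), round(hi, 2))       # 0.2 0.8
```

The middle evaluation (0.5) falls below the threshold, so the range is left unchanged for that round.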

DORAEMON (Domain Randomization via Entropy Maximization) automates the same trade-off without requiring real-world data: it keeps widening the training distribution over simulator parameters, maximizing its entropy, as long as the policy's success rate under that distribution stays above a chosen threshold. The distribution therefore becomes as diverse as the current policy can handle.

The key insight: randomizing uniformly over wide, fixed ranges is wasteful and can make the task too hard to learn. Adaptive DR methods (AutoDR, DORAEMON) grow the randomization only as fast as the policy can cope, so variation is added where the policy is ready for it.
[Interactive demo: Domain Randomization vs. Fixed Simulation. Left: policy trained with fixed friction μ = 0.5 (narrow). Right: policy trained with DR, μ ∈ [0.2, 1.2] (wide). A vertical line marks the real-world friction, which can be resampled.]

What is the core idea behind domain randomization?

Chapter 7: HIL-SERL — Human-in-the-Loop RL

What if we could have the best of both worlds? RL's ability to optimize beyond human performance, combined with human demonstrations to bootstrap learning and avoid dangerous exploration? That's HIL-SERL (Human-in-the-Loop Sample-Efficient Reinforcement Learning).

The Three Ingredients

1. Expert demonstrations in the replay buffer. Before RL training starts, a human teleoperates the robot through 20-50 successful demonstrations. These are stored in the replay buffer alongside RL-collected data. The agent samples from both, getting a head start on what "good" looks like.
2. Learned reward functions. Instead of hand-engineering a reward, train a classifier from the demonstrations: "does this frame look like a successful grasp?" The classifier score becomes the reward signal, providing dense feedback without manual reward shaping.
3. Human interventions during training. If the robot is about to do something dangerous (e.g., pushing objects off the table), the human takes over via teleoperation. The intervention data is added to the replay buffer (it's good data!) and the agent resumes autonomy. This solves the safety problem: the human acts as a guardrail.
The result: HIL-SERL achieves near-perfect performance on real-world tasks like PCB insertion, cable routing, and object handover in just 1-2 hours of total training time. Compare this to pure RL, which might need days or weeks on the same tasks (if it converges at all).

LeRobot RL Training Setup

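The LeRobot framework supports this kind of training. The snippet below is a simplified, illustrative configuration for SAC-based policy learning; the dictionary keys and the train(...) call signature are schematic and may not match the current LeRobot API:

# LeRobot RL training configuration (simplified and schematic, not the exact API)
from lerobot.scripts.train import train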

# Policy: SAC with image observations
policy_config = {
    "type": "sac",
    "actor_lr": 3e-4,
    "critic_lr": 3e-4,
    "alpha_lr": 3e-4,        # entropy temperature (auto-tuned)
    "gamma": 0.99,
    "tau": 0.005,            # target network soft update rate
    "batch_size": 256,
    "replay_buffer_size": 1_000_000,
}

# Load human demos into replay buffer
demo_config = {
    "demo_path": "data/pcb_insertion_demos/",
    "num_demos": 50,
    "demo_sampling_ratio": 0.25,  # 25% of each batch from demos
}

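# Train on the real robot for 10k steps (about 17 minutes of pure interaction
# at 10 Hz; wall-clock time is longer once resets and interventions are included)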
train(
    policy=policy_config,
    env="real_robot_env",
    demos=demo_config,
    total_steps=10_000,
    eval_freq=500,
)
Demo sampling ratio is critical. Setting demo_sampling_ratio = 0.25 means each training batch is 25% expert data and 75% RL-collected data. Too high and the agent never learns beyond the demos. Too low and it wastes the head start. Treat 0.25 as a starting point to tune per task rather than a universal default; RLPD-style methods, which HIL-SERL builds on, sample the demo and online buffers in equal proportion.
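Mixing the two data sources comes down to drawing a fixed share of each batch from the demonstration buffer. A minimal sketch with plain Python lists standing in for replay buffers; `sample_mixed_batch` and the tagged tuples are illustrative, not part of any library.

```python
import random

def sample_mixed_batch(demo_buffer, rl_buffer, batch_size=256, demo_ratio=0.25, rng=random):
    """Build one training batch with a fixed fraction of demonstration transitions."""
    n_demo = min(int(round(batch_size * demo_ratio)), len(demo_buffer))
    n_rl = batch_size - n_demo
    # Demos are sampled without replacement; RL data with replacement so that
    # a small online buffer early in training does not raise an error.
    batch = rng.sample(demo_buffer, n_demo) + rng.choices(rl_buffer, k=n_rl)
    rng.shuffle(batch)
    return batch

demo_buffer = [("demo", i) for i in range(500)]     # transitions from teleoperated demos
rl_buffer = [("rl", i) for i in range(5000)]        # transitions collected by the agent
batch = sample_mixed_batch(demo_buffer, rl_buffer, rng=random.Random(0))

n_demo = sum(1 for source, _ in batch if source == "demo")
print(n_demo, len(batch) - n_demo)   # 64 192
```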
How does HIL-SERL achieve near-perfect performance in just 1-2 hours?

Chapter 8: Reward Design

We've talked about rewards as if they're given. In practice, designing the reward function is one of the hardest parts of robot RL. The reward must capture what you want without encoding what you think the solution should look like.

Dense vs. Sparse Rewards

Sparse: r = +1 when the task is complete, 0 otherwise. "Did you fold the shirt?" The agent gets no guidance until it stumbles onto success, which may take millions of trials.

Dense: $r = -\lVert p_{\text{ee}} - p_{\text{target}} \rVert_2$ at every step. The agent gets continuous feedback. It learns much faster, but you must be careful: a poorly shaped dense reward can lead to unintended behavior.
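For a reaching task, the two reward styles differ by only a few lines. A minimal sketch: the 1 cm success tolerance is an assumption, and positions are plain 3D tuples in meters.

```python
import math

def distance(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def sparse_reward(p_ee, p_target, tol=0.01):
    """+1 only when the end-effector is within `tol` of the target, else 0."""
    return 1.0 if distance(p_ee, p_target) <= tol else 0.0

def dense_reward(p_ee, p_target):
    """Negative distance to the target: informative at every step."""
    return -distance(p_ee, p_target)

target = (0.3, 0.4, 0.0)
print(sparse_reward((0.0, 0.0, 0.0), target), dense_reward((0.0, 0.0, 0.0), target))
print(sparse_reward(target, target), dense_reward(target, target))
```

Far from the target, the sparse reward is identically zero and gives no direction, while the dense reward still tells the agent whether a step moved it closer.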

The reward shaping trap. You want the robot to fold a shirt. You add a dense reward for "fabric area decreasing." The robot learns to crumple the shirt into a ball — area decreased! — instead of folding it. The reward was technically optimized, but the behavior is wrong. This is reward hacking.
[Interactive demo: Dense vs. Sparse Learning Curves. Compares how quickly an agent learns under dense and sparse rewards over 500 simulated training episodes.]

Learned Rewards

Instead of manually specifying the reward, learn it from demonstrations. Train a classifier or regressor on expert data: "Given observation o, how close is this to task completion?" This is the approach used in HIL-SERL (Chapter 7) and in inverse RL methods.

The spectrum of reward design:
Sparse → hard to learn, but hard to hack
Dense hand-shaped → fast learning, but prone to reward hacking
Learned from demonstrations → captures human intent, but requires demo data

Most practical systems use a combination: a sparse task-completion reward plus a learned "progress" reward from demonstrations.
Why can dense reward shaping lead to unintended behaviors?

Chapter 9: Connections

RL for robotics is a powerful paradigm: define a reward, let the robot optimize it through experience. But we've seen the costs: millions of samples, safety risks, reward engineering headaches, and the sim-to-real gap.

These costs motivate a natural question: what if we could skip exploration entirely? What if, instead of letting the robot discover behavior through trial and error, we simply showed it what to do?

RL's fundamental tradeoff: RL can discover novel, superhuman strategies. But it pays for this capability with enormous data requirements and safety concerns. For many practical tasks — folding laundry, inserting a PCB, packing a box — a human can demonstrate the solution in 5 minutes. Why not just imitate that demonstration?

This is behavioral cloning (BC), and more broadly imitation learning, the subject of Chapter 4. Instead of maximizing a reward, BC minimizes the error between the robot's predicted action and the expert's demonstrated action: $\min_\theta \mathbb{E}\left[ \lVert \pi_\theta(o) - a_{\text{expert}} \rVert^2 \right]$.

But BC has its own problems: compounding errors, multimodal action distributions, and the inability to improve beyond the demonstrator. Chapter 4 will show how modern generative models — VAEs, diffusion, flow matching — solve these issues.

Chapter 3 (this chapter): RL. Optimize reward through exploration. Powerful but data-hungry.
Chapter 4 (next): Imitation. Learn from demonstrations. Efficient but limited by demo quality.
Chapter 5 (later): VLAs. Combine language + vision + action for general-purpose robots.
The hybrid approach: The most effective real-world systems (like HIL-SERL from Chapter 7) combine both paradigms: imitation to bootstrap the policy, RL to fine-tune beyond human performance. Neither alone is sufficient for general-purpose robotics.
What is the main advantage of imitation learning (Chapter 4) over RL (this chapter)?