From MDPs to real-world policy learning — how robots learn from trial and error, and why it's harder than it sounds.
Imagine you want a robot arm to pick up a mug. You could hard-code the trajectory: move to (x, y, z), close gripper, lift. But what happens when the mug is 3 cm to the left? Or wet? Or a different shape? Hard-coding shatters. You need the robot to learn.
Reinforcement learning gives us a framework where the robot discovers behavior through trial and error. It tries an action, observes what happens, collects a reward (or punishment), and adjusts. Over thousands of trials, it converges on a policy — a mapping from observations to actions — that solves the task.
For a robotic arm, the state might be the seven joint angles plus the positions of objects on the table. The actions are the torques applied to each joint motor. The reward is +1 when the mug is lifted, 0 otherwise. The agent's job: figure out which torques, applied in which sequence of states, lead to that +1.
This is fundamentally different from supervised learning. There is no dataset of (state, correct_action) pairs. The robot must explore the space of possibilities and learn from consequences. That's what makes RL both powerful and painfully sample-inefficient.
A robot arm (blue) must reach the target (green). Click Step to take random actions, Train to run Q-learning, or Reset to start over. Watch how the policy improves.
The informal "loop" from Chapter 0 needs mathematical teeth. We formalize it as a Markov Decision Process (MDP). The tutorial by Capuano et al. defines it with seven components:
Let's unpack each element in a robotics context. Suppose we're controlling a 7-DOF robot arm reaching for a target.
The agent's goal: find a policy π(a|s) that maximizes the expected sum of discounted rewards over the horizon. That sum is called the return:
where τ = (s0, a0, s1, a1, …) is a trajectory. The objective is:
A policy π(a|s) maps states to actions (or distributions over actions). It's the robot's "brain" — the rule it follows. A deterministic policy picks one action per state: a = π(s). A stochastic policy samples: a ~ π(·|s).
But how do we know if a policy is good? We need to measure the expected total reward from any given state. That's the value function.
The expected return starting from state s and following policy π thereafter:
The expected return starting from state s, taking action a first, then following π:
Value functions satisfy recursive relationships. The value of a state equals the immediate reward plus the discounted value of the next state:
This is the Bellman equation. It says: the value of being in state s under policy π is the expected immediate reward plus the discounted value of wherever you end up. It's the fundamental building block of almost all RL algorithms.
For the optimal policy π*, the Bellman optimality equation replaces the sum over actions with a max:
We now know what value functions are. But how do we learn them without knowing the dynamics D? Enter Q-learning — one of the foundational model-free RL algorithms.
The idea: maintain a table Q(s, a) for every state-action pair. After each transition (s, a, r, s'), update the table toward the Bellman optimality target:
where the TD error (temporal-difference error) is:
Read it this way: δ is the surprise. The target r + γ max Q(s', a') is what we should have expected. Q(s, a) is what we did expect. The difference, scaled by learning rate α, nudges Q toward the correct value.
Tabular Q-learning stores one value per (s, a) pair. For a 7-DOF arm with continuous states, the table would be infinite. Deep Q-Network (DQN) replaces the table with a neural network Qθ(s, a) that generalizes across states.
Two key innovations make DQN stable:
# DQN training loop (PyTorch-style pseudocode) import torch import random from collections import deque # Initialize Q-network and target network Q = QNetwork(state_dim, action_dim) # [B, state_dim] → [B, action_dim] Q_target = QNetwork(state_dim, action_dim) Q_target.load_state_dict(Q.state_dict()) # copy weights replay_buffer = deque(maxlen=100_000) gamma, lr, epsilon = 0.99, 1e-3, 1.0 optimizer = torch.optim.Adam(Q.parameters(), lr=lr) for episode in range(10_000): state = env.reset() # [state_dim] for step in range(200): # Epsilon-greedy action selection if random.random() < epsilon: action = env.action_space.sample() else: action = Q(state).argmax().item() next_state, reward, done, _ = env.step(action) replay_buffer.append((state, action, reward, next_state, done)) # Sample mini-batch and update if len(replay_buffer) >= 64: batch = random.sample(replay_buffer, 64) s, a, r, s2, d = zip(*batch) target = r + gamma * Q_target(s2).max(dim=1).values * (~d) pred = Q(s).gather(1, a) # Q(s, a) loss = ((pred - target) ** 2).mean() optimizer.zero_grad(); loss.backward(); optimizer.step() state = next_state if done: break # Update target network every 10 episodes if episode % 10 == 0: Q_target.load_state_dict(Q.state_dict()) epsilon = max(0.01, epsilon * 0.995) # decay exploration
Watch Q-values update as the agent learns in a 5×5 grid. Brighter = higher value. Click Train to run 200 episodes of Q-learning. The goal is bottom-right.
DQN works beautifully for discrete actions (go left, go right, fire). But robot joints don't work that way. A torque is a continuous number. You can't enumerate all possible values of τ ∈ [-10, +10] Nm and take a max.
Two families of algorithms solve this: deterministic policy gradients and maximum entropy RL.
Instead of learning Q and then maximizing over actions, learn both a Q-network (critic) and a deterministic policy network (actor) μθ(s) that directly outputs the action.
The actor is trained by ascending the gradient of Q with respect to the policy parameters:
Read this as: "adjust the actor's parameters to make it output actions that increase the critic's Q-value." The critic is trained with the standard TD error, using a target actor μθ' and target critic Qφ'.
DDPG is brittle. Small hyperparameter changes can tank performance. Soft Actor-Critic (SAC) fixes this with a maximum entropy objective:
where H(π) = -E[log π(a|s)] is the entropy of the policy. The temperature α controls the tradeoff: high α encourages exploration (high entropy), low α favors exploitation (greedy).
The soft Bellman equation incorporates entropy:
SAC maintains five neural networks: two Q-networks (to reduce overestimation), two target Q-networks, and one stochastic policy πθ(a|s). The policy outputs a distribution over actions, typically a squashed Gaussian: sample a ~ tanh(N(μθ(s), σθ(s)2)).
The soft Q-update uses the minimum of the two Q-networks (clipped double-Q trick):
The temperature α is auto-tuned to maintain a target entropy H̄ (typically -dim(A)). This means SAC has one fewer hyperparameter than DDPG — the entropy weight tunes itself.
RL algorithms look elegant on paper. In simulation, they achieve superhuman Atari scores. But putting them on a real robot reveals brutal problems that no amount of algorithmic cleverness fully solves.
RL learns by exploring — and exploration means trying bad actions. A simulated robot can crash into a wall 10,000 times at zero cost. A real robot arm flailing random torques can destroy itself, its surroundings, or hurt a human. Every exploratory action on a real robot carries physical risk.
SAC on a simulated reaching task might need 1 million environment steps to converge. At 10 Hz control, that's 100,000 seconds ≈ 28 hours of continuous operation. Factor in resets, failures, and human supervision, and a single training run can take weeks.
Compare this to human learning: a child learns to grasp in a few hundred attempts. The gap is enormous, and it's driven by our lack of priors, structure, and world models in vanilla RL.
RL training assumes episodes: the task resets after each attempt. In simulation, this is a function call. On a real robot, it means a human walking over, picking up the dropped mug, placing it back, and pressing "go." Resets are the silent killer of real-world RL — each one takes 30-60 seconds and requires human labor.
Train in simulation, deploy on real hardware. Sounds ideal. But simulated physics is always wrong: friction coefficients are approximate, contact dynamics are simplified, and sensor noise is undermodeled. A policy that works perfectly in MuJoCo often fails completely on the real robot. This is the sim-to-real gap.
| Challenge | In Simulation | On Real Robot |
|---|---|---|
| Cost per failure | 0 (free reset) | Potential hardware damage ($$$) |
| Steps per second | 10,000+ (accelerated) | 10-50 (real-time) |
| Reset time | < 1 ms | 30-60 seconds (human labor) |
| Parallelism | 1000+ envs on one GPU | 1 robot = 1 env |
| 1M steps takes | Minutes | Weeks |
This table explains why virtually all robot RL research starts in simulation. The question is how to transfer what's learned there to the real world.
If simulation is always wrong, why not make it wrong in many different ways? That's domain randomization (DR). During training in simulation, we randomly vary physical and visual parameters: friction coefficients, object masses, lighting conditions, camera angles, actuator delays. The policy must handle all of them, so it learns to be robust.
Physics parameters: friction, damping, mass, center of mass, actuator gains, joint limits, control delay. Visual parameters: lighting direction, color, texture, camera pose, distractor objects.
AutoDR (Automatic Domain Randomization) starts with narrow parameter ranges and progressively widens them as the policy improves. If the policy achieves high reward with friction ∈ [0.4, 0.6], expand to [0.3, 0.7]. This curriculum prevents the policy from being overwhelmed by too much variation early on.
DORAEMON (Domain Randomization with Active Exploration and Model Optimization) combines DR with active system identification: it collects real-world data to estimate which simulator parameters are most likely, then concentrates randomization around that estimate. This focuses the DR budget on the parameters that matter most.
Left: policy trained with fixed friction μ=0.5 (narrow). Right: policy trained with DR μ∈[0.2, 1.2] (wide). The vertical line shows real-world friction. Click Randomize to resample the real-world value.
What if we could have the best of both worlds? RL's ability to optimize beyond human performance, combined with human demonstrations to bootstrap learning and avoid dangerous exploration? That's HIL-SERL (Human-in-the-Loop Sample-Efficient Reinforcement Learning).
The LeRobot framework provides a clean interface for this kind of training. Here's the key configuration for SAC-based policy learning:
# LeRobot RL training configuration (simplified) from lerobot.scripts.train import train # Policy: SAC with image observations policy_config = { "type": "sac", "actor_lr": 3e-4, "critic_lr": 3e-4, "alpha_lr": 3e-4, # entropy temperature (auto-tuned) "gamma": 0.99, "tau": 0.005, # target network soft update rate "batch_size": 256, "replay_buffer_size": 1_000_000, } # Load human demos into replay buffer demo_config = { "demo_path": "data/pcb_insertion_demos/", "num_demos": 50, "demo_sampling_ratio": 0.25, # 25% of each batch from demos } # Train on real robot for 10k steps (~1 hour at 10Hz) train( policy=policy_config, env="real_robot_env", demos=demo_config, total_steps=10_000, eval_freq=500, )
We've talked about rewards as if they're given. In practice, designing the reward function is one of the hardest parts of robot RL. The reward must capture what you want without encoding what you think the solution should look like.
Sparse: r = +1 when the task is complete, 0 otherwise. "Did you fold the shirt?" The agent gets no guidance until it stumbles onto success, which may take millions of trials.
Dense: r = -||pee - ptarget||2 at every step. The agent gets continuous feedback. It learns much faster, but you must be careful: a poorly shaped dense reward can lead to unintended behavior.
Compare how quickly an agent learns with dense vs. sparse reward. Click Run to simulate 500 episodes of training under each scheme.
Instead of manually specifying the reward, learn it from demonstrations. Train a classifier or regressor on expert data: "Given observation o, how close is this to task completion?" This is the approach used in HIL-SERL (Chapter 7) and in inverse RL methods.
RL for robotics is a powerful paradigm: define a reward, let the robot optimize it through experience. But we've seen the costs: millions of samples, safety risks, reward engineering headaches, and the sim-to-real gap.
These costs motivate a natural question: what if we could skip exploration entirely? What if, instead of letting the robot discover behavior through trial and error, we simply showed it what to do?
This is behavioral cloning (BC) and more broadly imitation learning — the subject of Chapter 4. Instead of maximizing a reward, BC minimizes the error between the robot's predicted action and the expert's demonstrated action: minθ E[||πθ(o) - aexpert||2].
But BC has its own problems: compounding errors, multimodal action distributions, and the inability to improve beyond the demonstrator. Chapter 4 will show how modern generative models — VAEs, diffusion, flow matching — solve these issues.