Tan, Zhang, Coumans, Iscen, Bai, Hafner, Bohez, Vanhoucke — 2018

Sim-to-Real: Agile Locomotion

How Google Brain trained a quadruped robot to gallop and trot using deep RL in simulation — then deployed the policies on real hardware with zero fine-tuning, by combining accurate actuator models, latency simulation, and domain randomization.

Prerequisites: RL basics + PPO + PD control
10
Chapters
5+
Simulations

Chapter 0: The Problem

Making a four-legged robot run is one of robotics' oldest headaches. The robot is under-actuated — it has fewer motors than degrees of freedom — and every stride involves split-second contact switches between feet and ground. Classical approaches require an expert to hand-tune gait parameters for weeks. One wrong coefficient and the robot face-plants.

Deep RL seemed like the answer: let an algorithm discover locomotion from scratch. By 2018, simulated agents had learned impressive gaits — bipeds running, dogs navigating terrains. But every one of these successes lived inside a simulator. The moment you uploaded the policy to a real robot, it failed.

Why? Because the simulator lies. It models motors as perfect torque sources (they aren't). It ignores latency (real sensors are delayed). It uses wrong friction coefficients (every floor is different). The policy learned to exploit the simulator's quirks rather than the physics of the real world.

The core challenge: Can we train a locomotion policy entirely in simulation and deploy it on a real quadruped robot with zero additional training on hardware? This paper shows the answer is yes — if you make the simulator honest and the policy robust.
The Sim-to-Real Gap

A policy trained in simulation (left) encounters different physics on a real robot (right). The gap between them is the "reality gap." Toggle to see how performance degrades.

Why do locomotion policies trained in simulation typically fail on real robots?

Chapter 1: The Key Insight

The paper's central thesis is that the reality gap can be narrowed from both sides simultaneously:

  1. Make the simulator more accurate — develop a faithful actuator model (not the default idealized one), simulate latency, and measure real physical parameters.
  2. Make the policy more robust — randomize physics parameters during training so the policy can't overfit to any single simulation configuration.
  3. Keep the policy compact — a small observation space and small network leave less room for overfitting to simulated artifacts.
The intuition in one sentence: If you train across thousands of slightly different simulators (each with different mass, friction, latency, motor strength), and your simulator is already close to reality, then the real world is just "another sample" from the distribution the policy has already mastered.

The combination works because accurate simulation gets you close to reality (reducing the gap from "impossible" to "small"), and domain randomization makes the policy tolerant of whatever residual gap remains. Neither alone is sufficient. Without accurate simulation, randomization ranges would need to be enormous, producing overly conservative gaits. Without randomization, even a good simulator has small discrepancies that compound over hundreds of control steps.

The result: a Minitaur quadruped that gallops at 1.18 m/s (2.18 body lengths per second) in the real world, deployed directly from simulation with no fine-tuning. The learned gaits actually consume 23-35% less power than expert-tuned hand-crafted gaits.

Why is neither accurate simulation nor domain randomization sufficient alone for sim-to-real transfer?

Chapter 2: The Reality Gap

Before we can close the gap, we need to understand exactly what causes it. The paper identifies four major sources of mismatch between simulation and reality:

1. Actuator dynamics

The default physics engine (Bullet) models motors using a constraint-based approach: it solves for the torque that makes the motor hit its target angle by the end of the timestep. This is not how real DC motors work. Real motors have limited torque, nonlinear torque-current curves, and back-EMF that reduces torque at high speeds. The simulated motors are effectively infinitely strong and perfectly responsive.

2. Latency

In the simulator, commands take effect instantly and sensors report back immediately. On the real Minitaur, there's a 15-19ms delay between sending a command and seeing its effect in sensor readings. The on-board PD controller runs on a microcontroller with ~3ms latency, while the neural network policy runs on an Nvidia Jetson TX2 with higher latency. This delay shrinks the stability region of any feedback controller.

3. Sensor noise and bias

The simulated IMU gives perfect roll, pitch, and angular velocity readings. The real IMU has bias drift (especially yaw, which they exclude entirely) and Gaussian noise. Motor velocity estimates are noisy because they're computed from finite differences of encoder readings.

4. Contact dynamics

The friction between rubber feet and carpet varies from floor to floor. The contact model in Bullet (LCP-based) is an approximation of real contact mechanics. Ground compliance, foot deformation, and sliding transitions are all simplified.

The compounding effect: Locomotion amplifies every one of these errors. Each stride involves contact switches that break the control space into fragmented regions. A small torque error that wouldn't matter for a stationary arm causes the wrong foot to strike the ground at the wrong time, which cascades into a completely different trajectory within seconds.
Sources of the Reality Gap

Each source contributes error that compounds during locomotion. Click each source to see its impact.

Why does the default Bullet physics engine overestimate motor capability?

Chapter 3: Domain Randomization

The core idea: at the start of each training episode, randomly sample a set of physics parameters from predefined ranges. The policy then trains across thousands of slightly different "worlds," and must learn a controller that works in all of them.

What gets randomized

The paper randomizes 10 physical parameters. The ranges are chosen carefully — too narrow and the policy overfits; too wide and it becomes overly conservative (slow, cautious gaits).

Mass
80% – 120% of measured value
Motor friction
0 – 0.05 Nm
Inertia
50% – 150% (wider range because estimated, not measured)
Motor strength
80% – 120% (accounts for wear and tear)
Control step
3ms – 20ms (non-realtime OS jitter)
Latency
0ms – 40ms
Battery voltage
14.0V – 16.8V (charge-dependent)
Contact friction
0.5 – 1.25 (rubber on carpet range)
IMU bias
±0.05 radians
IMU noise (std)
0 – 0.05 radians

Why these ranges matter

Mass and motor friction have tight ranges (±20%) because they were carefully measured during system identification. Inertia has a wider range (±50%) because it was estimated from shape and mass assuming uniform density — a rough approximation. Contact friction varies between 0.5 and 1.25 to cover different carpet surfaces. The principle: randomize more where you're less certain.

Random perturbations

In addition to parameter randomization, the paper applies random force perturbations to the robot's base every 200 simulation steps (~1.2 seconds). Each perturbation lasts 10 steps (~0.06 seconds), with random direction and magnitude between 130N and 220N. This forces the policy to learn balance recovery from arbitrary disturbances.

The key principle: Domain randomization trades optimality for robustness. The policy can't achieve the absolute maximum speed in any single configuration, but it achieves good speed across all configurations — including the one that happens to be reality.
Domain Randomization at Work

Each dot is one training episode with different randomized parameters. The policy must handle all of them. Drag the slider to adjust randomization range.

Range50%
Why does the paper randomize inertia over a wider range (±50%) than mass (±20%)?

Chapter 4: Simulation Accuracy

Domain randomization can only compensate for so much error. If your simulator is fundamentally wrong — if it models motors as perfect torque sources when real motors saturate — no amount of randomization will fix that. This chapter covers the simulation improvements that close the gap from the other side.

The actuator model

The real Minitaur uses direct-drive DC motors controlled via PWM signals. The torque of a DC motor follows:

τ = Kt I    where    I = (Vpwm − Vemf) / R

Kt is the torque constant, Vemf = Kt q̇ is the back-EMF (voltage generated by a spinning motor that opposes the input), and R is armature resistance. This means torque decreases as the motor spins faster — something the default Bullet model ignores entirely.

But even this isn't enough. The linear torque-current relation τ = KtI only holds for ideal motors. Real motors exhibit torque saturation: at high currents, the torque plateaus rather than continuing to increase linearly. The paper models this with a piecewise-linear function calibrated from motor testing data.

PD control in simulation

The Minitaur uses position control (PD servo) to actuate motors. The default Bullet implementation is:

en+1 = kp(q̄ − qn+1) + kd(q̇̄ − q̇n+1) = 0

This solves for the torque that makes the motor exactly hit its target by the end of the timestep — an implicit constraint that guarantees convergence. Real PD control uses current state, not future state:

Vpwm = V · (kp(q̄ − qn) + kd(q̇̄ − q̇n))

This subtle difference means simulated motors with high gains are perfectly stable, while real motors with the same gains oscillate and diverge. The paper replaces the default model with the correct causal PD implementation.

Latency modeling

The paper keeps a history of observations {(ti, Oi)} and interpolates to produce delayed readings. When the controller requests an observation at time t, it gets the interpolated value from time (t − tlatency). The latency was measured by sending a spike PWM signal and timing the sensor response: ~3ms for the microcontroller PD loop, ~15-19ms for the neural network policy on the Jetson TX2.

Validation: To test the actuator model, the authors commanded a sine-wave trajectory to a single motor and compared the simulated and real motor angle traces. With the new model, the curves match closely. With the default Bullet model, they diverge significantly. This single improvement — getting the motor model right — is responsible for much of the paper's sim-to-real success.
Actuator Model Comparison

Desired trajectory (gray), default Bullet response (warm/orange), and accurate DC motor response (teal). Toggle between models. The accurate model captures torque saturation and back-EMF.

What is the key difference between Bullet's default motor model and the paper's accurate actuator model?

Chapter 5: Policy Architecture

The paper uses a surprisingly small neural network for the locomotion controller. The philosophy: compact policies generalize better across the sim-to-real gap.

Observation space

The observations are deliberately minimal. For galloping (learned from scratch):

That's 12 dimensions. For trotting (with reference gait), even less — just 4 dimensions (roll, pitch, and their angular velocities). The motor angles are dropped because the reference signal already provides the leg trajectory.

Critically, they exclude several available sensors: yaw (drifts quickly), motor velocities (noisy), and foot contact forces (no sensor). This compact observation space is a form of regularization — it prevents the policy from overfitting to simulation-specific signals.

Action space: leg space, not motor space

Each leg has two motors (θ1, θ2). The paper parameterizes actions in leg space as (swing, extension) pairs:

θ1 = e + s     θ2 = e − s

Extension (e) sets the leg length by rotating both motors in opposite directions. Swing (s) rotates the entire leg by moving both motors in the same direction. This is much better than raw motor angles because it prunes out self-collision configurations that would waste learning time.

Hybrid policy: open-loop + feedback

The full action is:

a(t, o) = ā(t) + π(o)

Where ā(t) is an optional open-loop reference signal (e.g., a sine wave defining a trot pattern) and π(o) is the learned feedback controller. By adjusting the output bounds of π(o), users control how much the learned policy can deviate from the reference. Set bounds to zero for a fully user-specified gait; set ā(t) = 0 with wide bounds for fully learned-from-scratch behavior.

Network size

Both the policy and value networks are fully connected with two hidden layers. Sizes are found via hyperparameter search:

These are tiny by modern standards — roughly 20K-30K parameters. Small enough to run at 150-200 Hz on a Jetson TX2.

Design philosophy: The observation space, action space, and network size all follow the same principle: use the minimum representation that solves the task. Extra dimensions give the policy more room to overfit simulated artifacts, which hurts transfer. This is the opposite of the "bigger is better" trend in deep learning — for sim-to-real, smaller is more robust.
Why does the paper use leg space (swing, extension) instead of raw motor angles as the action space?

Chapter 6: Reward Shaping

The reward function is strikingly simple — two terms:

r = (pn − pn−1) · d  −  w Δt |τn · q̇n|

Unpacking the reward

Term 1: Forward progress. (pn − pn−1) · d measures how far the robot moved toward the desired direction d in one timestep. This directly rewards speed.

Term 2: Energy penalty.n · q̇n| is the instantaneous mechanical power (torque times angular velocity). The weight w = 0.008 balances speed against efficiency. The paper notes the algorithm is robust to a wide range of w values.

Episode termination

An episode ends after 1000 steps (~6 seconds) or when the base tilts more than 0.5 radians (~29 degrees). This implicit stability constraint means the policy must maintain balance to accumulate reward — no need for an explicit balance penalty.

What the reward doesn't specify

Notice what's absent from the reward: no gait pattern, no foot timing, no ground clearance, no symmetry. When learning from scratch (ā(t) = 0), the gait emerges entirely from the optimization. And remarkably, galloping — the fastest natural quadruped gait — is what usually emerges.

Reward minimalism: By only rewarding speed and penalizing energy, the authors let the optimization discover that galloping is the Pareto-optimal gait for fast, efficient quadruped locomotion. This is the same gait that natural selection "discovered" for horses, dogs, and cheetahs. RL converges to the same solution from pure optimization, without any biological prior.

Reference-guided trotting

When users want a specific gait style, they provide an open-loop reference. For trotting, two sine curves define the swing and extension of each leg:

s̄(t) = 0.3 sin(4πt)     ē(t) = 0.35 sin(4πt) + 2

Diagonal leg pairs share the same phase; the other pair is 180 degrees out of phase (defining a trot). The feedback controller π(o) has tight output bounds (±0.25 radians) so it can adjust for balance but can't deviate far from the reference style.

Why does galloping — rather than walking or hopping — typically emerge when training from scratch with only speed and energy rewards?

Chapter 7: Results

The paper evaluates two locomotion gaits on the Minitaur quadruped robot, both trained entirely in simulation using PPO.

Galloping (learned from scratch)

With no reference gait and wide feedback bounds, the RL system discovers galloping autonomously. Performance:

Trotting (reference-guided)

With a sine-wave reference and tight feedback bounds, the system learns balanced trotting. Using only 4-dimensional observations (IMU only, no motor angles):

The surprise: The learned gaits are not only transferable — they're better than expert-tuned gaits. At comparable speeds, learned galloping uses 35% less power and learned trotting uses 23% less power. RL found more energy-efficient motor coordination patterns than human engineers could design by hand.

Training details

Training runs 25 parallel rollouts per policy update, each up to 1000 steps. Galloping converges in ~3.25 hours (7 million simulation steps). Trotting takes ~4.35 hours. After training, policies are deployed on the real Minitaur with zero fine-tuning.

Reality gap measurement

The paper measures the reality gap as the difference in expected return between simulation and real-world runs. For each configuration, they train 100 controllers with different hyperparameters and random seeds, deploy the top 3 on hardware, and average 9 real-world runs (3 controllers, 3 runs each).

Sim vs Real Performance

Comparison of learned gaits (green) vs expert hand-crafted gaits (gray) on speed and power consumption.

How do the learned gaits compare to expert hand-crafted gaits in power consumption?

Chapter 8: Ablation

The paper carefully ablates each component to understand its contribution. Using trotting as the test case, they train 100 controllers for each configuration and deploy the top 3 on the real robot.

Impact of simulation accuracy

Three configurations were compared:

  1. Baseline simulation: Default Bullet motor model, no latency. Sim return: ~1.8, Real return: ~0.3. Massive gap.
  2. Baseline + perturbations: Adding random force perturbations during training. Sim return: ~1.5, Real return: ~0.5. Perturbations help but can't compensate for fundamentally wrong physics.
  3. Accurate sim + perturbations: New actuator model + latency + perturbations. Sim return: ~1.5, Real return: ~1.4. Near-zero gap.
The takeaway: Accurate simulation is the single most impactful factor. Going from baseline to accurate simulation + perturbations reduces the reality gap from ~1.5 to ~0.1. Perturbations alone (without accurate simulation) barely help.

Impact of domain randomization

With the accurate simulator, adding dynamics randomization increases the real-world expected return from ~3.6 to ~4.2 for galloping. The improvement is modest because the accurate simulator already gets most of the way there. Randomization provides the final margin of robustness.

Impact of observation space

This is a surprising result. For trotting with the reference gait:

The motor angle observations give the policy more information but also more opportunity to overfit to simulation-specific motor dynamics. By dropping them, the policy is forced to rely on the reference signal for leg trajectory and use IMU feedback only for balance — which transfers cleanly.

Impact of network size

Hyperparameter search over network sizes found that two hidden layers with 85-185 units per layer work well. Larger networks didn't improve performance in simulation and transferred worse to hardware.

Ablation: Reality Gap Reduction

Sim return (teal) vs real return (warm/orange) for three configurations. The gap between bars is the reality gap.

Why does reducing the observation space from 12D to 4D (removing motor angles) improve sim-to-real transfer for trotting?

Chapter 9: Connections

What this paper built on

PPO (Schulman et al., 2017): The RL algorithm used for training. PPO's stability and ease of parallelization made it the right choice for sim-to-real, where you need to train across thousands of randomized environments reliably.

Domain randomization (Tobin et al., 2017): Originally proposed for visual sim-to-real transfer in grasping (randomizing textures, lighting, camera positions). This paper extends the idea to dynamics randomization — randomizing physical parameters rather than visual ones.

Bullet Physics / PyBullet (Coumans et al.): The open-source physics engine used for simulation. The paper's actuator model improvements were contributed back to the Bullet library, benefiting the entire robotics community.

What this paper enabled

OpenAI In-Hand Manipulation (2019): Used domain randomization at massive scale (hundreds of parameters) to transfer dexterous manipulation policies for a Rubik's cube from simulation to a Shadow Hand. Directly inspired by this paper's approach.

ANYmal / ETH Zurich (2019-2023): Applied sim-to-real with domain randomization to larger quadrupeds, eventually achieving parkour-level agility. Lee et al. (2020) trained ANYmal to walk over rough terrain using similar techniques.

Boston Dynamics / Spot: While Boston Dynamics uses more traditional control methods, the sim-to-real paradigm demonstrated here influenced the field's trajectory toward learning-based locomotion controllers.

Domain adaptation alternatives: This paper takes the "make training robust" approach. An alternative is domain adaptation: train a mapping between sim and real domains. RCAN (James et al., 2019) and DARC (Eysenbach et al., 2020) pursue this direction. Teacher-student distillation (Chen et al., 2020) uses a teacher with privileged information in sim to train a student with realistic observations.

Legacy: This paper established the template for sim-to-real locomotion that the field still follows: accurate simulation + domain randomization + compact policies + careful ablation. Every major result in learned quadruped locomotion since 2018 — from ANYmal parkour to Unitree Go1 agility — uses some variant of this recipe. The paper also demonstrated that RL can discover gaits that are not just functional but superior to expert-designed ones, shifting the field's expectation of what learned controllers can achieve.

Cheat sheet

Core method
Accurate actuator model + latency simulation + domain randomization + compact policy
Key numbers
1.18 m/s gallop, 0.60 m/s trot, 35% power reduction, zero fine-tuning
Policy
2-layer MLP (185, 95), 12D obs, PPO training, ~3 hours
Randomization
10 parameters: mass, friction, inertia, motor strength, latency, voltage, IMU noise
Impact
Template for all subsequent sim-to-real locomotion (ANYmal, OpenAI Hand, Unitree)
What is the key difference between domain randomization (this paper's approach) and domain adaptation for sim-to-real transfer?