How Google Brain trained a quadruped robot to gallop and trot using deep RL in simulation — then deployed the policies on real hardware with zero fine-tuning, by combining accurate actuator models, latency simulation, and domain randomization.
Making a four-legged robot run is one of robotics' oldest headaches. The robot is under-actuated — it has fewer motors than degrees of freedom — and every stride involves split-second contact switches between feet and ground. Classical approaches require an expert to hand-tune gait parameters for weeks. One wrong coefficient and the robot face-plants.
Deep RL seemed like the answer: let an algorithm discover locomotion from scratch. By 2018, simulated agents had learned impressive gaits — bipeds running, dogs navigating terrains. But every one of these successes lived inside a simulator. The moment you uploaded the policy to a real robot, it failed.
Why? Because the simulator lies. It models motors as perfect torque sources (they aren't). It ignores latency (real sensors are delayed). It uses wrong friction coefficients (every floor is different). The policy learned to exploit the simulator's quirks rather than the physics of the real world.
A policy trained in simulation (left) encounters different physics on a real robot (right). The gap between them is the "reality gap." Toggle to see how performance degrades.
The paper's central thesis is that the reality gap can be narrowed from both sides simultaneously:
The combination works because accurate simulation gets you close to reality (reducing the gap from "impossible" to "small"), and domain randomization makes the policy tolerant of whatever residual gap remains. Neither alone is sufficient. Without accurate simulation, randomization ranges would need to be enormous, producing overly conservative gaits. Without randomization, even a good simulator has small discrepancies that compound over hundreds of control steps.
The result: a Minitaur quadruped that gallops at 1.18 m/s (2.18 body lengths per second) in the real world, deployed directly from simulation with no fine-tuning. The learned gaits actually consume 23-35% less power than expert-tuned hand-crafted gaits.
Before we can close the gap, we need to understand exactly what causes it. The paper identifies four major sources of mismatch between simulation and reality:
The default physics engine (Bullet) models motors using a constraint-based approach: it solves for the torque that makes the motor hit its target angle by the end of the timestep. This is not how real DC motors work. Real motors have limited torque, nonlinear torque-current curves, and back-EMF that reduces torque at high speeds. The simulated motors are effectively infinitely strong and perfectly responsive.
In the simulator, commands take effect instantly and sensors report back immediately. On the real Minitaur, there's a 15-19ms delay between sending a command and seeing its effect in sensor readings. The on-board PD controller runs on a microcontroller with ~3ms latency, while the neural network policy runs on an Nvidia Jetson TX2 with higher latency. This delay shrinks the stability region of any feedback controller.
The simulated IMU gives perfect roll, pitch, and angular velocity readings. The real IMU has bias drift (especially yaw, which they exclude entirely) and Gaussian noise. Motor velocity estimates are noisy because they're computed from finite differences of encoder readings.
The friction between rubber feet and carpet varies from floor to floor. The contact model in Bullet (LCP-based) is an approximation of real contact mechanics. Ground compliance, foot deformation, and sliding transitions are all simplified.
Each source contributes error that compounds during locomotion. Click each source to see its impact.
The core idea: at the start of each training episode, randomly sample a set of physics parameters from predefined ranges. The policy then trains across thousands of slightly different "worlds," and must learn a controller that works in all of them.
The paper randomizes 10 physical parameters. The ranges are chosen carefully — too narrow and the policy overfits; too wide and it becomes overly conservative (slow, cautious gaits).
Mass and motor friction have tight ranges (±20%) because they were carefully measured during system identification. Inertia has a wider range (±50%) because it was estimated from shape and mass assuming uniform density — a rough approximation. Contact friction varies between 0.5 and 1.25 to cover different carpet surfaces. The principle: randomize more where you're less certain.
In addition to parameter randomization, the paper applies random force perturbations to the robot's base every 200 simulation steps (~1.2 seconds). Each perturbation lasts 10 steps (~0.06 seconds), with random direction and magnitude between 130N and 220N. This forces the policy to learn balance recovery from arbitrary disturbances.
Each dot is one training episode with different randomized parameters. The policy must handle all of them. Drag the slider to adjust randomization range.
Domain randomization can only compensate for so much error. If your simulator is fundamentally wrong — if it models motors as perfect torque sources when real motors saturate — no amount of randomization will fix that. This chapter covers the simulation improvements that close the gap from the other side.
The real Minitaur uses direct-drive DC motors controlled via PWM signals. The torque of a DC motor follows:
Kt is the torque constant, Vemf = Kt q̇ is the back-EMF (voltage generated by a spinning motor that opposes the input), and R is armature resistance. This means torque decreases as the motor spins faster — something the default Bullet model ignores entirely.
But even this isn't enough. The linear torque-current relation τ = KtI only holds for ideal motors. Real motors exhibit torque saturation: at high currents, the torque plateaus rather than continuing to increase linearly. The paper models this with a piecewise-linear function calibrated from motor testing data.
The Minitaur uses position control (PD servo) to actuate motors. The default Bullet implementation is:
This solves for the torque that makes the motor exactly hit its target by the end of the timestep — an implicit constraint that guarantees convergence. Real PD control uses current state, not future state:
This subtle difference means simulated motors with high gains are perfectly stable, while real motors with the same gains oscillate and diverge. The paper replaces the default model with the correct causal PD implementation.
The paper keeps a history of observations {(ti, Oi)} and interpolates to produce delayed readings. When the controller requests an observation at time t, it gets the interpolated value from time (t − tlatency). The latency was measured by sending a spike PWM signal and timing the sensor response: ~3ms for the microcontroller PD loop, ~15-19ms for the neural network policy on the Jetson TX2.
Desired trajectory (gray), default Bullet response (warm/orange), and accurate DC motor response (teal). Toggle between models. The accurate model captures torque saturation and back-EMF.
The paper uses a surprisingly small neural network for the locomotion controller. The philosophy: compact policies generalize better across the sim-to-real gap.
The observations are deliberately minimal. For galloping (learned from scratch):
That's 12 dimensions. For trotting (with reference gait), even less — just 4 dimensions (roll, pitch, and their angular velocities). The motor angles are dropped because the reference signal already provides the leg trajectory.
Critically, they exclude several available sensors: yaw (drifts quickly), motor velocities (noisy), and foot contact forces (no sensor). This compact observation space is a form of regularization — it prevents the policy from overfitting to simulation-specific signals.
Each leg has two motors (θ1, θ2). The paper parameterizes actions in leg space as (swing, extension) pairs:
Extension (e) sets the leg length by rotating both motors in opposite directions. Swing (s) rotates the entire leg by moving both motors in the same direction. This is much better than raw motor angles because it prunes out self-collision configurations that would waste learning time.
The full action is:
Where ā(t) is an optional open-loop reference signal (e.g., a sine wave defining a trot pattern) and π(o) is the learned feedback controller. By adjusting the output bounds of π(o), users control how much the learned policy can deviate from the reference. Set bounds to zero for a fully user-specified gait; set ā(t) = 0 with wide bounds for fully learned-from-scratch behavior.
Both the policy and value networks are fully connected with two hidden layers. Sizes are found via hyperparameter search:
These are tiny by modern standards — roughly 20K-30K parameters. Small enough to run at 150-200 Hz on a Jetson TX2.
The reward function is strikingly simple — two terms:
Term 1: Forward progress. (pn − pn−1) · d measures how far the robot moved toward the desired direction d in one timestep. This directly rewards speed.
Term 2: Energy penalty. |τn · q̇n| is the instantaneous mechanical power (torque times angular velocity). The weight w = 0.008 balances speed against efficiency. The paper notes the algorithm is robust to a wide range of w values.
An episode ends after 1000 steps (~6 seconds) or when the base tilts more than 0.5 radians (~29 degrees). This implicit stability constraint means the policy must maintain balance to accumulate reward — no need for an explicit balance penalty.
Notice what's absent from the reward: no gait pattern, no foot timing, no ground clearance, no symmetry. When learning from scratch (ā(t) = 0), the gait emerges entirely from the optimization. And remarkably, galloping — the fastest natural quadruped gait — is what usually emerges.
When users want a specific gait style, they provide an open-loop reference. For trotting, two sine curves define the swing and extension of each leg:
Diagonal leg pairs share the same phase; the other pair is 180 degrees out of phase (defining a trot). The feedback controller π(o) has tight output bounds (±0.25 radians) so it can adjust for balance but can't deviate far from the reference style.
The paper evaluates two locomotion gaits on the Minitaur quadruped robot, both trained entirely in simulation using PPO.
With no reference gait and wide feedback bounds, the RL system discovers galloping autonomously. Performance:
With a sine-wave reference and tight feedback bounds, the system learns balanced trotting. Using only 4-dimensional observations (IMU only, no motor angles):
Training runs 25 parallel rollouts per policy update, each up to 1000 steps. Galloping converges in ~3.25 hours (7 million simulation steps). Trotting takes ~4.35 hours. After training, policies are deployed on the real Minitaur with zero fine-tuning.
The paper measures the reality gap as the difference in expected return between simulation and real-world runs. For each configuration, they train 100 controllers with different hyperparameters and random seeds, deploy the top 3 on hardware, and average 9 real-world runs (3 controllers, 3 runs each).
Comparison of learned gaits (green) vs expert hand-crafted gaits (gray) on speed and power consumption.
The paper carefully ablates each component to understand its contribution. Using trotting as the test case, they train 100 controllers for each configuration and deploy the top 3 on the real robot.
Three configurations were compared:
With the accurate simulator, adding dynamics randomization increases the real-world expected return from ~3.6 to ~4.2 for galloping. The improvement is modest because the accurate simulator already gets most of the way there. Randomization provides the final margin of robustness.
This is a surprising result. For trotting with the reference gait:
The motor angle observations give the policy more information but also more opportunity to overfit to simulation-specific motor dynamics. By dropping them, the policy is forced to rely on the reference signal for leg trajectory and use IMU feedback only for balance — which transfers cleanly.
Hyperparameter search over network sizes found that two hidden layers with 85-185 units per layer work well. Larger networks didn't improve performance in simulation and transferred worse to hardware.
Sim return (teal) vs real return (warm/orange) for three configurations. The gap between bars is the reality gap.
PPO (Schulman et al., 2017): The RL algorithm used for training. PPO's stability and ease of parallelization made it the right choice for sim-to-real, where you need to train across thousands of randomized environments reliably.
Domain randomization (Tobin et al., 2017): Originally proposed for visual sim-to-real transfer in grasping (randomizing textures, lighting, camera positions). This paper extends the idea to dynamics randomization — randomizing physical parameters rather than visual ones.
Bullet Physics / PyBullet (Coumans et al.): The open-source physics engine used for simulation. The paper's actuator model improvements were contributed back to the Bullet library, benefiting the entire robotics community.
OpenAI In-Hand Manipulation (2019): Used domain randomization at massive scale (hundreds of parameters) to transfer dexterous manipulation policies for a Rubik's cube from simulation to a Shadow Hand. Directly inspired by this paper's approach.
ANYmal / ETH Zurich (2019-2023): Applied sim-to-real with domain randomization to larger quadrupeds, eventually achieving parkour-level agility. Lee et al. (2020) trained ANYmal to walk over rough terrain using similar techniques.
Boston Dynamics / Spot: While Boston Dynamics uses more traditional control methods, the sim-to-real paradigm demonstrated here influenced the field's trajectory toward learning-based locomotion controllers.
Domain adaptation alternatives: This paper takes the "make training robust" approach. An alternative is domain adaptation: train a mapping between sim and real domains. RCAN (James et al., 2019) and DARC (Eysenbach et al., 2020) pursue this direction. Teacher-student distillation (Chen et al., 2020) uses a teacher with privileged information in sim to train a student with realistic observations.