Introduction

GPT-3 was trained on 300 billion tokens. GPT-4 reportedly consumed trillions. CLIP was trained on 400 million image-text pairs. LAION-5B provides 5.85 billion image-text pairs for open research. These numbers are staggering, but they share a common origin: the data was scraped from the internet, where humans have spent decades producing text and images at planetary scale.

Robotics has no internet. Every trajectory in a robot learning dataset was collected by a physical robot, in a physical environment, controlled by a human operator or an autonomous policy — one episode at a time. A skilled teleoperator can collect roughly 100–200 episodes per day. A lab running three shifts with multiple operators might manage 500. At this rate, collecting one million episodes takes approximately 5,500 operator-days — about 15 years of continuous work for a single person.

This is the fundamental constraint of robot learning: data is scarce, expensive, and slow to collect. Every advance in VLA model architecture — no matter how elegant — runs headlong into this wall. RT-2 can use a 55B-parameter vision-language model backbone, but it still needs robot demonstrations to learn manipulation. OpenVLA can inherit broad visual and language understanding from pretrained models, but it still needs to see what "pick up the blue cup" looks like when executed by a physical robot arm.

This article surveys the landscape of robot learning data: the major datasets that exist today, how they were collected, how they are formatted, and what we have learned about how performance scales with data quantity and diversity. We will examine the ambitious vision of cross-embodiment generalization — training a single policy on data from many different robot types — and the practical challenges of making that work.

ℹ What this article covers

We cover the major robot learning datasets (Open X-Embodiment, DROID, Bridge V2), data collection methods (kinesthetic teaching, teleoperation, VR, autonomous collection), data formats (RLDS), cross-embodiment generalization, data scaling laws, augmentation strategies for robotics, and sim-to-real transfer. Code examples show how to load and process robot learning data in practice.

The Data Landscape

The scale gap

To appreciate the data challenge in robotics, consider the orders-of-magnitude gap between robot data and data in other domains:

| Domain | Dataset | Scale | Source |
|---|---|---|---|
| NLP | Common Crawl (RedPajama) | ~30 trillion tokens | Web scraping |
| NLP | The Pile | 825 GB text | Curated web + books |
| Vision | LAION-5B | 5.85 billion image-text pairs | Web scraping |
| Vision | ImageNet-21k | 14.2 million images | Manual annotation |
| Video | WebVid-10M | 10.7 million video clips | Web scraping |
| Robotics | Open X-Embodiment | ~1 million episodes | 21 institutions, years of effort |
| Robotics | DROID | 76K trajectories | 13 institutions, standardized protocol |
| Robotics | Bridge V2 | 60K+ trajectories | Single lab, WidowX robot |
| Robotics | RT-1 training set | ~130K episodes | Google, 17 months of collection |

The gap is roughly seven orders of magnitude. NLP has tens of trillions of tokens; robotics has about a million episodes, where each episode might contain 50–500 timesteps. Even counting individual observation-action pairs (perhaps 100 million across all datasets combined), we are still five orders of magnitude behind language and nearly two behind vision.

[Interactive figure: Dataset Scale Comparison — Robotics vs. NLP & Vision. Dataset sizes across domains on a logarithmic scale, where each gridline is 10× more data. The gulf between robot data and internet data is the defining constraint of the field.]

Why so little data?

Several factors conspire to keep robot datasets small:

  • Physical collection is slow. A teleoperation episode for a pick-and-place task takes 10–60 seconds, plus time for environment reset. With overhead for setup, calibration, and operator fatigue, throughput is 10–20 episodes per hour.
  • Hardware is expensive and fragile. A single Franka Emika Panda arm costs ~$30,000 (the newer FR3 model is somewhat cheaper). Add sensors, grippers, a compute stack, and a structured workspace, and each data collection station costs $50,000–$100,000. Hardware breaks: motors burn out, cables fray, sensors degrade.
  • Environments require manual design. Every scene — which objects, where they are placed, what the table looks like — must be set up by a human. To get diversity in object appearance, backgrounds, and arrangements, someone has to physically rearrange the scene between episodes.
  • Operator skill varies. Teleoperation quality depends on the operator. Noisy, inconsistent demonstrations introduce noise into the training data. Unlike web data, there is no natural quality signal (no likes, no clicks, no editorial filtering).
  • No sharing standard existed. Until recently, every lab stored data in its own format, with its own action conventions, camera setups, and naming schemes. Combining datasets required heroic engineering effort.

💡 The bitter lesson for robotics

Rich Sutton's "bitter lesson" argues that general methods leveraging computation and data ultimately outperform those relying on human-engineered domain knowledge. The success of LLMs validated this for language. But robotics cannot yet follow this playbook because the data does not exist at scale. The question is whether clever data collection, simulation, and cross-embodiment transfer can close the gap — or whether robotics needs a fundamentally different approach.

Open X-Embodiment

The Open X-Embodiment (OXE) dataset, introduced by Padalkar et al. (2023) as part of the RT-X project, represents the largest coordinated effort to pool robot learning data across institutions. It is a consortium-driven dataset created by 21 research institutions, combining demonstrations from 22 different robot embodiments into a single unified dataset.

The key statistics:

  • ~1 million episodes of robot manipulation data
  • 22 robot embodiments — from the Google Robot and Franka Panda to WidowX, xArm, ALOHA, and mobile manipulators
  • 527 skills described in natural language instructions
  • 160,266 tasks across diverse manipulation behaviors
  • Data contributed by labs at Google, Stanford, UC Berkeley, CMU, Columbia, MIT, and others

The central insight motivating OXE is that cross-embodiment training can produce policies that outperform single-dataset training. The RT-X experiments demonstrated that a policy trained on the full OXE mixture outperformed specialist policies trained on any single constituent dataset for the majority of evaluation tasks. Positive transfer occurs even between very different robots, suggesting that manipulation skills share common structure across embodiments.

Assembly and standardization

Assembling OXE required solving massive standardization challenges. Each contributing lab had its own data format, naming conventions, action representations, and camera configurations. The OXE team converged on several key decisions:

  • RLDS format. All data was converted to the Reinforcement Learning Datasets (RLDS) format, built on TensorFlow Datasets. This provides a consistent structure: datasets contain episodes, episodes contain timesteps, each timestep has an observation dict, action, reward, and metadata.
  • Action space unification. Actions were standardized to a 7-dimensional end-effector delta format: [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]. Labs using joint-space control had their data converted using forward kinematics. Absolute position commands were differenced into deltas.
  • Language annotations. Each episode includes a natural language instruction describing the task. For datasets that lacked language labels, annotators added them retroactively. The quality and specificity of these annotations varies significantly across constituent datasets.
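The position-differencing step of that conversion can be sketched in a few lines of Python — a minimal illustration, not the actual OXE conversion code, assuming trajectories store absolute end-effector poses as [x, y, z, roll, pitch, yaw, gripper]:

```python
def absolute_to_delta(trajectory):
    """Convert absolute end-effector poses into delta actions.

    trajectory: list of 7-D poses [x, y, z, roll, pitch, yaw, gripper].
    Returns one delta action per transition. Naive Euler-angle
    differencing assumes small per-step rotations (no wraparound
    handling); the gripper command is carried over absolutely.
    """
    deltas = []
    for prev, curr in zip(trajectory, trajectory[1:]):
        delta = [c - p for c, p in zip(curr[:6], prev[:6])]
        delta.append(curr[6])  # gripper state is absolute, not a delta
        deltas.append(delta)
    return deltas

# Toy trajectory: reach forward, then close the gripper.
poses = [
    [0.40, 0.00, 0.30, 0.0, 0.0, 0.0, 1.0],
    [0.42, 0.01, 0.28, 0.0, 0.0, 0.1, 1.0],
    [0.42, 0.01, 0.25, 0.0, 0.0, 0.1, 0.0],  # close gripper
]
actions = absolute_to_delta(poses)
```

Joint-space data would first be mapped through forward kinematics to end-effector poses before this differencing step.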

Challenges

Despite standardization efforts, significant heterogeneity remains in OXE:

| Dimension | Variation Across OXE | Impact |
|---|---|---|
| Control frequency | 3 Hz (some mobile robots) to 50 Hz (dexterous hands) | Temporal dynamics differ; the same action delta means different things at different frequencies |
| Camera viewpoint | Third-person overhead, angled, wrist-mounted, egocentric | Visual features are viewpoint-dependent; a wrist-camera trajectory looks nothing like an overhead view |
| Image resolution | 64×64 to 640×480 | Fine-grained manipulation details may be invisible in low-resolution data |
| Action magnitude | Millimeter-scale (precision assembly) to meter-scale (navigation) | Naïve normalization can wash out fine-grained actions or clip large ones |
| Gripper type | Parallel jaw, suction cup, dexterous hand, soft gripper | A single gripper dimension cannot represent multi-finger dexterous manipulation |
| Task complexity | Simple pick-place to multi-step cooking sequences | Mixing simple and complex tasks creates imbalanced training signals |

ℹ Dataset mixing ratios matter

Naïvely mixing all OXE datasets with uniform sampling hurts performance. The RT-X paper found that carefully tuned mixing ratios — oversampling high-quality datasets and undersampling noisy or out-of-distribution ones — was critical. This is analogous to the data mixture tuning done for LLM pretraining (e.g., Llama's careful balancing of code, web text, and books), but with fewer principled methods for determining optimal ratios.
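In code, a dataset mixture reduces to weighted sampling over the constituent datasets — a sketch; the weights below are made-up placeholders, not the published RT-X ratios:

```python
import random

# Hypothetical mixture weights: oversample high-quality data,
# undersample noisy contributors. NOT the actual RT-X ratios.
mixture = {
    "bridge_v2": 0.30,
    "rt1": 0.40,
    "droid_subset": 0.20,
    "misc_small_labs": 0.10,
}

def sample_dataset(rng=random):
    """Pick which dataset the next training example comes from."""
    names = list(mixture)
    weights = [mixture[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Empirical sampling frequencies approach the mixture weights.
random.seed(0)
counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[sample_dataset()] += 1
```

In a real training pipeline the same idea is usually expressed as per-dataset sampling weights passed to the data loader rather than an explicit loop.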

DROID

DROID (Distributed Robot Interaction Dataset), introduced by Khazatsky et al. (2024), takes a complementary approach to OXE. Instead of pooling heterogeneous data from many robot types, DROID standardizes the entire collection protocol around a single robot platform: the Franka Emika Panda. The result is a more homogeneous but deeply diverse dataset.

Key statistics:

  • 76,000 trajectories of single-arm manipulation
  • 564 scenes with diverse backgrounds, lighting conditions, and object configurations
  • 86 tasks including pick-and-place, drawer manipulation, pouring, and tool use
  • Collected across 13 institutions using a standardized hardware and software protocol
  • Multi-view cameras: two external cameras plus a wrist-mounted camera

The design philosophy of DROID prioritizes visual diversity over embodiment diversity. By fixing the robot platform, DROID eliminates action space heterogeneity entirely — every trajectory uses the same joint configuration, the same gripper, the same control frequency (typically 20 Hz). The diversity comes from environments: different kitchens, offices, lab benches, different lighting (natural light, fluorescent, dim), different objects (hundreds of everyday items), and different operators.

This turns out to be a powerful strategy. Policies trained on DROID exhibit strong zero-shot generalization to novel visual environments because they have seen the same manipulation skills in many different visual contexts. The fixed embodiment means the policy does not need to disentangle "what to do" from "how this specific robot moves."

💡 DROID vs. OXE: which diversity matters more?

OXE maximizes embodiment diversity (many robots). DROID maximizes scene diversity (many environments). The emerging answer from empirical results is that both matter, but for different purposes. Embodiment diversity enables cross-robot transfer. Scene diversity enables visual generalization. For deploying a specific robot in new environments, scene diversity may be more immediately valuable.

Bridge V2

Bridge V2 (Walke et al., 2023) demonstrates a different philosophy: low-cost data collection at scale using inexpensive hardware. The dataset was collected using the WidowX 250 robot arm, which costs approximately $3,000 — an order of magnitude less than a Franka Panda.

Key statistics:

  • 60,000+ trajectories across 13 manipulation skills
  • Collected at UC Berkeley, primarily in standardized toy kitchen environments
  • Objects include common household items: cups, bowls, pots, utensils, sponges, and others
  • Actions in end-effector delta format at ~5 Hz
  • Single third-person camera view (and some episodes with wrist camera)

Bridge V2 has become one of the most widely used datasets in VLA research for several reasons. First, the low-cost hardware means other labs can easily replicate the setup and collect compatible data. Second, the task distribution (kitchen manipulation) is a practical and compelling domain. Third, the dataset has been carefully curated with language annotations for every episode.

OpenVLA (Kim et al., 2024) used Bridge V2 as one of its primary evaluation benchmarks, demonstrating that a 7B-parameter VLA model fine-tuned on Bridge V2 data could generalize to novel objects and scene configurations not seen during training. The dataset's size and quality make it a de facto standard for evaluating manipulation policies on low-cost hardware.

Dataset philosophies at a glance:

  • Open X-Embodiment — cross-embodiment scale: ~1M episodes, 22 robots, 527 skills. Maximum breadth: many robot types, many tasks, many institutions. Heterogeneous quality and formats unified through RLDS.
  • DROID — controlled visual diversity: 76K trajectories, 1 robot platform, 564 scenes. Fixed embodiment, maximum visual diversity; a standardized protocol enables consistent quality across 13 labs.
  • Bridge V2 — low-cost depth: 60K+ trajectories on ~$3K hardware. Proves that useful robot data does not require expensive robots. Kitchen manipulation with careful curation and language annotations.
  • RT-1 dataset — single-lab industrial scale: ~130K episodes collected over 17 months at Google with the Everyday Robots mobile manipulator. 700+ tasks, consistent quality, but only one robot type and one environment.

Data Collection Methods

Every robot learning dataset requires a method for generating demonstration trajectories. The choice of collection method has profound effects on data quality, throughput, and the types of behaviors that can be demonstrated. There are four main paradigms, each with distinct tradeoffs.

Kinesthetic teaching

The most intuitive method: a human physically grasps the robot arm and guides it through the desired motion. The robot's encoders record joint positions at each timestep, producing a trajectory in joint space. Kinesthetic teaching is used when the robot's compliance mode allows safe physical interaction (e.g., Franka Panda in gravity compensation mode).

Advantages: No additional hardware required. The operator feels the forces and contacts, enabling demonstrations of delicate tasks (inserting a peg, wiping a surface). The resulting trajectories are smooth and physically consistent because the operator is directly controlling the robot's actual body.

Disadvantages: Slow — the operator must stand at the robot, which is physically tiring. Limited to robots with compliant control modes. Does not scale to remote operation. The operator's body can occlude cameras. For bimanual tasks, it requires two operators or sequential demonstration of each arm.

Leader-follower (ALOHA)

In a leader-follower setup, the operator controls a "leader" robot, and a "follower" robot mirrors the leader's movements in real time. The leader can be a low-cost puppet arm with matched kinematics, making it easy to command complex motions. The ALOHA system (Zhao et al., 2023) popularized this approach for bimanual manipulation.

ALOHA pairs low-cost WidowX 250 leader arms with ViperX 300 follower arms. The leader arms are backdriven freely by the operator — no motors are engaged — and the follower arms replicate the motion. This enables natural bimanual demonstrations (cooking, folding, assembling) that would be extremely difficult with joystick-based teleoperation.

Advantages: Excellent for bimanual tasks. Low latency between operator intent and robot motion. The operator can feel resistance through the leader arm (passive feedback). Relatively inexpensive leader hardware.

Disadvantages: Requires matched kinematics between leader and follower (or careful retargeting). The operator must be co-located with the robot. Workspace is limited by the leader arm's reach.

VR teleoperation

The operator wears a VR headset and uses hand controllers (or hand tracking) to specify desired end-effector poses. The robot tracks these poses using inverse kinematics and a real-time control loop. The operator sees the robot's camera feed in the VR headset, providing an immersive first-person view. DROID uses this approach extensively, with operators using Meta Quest headsets.

Advantages: Remote operation is possible (the operator does not need to be near the robot). Natural 6-DoF hand motion is easy to demonstrate. Can be extended to dexterous hand control via finger tracking. Comfortable for long sessions because the operator can sit.

Disadvantages: VR introduces latency (50–200ms round-trip), which degrades demonstration quality for fast or contact-rich tasks. Lack of haptic feedback means the operator cannot feel when the robot contacts an object, leading to collisions and imprecise grasps. Requires calibration between VR and robot coordinate frames.

Autonomous data collection

Once a robot has a basic policy, it can collect additional data autonomously by executing the policy and recording outcomes. A human reviews the results and labels them as success or failure (or a reward model does this automatically). Google's QT-Opt grasping system (Kalashnikov et al., 2018) is the canonical example of this bootstrapping loop: scripted policies seeded an initial dataset, a policy was trained, and that policy was redeployed across a fleet of arms to collect over half a million real-world grasp attempts.

Advantages: Massively scalable — a fleet of robots can collect data 24/7 without human operators. The data distribution matches the policy's actual execution, which can reduce distribution shift during training.

Disadvantages: Requires an already-functional policy (chicken-and-egg problem). Autonomous data is often lower quality than human demonstrations. Failure cases must be filtered, which requires labeling effort. Self-reinforcing biases can emerge if the policy only collects data in states it already handles well.

Collection methods at a glance:

  • Kinesthetic (physical guidance) — human guides the robot by hand. High quality, slow throughput: ~10–15 episodes/hour. Best for contact-rich, force-sensitive tasks.
  • Leader-follower (puppet control, ALOHA) — mirror arms replicate operator motion; natural bimanual control. ~15–25 episodes/hour. Best for two-arm coordination tasks.
  • VR teleoperation (remote immersive control) — headset and controllers specify the end-effector pose; remote capable. ~15–20 episodes/hour. Best for single-arm 6-DoF manipulation.
  • Autonomous (policy self-play) — a trained policy collects its own data; scalable to robot fleets but requires filtering. Best for scaling after an initial policy exists.

[Interactive figure: Data Collection Pipeline — from teleoperation to training; stages shown for VR teleoperation, where the operator controls the robot remotely via headset.]

Data Formats and Standards

RLDS format

The Reinforcement Learning Datasets (RLDS) format, built on TensorFlow Datasets (TFDS), has emerged as the de facto standard for robot learning data. It provides a hierarchical structure:

Dataset → [Episode0, Episode1, ..., EpisodeN]
Episode → [Step0, Step1, ..., StepT]
Step → {observation, action, reward, is_terminal, is_first, is_last, language_instruction}

Each observation is a dictionary that can contain:

  • image — RGB image from the primary camera (typically 256×256 or 224×224)
  • wrist_image — RGB image from the wrist-mounted camera
  • state — proprioceptive state (joint positions, velocities, gripper state)

Each action is a 7-dimensional vector: [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]. This standardized representation allows models trained on one dataset to be fine-tuned or evaluated on others, provided the action semantics are compatible.

RLDS leverages the TFDS infrastructure for efficient storage (TFRecord files), lazy loading, sharding for distributed training, and versioning. Datasets are registered with a builder class that defines the data schema, enabling tfds.load('bridge_dataset') to retrieve and parse the data with a single function call.
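To make the hierarchy concrete, here is a mock episode laid out as plain Python dictionaries — a schematic of the RLDS step schema with placeholder values, not real dataset contents (in practice these structures come back from tfds.load as TensorFlow datasets):

```python
# One episode: a list of steps, each matching the RLDS step schema.
# Image arrays are stood in for by descriptive strings to keep the
# sketch dependency-free; real data holds uint8 arrays.
episode = {
    "steps": [
        {
            "observation": {
                "image": "256x256x3 RGB array",
                "wrist_image": "256x256x3 RGB array",
                "state": [0.1, -0.4, 0.3, 0.0, 1.2, 0.0, 0.04],  # proprioception
            },
            "action": [0.01, 0.0, -0.02, 0.0, 0.0, 0.0, 1.0],  # 7-D EE delta
            "reward": 0.0,
            "is_first": True,
            "is_last": False,
            "is_terminal": False,
            "language_instruction": "pick up the blue cup",
        },
        # ... more steps ...
    ]
}

def iterate_transitions(episode):
    """Yield (observation, action) pairs for behavior cloning."""
    for step in episode["steps"]:
        yield step["observation"], step["action"]

pairs = list(iterate_transitions(episode))
```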

Action normalization

Raw action values vary enormously across datasets. A WidowX arm operating in a small workspace might have Δx values in the range [−0.02, 0.02] meters, while a mobile manipulator base might have [−0.5, 0.5] meters. Without normalization, a model trained on mixed data would be dominated by the dataset with the largest action magnitudes.

The standard approach is per-dataset normalization: for each constituent dataset in the mixture, compute the mean and standard deviation of each action dimension across all timesteps, then normalize to zero mean and unit variance:

âᵢ = (aᵢ − μᵢ) / σᵢ

At inference time, predicted actions are denormalized using the statistics of the target robot's dataset. This simple scheme works surprisingly well in practice, though it assumes that the action distributions are approximately Gaussian — an assumption that is violated for gripper commands (which are typically binary: open or close) and for actions near workspace boundaries.

An alternative is min-max normalization to a fixed range like [−1, 1]:

âᵢ = 2 × (aᵢ − minᵢ) / (maxᵢ − minᵢ) − 1

Octo (Ghosh et al., 2024) uses this approach, normalizing actions per-dataset to [−1, 1] and treating the gripper dimension separately (binarized to 0 or 1). The choice between z-score and min-max normalization is empirical; both work, but consistency across the training pipeline is more important than the specific method.
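Both schemes, with the gripper dimension handled separately, can be sketched in pure Python — an illustration; a real pipeline would compute these statistics once over the full dataset, typically with NumPy:

```python
from statistics import mean, stdev

def zscore_normalize(values):
    """Per-dimension z-score normalization: (a - mu) / sigma."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values], (mu, sigma)

def minmax_normalize(values):
    """Per-dimension min-max normalization to [-1, 1]."""
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values], (lo, hi)

def denormalize_zscore(values, stats):
    """Invert z-score normalization at inference time."""
    mu, sigma = stats
    return [v * sigma + mu for v in values]

# Example: one action dimension (delta-x, in meters) across a dataset.
dx = [0.01, -0.02, 0.005, 0.03, -0.015]
normed, stats = zscore_normalize(dx)
recovered = denormalize_zscore(normed, stats)

# Gripper commands are binary and should be binarized, not normalized:
gripper = [0.9, 0.1, 1.0, 0.0]
binarized = [1.0 if g > 0.5 else 0.0 for g in gripper]
```

At inference time, the denormalization step uses the statistics of the target robot's dataset, exactly as described above.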

ℹ Action tokenization vs. continuous actions

RT-2 takes a different approach entirely: it tokenizes continuous actions into discrete bins. Each action dimension is quantized into 256 bins, and the model predicts action tokens just like language tokens. This eliminates the normalization problem (bins are robot-specific) but introduces quantization error. For 7 action dimensions with 256 bins each, the position resolution is (action range) / 256 per dimension — about 0.08 mm for a typical ±10 mm (20 mm total) per-step delta range.
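The binning scheme can be sketched as follows — a simplified illustration of the discretize/reconstruct round trip, with made-up per-dimension ranges; the real RT-2 implementation additionally maps bin indices into the language model's token vocabulary:

```python
def tokenize_action(action, low, high, n_bins=256):
    """Quantize each continuous action dimension into an integer bin."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)           # clip to the valid range
        frac = (a - lo) / (hi - lo)       # position in [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens

def detokenize_action(tokens, low, high, n_bins=256):
    """Map bin indices back to bin-center continuous values."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]

# Hypothetical per-dimension ranges for a 7-D delta action
# (meters for translation, radians for rotation, 0-1 for gripper).
low = [-0.01] * 3 + [-0.1] * 3 + [0.0]
high = [0.01] * 3 + [0.1] * 3 + [1.0]

action = [0.004, -0.002, 0.0, 0.05, 0.0, -0.03, 1.0]
tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
# Quantization error per dimension is at most half a bin width.
```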

Cross-Embodiment Generalization

The most ambitious promise of large-scale robot datasets is cross-embodiment transfer: training a single policy on data from many different robot types, and having that policy generalize to robots it has never seen — or at least transfer useful representations that accelerate fine-tuning on a new platform.

This idea seems implausible at first. A Franka Panda has 7 joints with specific kinematics, torque limits, and dynamics. A WidowX 250 has 6 joints with completely different dimensions. A mobile manipulator has a base that can drive around. How can data from these diverse platforms help each other?

The answer lies in the abstraction layers. At the level of raw joint commands, robot data is incompatible. But at the level of end-effector motion in Cartesian space, the data becomes comparable. "Move the gripper 5cm to the right and 3cm down" is a meaningful instruction regardless of whether the robot has 6 joints or 7, whether it is a Franka or a WidowX. This is why the 7-DoF end-effector delta action space is the standard: it provides a robot-agnostic description of manipulation behavior.

The RT-X experiments (part of the Open X-Embodiment project) demonstrated cross-embodiment transfer concretely. The key findings:

  • RT-1-X (an RT-1 architecture trained on the OXE mixture) showed a 50% improvement in success rate compared to the original RT-1 trained only on Google Robot data, when evaluated on non-Google robots.
  • RT-2-X (an RT-2 architecture trained on the OXE mixture) showed a 3x improvement in emergent skill evaluation — the ability to perform tasks that were not well-represented in any single constituent dataset but appeared when diverse datasets were combined.
  • Not all cross-embodiment transfer is positive. For some robot-task pairs, adding out-of-domain data hurt performance. Careful dataset mixing was necessary.

[Interactive figure: Cross-Embodiment Architecture — data from multiple robot embodiments flows into a shared vision-language backbone, producing a unified policy.]

The mechanism of cross-embodiment transfer is still not fully understood. Several hypotheses:

  • Visual feature sharing. All robots manipulate objects in similar environments. The visual encoder learns object recognition, spatial relationships, and affordance estimation that transfer across embodiments because the visual world is shared.
  • Language grounding. The instruction "pick up the red block" means the same thing regardless of the robot. The language-conditioned policy learns task semantics that are embodiment-independent.
  • Motion primitive transfer. Reaching, grasping, and placing share geometric structure across embodiments when expressed in end-effector coordinates. A "reach forward and down" trajectory has similar structure on any arm.
💡 The analogy to multilingual NLP

Cross-embodiment transfer in robotics is analogous to multilingual training in NLP. Just as training a language model on English, French, and German produces better performance on all three languages (via shared syntactic and semantic representations), training a robot policy on Franka, WidowX, and xArm data can improve performance on each platform. The "languages" are different action spaces; the shared structure is the manipulation task itself.

Data Scaling Laws

In language modeling, scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) provide power-law relationships between dataset size, model size, compute, and loss. These laws have guided billion-dollar training decisions: if you know the power-law exponent, you can predict how much data you need for a target performance level.

Do similar scaling laws hold for robot learning? The evidence is limited but suggestive. The RT-1 paper (Brohan et al., 2022) conducted scaling experiments by training on subsets of their 130K-episode dataset and measuring task success rate:

| Training Episodes | Seen Tasks Success | Unseen Tasks Success |
|---|---|---|
| 10K | ~55% | ~15% |
| 30K | ~72% | ~24% |
| 60K | ~82% | ~32% |
| 100K | ~90% | ~40% |
| 130K (full) | ~97% | ~76% |

Several observations from these and related experiments:

  • Performance on seen tasks follows a roughly logarithmic curve: each doubling of data yields a similar absolute gain, so per-episode returns diminish. Going from 10K to 30K (3x) yielded +17 points; going from 60K to 130K (~2x) yielded +15 points.
  • Generalization to unseen tasks shows a different pattern: it appears to improve faster at higher data scales. The jump from 100K to 130K produced a disproportionate improvement in unseen task success (+36 points), suggesting that generalization may require a critical mass of diversity before it kicks in.
  • Data diversity matters as much as quantity. 100K episodes of the same 10 tasks yields less generalization than 50K episodes spanning 100 tasks. The RT-1 scaling experiments controlled for this by maintaining the task distribution while varying total episode count.

[Interactive figure: Data Scaling Laws — task success rate vs. dataset size, with separate curves for seen and unseen tasks; generalization scales differently from seen-task performance.]

The absence of well-characterized scaling laws for robotics is a significant gap. In NLP, the Chinchilla scaling laws tell you exactly how to allocate your compute budget between model size and data size. In robotics, we do not know the optimal ratio. Preliminary evidence suggests that robot learning is more data-hungry relative to model size than language modeling, because each episode provides far less independent information than a random web page.

ℹ Data efficiency through pretraining

VLA models achieve much better data efficiency than training from scratch because they inherit visual and language representations from internet-scale pretraining. RT-2's vision-language backbone was pretrained on billions of image-text pairs; only the action prediction needed to be learned from robot data. This is why a 55B-parameter model can learn useful manipulation policies from "only" 130K robot episodes — the robot data is fine-tuning a model that already understands objects, spatial relationships, and language.

Data Augmentation

Given the scarcity of robot data, augmentation is essential. But robotics imposes constraints on which augmentations are valid. In image classification, random horizontal flips are standard — a cat flipped horizontally is still a cat. In robotics, a horizontally flipped image shows the robot arm on the wrong side, with objects in mirrored positions. The corresponding action should also be mirrored, which requires careful handling.

Augmentations that are safe for robot learning:

  • Random crops. Cropping a 256×256 image to a random 224×224 region simulates small camera shifts. This is the single most effective augmentation for robotic manipulation (consistently improving generalization across many papers). The crop offset is small enough that the spatial relationship between the robot and objects is approximately preserved.
  • Color jitter. Randomly adjusting brightness, contrast, saturation, and hue simulates varying lighting conditions. This is especially valuable when the training data was collected in a single lab with fixed lighting.
  • Gaussian noise. Adding pixel-level noise improves robustness to sensor noise and compression artifacts.
  • Random erasing / cutout. Randomly masking rectangular patches forces the model to not rely on any single spatial region, improving robustness to partial occlusion.

Augmentations that are dangerous or require special handling:

  • Geometric transforms (rotation, scaling, perspective warp). These change the spatial relationship between the camera and the scene. A 30-degree rotation makes "move right" in the original image correspond to a different direction in the rotated image. If used, the action labels must be transformed correspondingly.
  • Horizontal/vertical flips. Flipping reverses left-right or up-down semantics. For a single-arm robot, a horizontal flip requires negating the y-component of translational actions. This can work but requires careful implementation.
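For instance, a horizontal flip about the camera's vertical axis (assumed here to align with the robot's y axis) must mirror the action as well — a sketch; the exact sign pattern depends on the camera-robot calibration and should be verified per setup:

```python
def horizontal_flip_action(action):
    """Mirror a 7-D delta action [dx, dy, dz, droll, dpitch, dyaw, grip]
    across the x-z plane (y -> -y).

    Reflecting the workspace negates the y translation; for small
    rotation deltas, roll (about x) and yaw (about z) also flip sign,
    while pitch (about y) is preserved. Which axes flip depends on how
    the camera is mounted relative to the robot -- check your own setup.
    """
    dx, dy, dz, droll, dpitch, dyaw, grip = action
    return [dx, -dy, dz, -droll, dpitch, -dyaw, grip]

def flip_image_rows(image):
    """Horizontally flip an image stored as nested lists [H][W]."""
    return [list(reversed(row)) for row in image]
```

The observation and the action must always be transformed with the same convention; flipping one without the other silently corrupts the training signal.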

Action perturbation is a complementary augmentation applied in action space rather than observation space. Small Gaussian noise is added to recorded actions:

â = a + ε,    ε ∼ N(0, σ²I)

This serves two purposes: it smooths the action distribution (reducing overfitting to the exact trajectories in the dataset) and it provides implicit regularization similar to label smoothing in classification. Typical values are σ = 0.01–0.05 in normalized action space. Too much noise degrades the trajectory quality; too little provides no benefit.
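The perturbation itself is nearly a one-liner — a minimal sketch with σ = 0.02 in normalized action space, leaving the binary gripper dimension untouched:

```python
import random

def perturb_action(action, sigma=0.02, rng=random):
    """Add Gaussian noise to the continuous action dimensions.

    The last dimension is the binary gripper command and is left
    unchanged: adding noise to a 0/1 signal would corrupt its semantics.
    """
    noisy = [a + rng.gauss(0.0, sigma) for a in action[:-1]]
    noisy.append(action[-1])
    return noisy

random.seed(42)
action = [0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0]
augmented = perturb_action(action)
```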

Sim-to-Real Data

Simulation offers the tantalizing promise of unlimited data. In a simulator like MuJoCo, Isaac Sim, or PyBullet, a robot can attempt thousands of grasps per hour, with automatic success/failure labels, perfect state information, and no risk of hardware damage. If sim data could substitute for real data, the data bottleneck would dissolve overnight.

The reality is more nuanced. Simulation data can help, but the sim-to-real gap limits its direct utility:

  • Visual gap. Simulated images look different from real images. Even photorealistic renderers (e.g., Omniverse) produce subtle artifacts in reflections, shadows, and material textures, and they lack the sensor noise and compression artifacts that real cameras introduce. Policies trained purely on sim images often fail when deployed with real cameras.
  • Physics gap. Simulated contact dynamics are approximations. Friction, deformable objects, fluid dynamics, and soft contacts are all simplified in simulation. A policy that learns to grasp objects in MuJoCo may apply forces that are too high or too low on real hardware.
  • Embodiment gap. The simulated robot is a mathematical model; the real robot has backlash in its gears, cable routing that affects dynamics, and sensors with calibration drift.

Domain randomization

Domain randomization (Tobin et al., 2017) is the most widely used technique for bridging the visual sim-to-real gap. The idea is simple: if you randomize the visual appearance of the simulation extensively enough, the real world becomes just another variation that the policy has implicitly trained on.

Parameters typically randomized:

  • Object textures, colors, and materials
  • Table and background textures
  • Lighting direction, intensity, color temperature, number of lights
  • Camera position, orientation, field of view, and distortion
  • Distracting objects (random shapes placed in the scene)
  • Robot link colors and textures
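
A domain randomization loop amounts to resampling parameters like these at the start of every episode. The sketch below is simulator-agnostic pure Python; the parameter names and ranges are illustrative, and a real pipeline would read them from a config and apply them via the simulator's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_visual_randomization():
    """Sample one set of visual parameters for a new episode.

    Ranges are illustrative; real pipelines tune them per setup.
    """
    return {
        'object_rgb': rng.uniform(0.0, 1.0, size=3),
        'table_texture_id': int(rng.integers(0, 500)),   # texture bank index
        'light_intensity': float(rng.uniform(0.3, 2.0)),
        'light_direction': rng.normal(size=3),
        'camera_jitter_m': rng.uniform(-0.02, 0.02, size=3),
        'camera_fov_deg': float(rng.uniform(40.0, 70.0)),
        'num_distractors': int(rng.integers(0, 5)),
    }

# One new appearance per episode: the policy never sees the same scene twice
for episode in range(3):
    params = sample_visual_randomization()
    # sim.reset(**params)  # hypothetical call to apply params to the simulator
    print(episode, params['camera_fov_deg'])
```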

Domain randomization has been remarkably effective for transferring policies that primarily rely on geometric cues (object positions, shapes) rather than fine-grained appearance features. The seminal work by OpenAI on the Rubik's cube (Akkaya et al., 2019) demonstrated that a dexterous manipulation policy trained entirely in simulation with massive domain randomization could transfer to a real Shadow Hand.

Physics randomization extends the idea to dynamics: randomizing friction coefficients, object masses, joint damping, and motor gains during training. This produces policies that are robust to the specific physics parameters of the real system, which are never known exactly.
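
Physics randomization follows the same per-episode resampling pattern, but in dynamics space. A minimal sketch (parameter names and ranges are illustrative; in practice they are centered on values from system identification):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_physics_randomization():
    """Sample dynamics parameters around nominal values for one episode."""
    return {
        'friction': float(rng.uniform(0.5, 1.5)),           # contact friction coeff
        'object_mass_scale': float(rng.uniform(0.8, 1.2)),  # x nominal mass
        'joint_damping_scale': float(rng.uniform(0.7, 1.3)),
        'motor_gain_scale': float(rng.uniform(0.9, 1.1)),
        'action_latency_steps': int(rng.integers(0, 3)),    # simulated control delay
    }

params = sample_physics_randomization()
print(params)
```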

💡 Sim-to-real in the VLA era

For VLA models that use pretrained vision encoders (CLIP, SigLIP, DINOv2), the visual sim-to-real gap may be less important. These encoders were trained on billions of real images and have learned to extract semantic features that are robust to visual domain shifts. A VLA model might handle simulated images reasonably well because its vision encoder maps both real and simulated images to similar feature representations — the encoder has never seen the sim images before, but it recognizes the objects and spatial layout. This is an active research direction.

Despite these advances, most state-of-the-art VLA models (RT-2, Octo, OpenVLA) are trained primarily or exclusively on real robot data. Simulation data is used as a supplement, not a replacement. The field's bet is that scaling real data collection (through efforts like OXE and DROID) is more productive than trying to close the sim-to-real gap completely.

Code Examples

Loading RLDS data with TensorFlow Datasets

python
import tensorflow_datasets as tfds
import tensorflow as tf

# Load Bridge V2 dataset in RLDS format
dataset = tfds.load(
    'bridge_dataset',
    split='train',
    data_dir='/path/to/data',
    shuffle_files=True,
)

# Each element is an episode (a sequence of steps)
for episode in dataset.take(1):
    steps = episode['steps']
    for i, step in enumerate(steps):
        obs = step['observation']
        image = obs['image']           # shape: (256, 256, 3), uint8
        state = obs['state']           # shape: (7,), float32 (joint positions)
        action = step['action']        # shape: (7,), float32 (EE deltas)
        lang = step['language_instruction']  # bytes string

        if i == 0:
            print(f"Image shape:  {image.shape}")
            print(f"State shape:  {state.shape}")
            print(f"Action shape: {action.shape}")
            print(f"Instruction:  {lang.numpy().decode()}")
            print(f"Episode length: {len(list(steps))}")
            break

Per-dataset action normalization

python
import numpy as np

class ActionNormalizer:
    """Normalize actions per-dataset using z-score or min-max."""

    def __init__(self, method='z_score'):
        self.method = method
        self.stats = {}

    def fit(self, dataset_name: str, actions: np.ndarray):
        """Compute normalization statistics from training data.

        Args:
            dataset_name: identifier for the dataset
            actions: array of shape (N, action_dim)
        """
        if self.method == 'z_score':
            self.stats[dataset_name] = {
                'mean': actions.mean(axis=0),
                'std': actions.std(axis=0) + 1e-8,  # avoid division by zero
            }
        elif self.method == 'min_max':
            self.stats[dataset_name] = {
                'min': actions.min(axis=0),
                'max': actions.max(axis=0),
            }

    def normalize(self, dataset_name: str, actions: np.ndarray) -> np.ndarray:
        """Normalize actions to standard range."""
        s = self.stats[dataset_name]
        if self.method == 'z_score':
            return (actions - s['mean']) / s['std']
        elif self.method == 'min_max':
            return 2.0 * (actions - s['min']) / (s['max'] - s['min'] + 1e-8) - 1.0

    def denormalize(self, dataset_name: str, actions: np.ndarray) -> np.ndarray:
        """Convert normalized actions back to original scale (for inference)."""
        s = self.stats[dataset_name]
        if self.method == 'z_score':
            return actions * s['std'] + s['mean']
        elif self.method == 'min_max':
            return (actions + 1.0) / 2.0 * (s['max'] - s['min'] + 1e-8) + s['min']

# Example usage
normalizer = ActionNormalizer(method='z_score')

# Fit on training data from two datasets
bridge_actions = np.random.randn(60000, 7) * 0.02   # small workspace
droid_actions = np.random.randn(76000, 7) * 0.05     # larger workspace

normalizer.fit('bridge_v2', bridge_actions)
normalizer.fit('droid', droid_actions)

# Normalize during training
normalized = normalizer.normalize('bridge_v2', bridge_actions[:10])
print(f"Normalized mean: {normalized.mean(axis=0)}")  # ~0
print(f"Normalized std:  {normalized.std(axis=0)}")    # ~1

# Denormalize during inference
recovered = normalizer.denormalize('bridge_v2', normalized)
print(f"Max recovery error: {np.abs(recovered - bridge_actions[:10]).max():.2e}")

Data augmentation pipeline for robot learning

python
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF
import numpy as np

class RobotDataAugmentation:
    """Augmentation pipeline designed for robot manipulation data.

    Key principles:
    - Random crops: safe, most effective single augmentation
    - Color jitter: safe, simulates lighting variation
    - NO geometric transforms (rotation, flip): these change
      the spatial semantics and require action relabeling
    """

    def __init__(
        self,
        input_size: int = 256,
        crop_size: int = 224,
        color_jitter: bool = True,
        gaussian_noise_std: float = 0.01,
        random_erasing_prob: float = 0.1,
    ):
        assert crop_size <= input_size, "crop must fit inside the input image"
        transforms = [
            T.RandomCrop(crop_size),
        ]
        if color_jitter:
            transforms.append(
                T.ColorJitter(
                    brightness=0.3,
                    contrast=0.3,
                    saturation=0.2,
                    hue=0.05,  # small hue shift to preserve color semantics
                )
            )
        transforms.extend([
            T.ToTensor(),
            T.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225]),
        ])
        if random_erasing_prob > 0:
            transforms.append(
                T.RandomErasing(p=random_erasing_prob, scale=(0.02, 0.1))
            )
        self.transform = T.Compose(transforms)

    def __call__(self, image):
        """Apply augmentations to a PIL Image or numpy array."""
        return self.transform(image)

def augment_action(action: np.ndarray, noise_std: float = 0.02) -> np.ndarray:
    """Add small Gaussian noise to action labels.

    Args:
        action: shape (7,) — [dx, dy, dz, droll, dpitch, dyaw, gripper]
        noise_std: standard deviation of Gaussian noise

    Returns:
        Perturbed action. Gripper dimension is NOT perturbed (binary).
    """
    noise = np.random.normal(0, noise_std, size=action.shape)
    noise[-1] = 0.0  # don't perturb gripper (open/close is binary)
    return action + noise

# Example
aug = RobotDataAugmentation(input_size=256, crop_size=224)
from PIL import Image
dummy_img = Image.fromarray(np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8))
augmented = aug(dummy_img)
print(f"Augmented image shape: {augmented.shape}")  # (3, 224, 224)

action = np.array([0.01, -0.02, 0.005, 0.0, 0.1, -0.05, 1.0])
perturbed = augment_action(action)
print(f"Original gripper: {action[-1]}, Perturbed gripper: {perturbed[-1]}")  # unchanged

Building a cross-embodiment data loader

python
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from typing import Dict, List

class CrossEmbodimentLoader:
    """Load and mix data from multiple robot datasets with per-dataset
    normalization and configurable mixing ratios.

    This is a simplified version of the data pipeline used in
    Octo and RT-X experiments.
    """

    def __init__(
        self,
        dataset_configs: List[Dict],
        batch_size: int = 256,
        image_size: int = 224,
    ):
        """
        Args:
            dataset_configs: list of dicts, each with:
                - name: TFDS dataset name
                - weight: sampling weight (relative frequency)
                - action_stats: dict with 'mean' and 'std' arrays
            batch_size: total batch size (distributed across datasets)
            image_size: resize images to this size
        """
        self.configs = dataset_configs
        self.batch_size = batch_size
        self.image_size = image_size

        # Normalize weights to sum to 1
        total_weight = sum(c['weight'] for c in dataset_configs)
        for c in dataset_configs:
            c['weight'] /= total_weight

    def _process_episode(self, episode, config):
        """Extract observation-action pairs from an episode."""
        steps = episode['steps']
        for step in steps:
            image = tf.image.resize(
                step['observation']['image'],
                (self.image_size, self.image_size)
            )
            image = tf.cast(image, tf.float32) / 255.0

            # Per-dataset action normalization
            action = step['action']
            action = (action - config['action_stats']['mean']) / \
                     (config['action_stats']['std'] + 1e-8)

            yield {
                'image': image,
                'action': action,
                'language': step['language_instruction'],
                'dataset_id': config['name'],
            }

    def sample_batch(self):
        """Sample a mixed batch according to dataset weights.

        In practice, this uses tf.data.Dataset.sample_from_datasets()
        for efficient interleaving. Simplified here for clarity.
        """
        # Compute per-dataset batch sizes
        sizes = {}
        remaining = self.batch_size
        for i, config in enumerate(self.configs):
            if i == len(self.configs) - 1:
                sizes[config['name']] = remaining
            else:
                n = int(self.batch_size * config['weight'])
                sizes[config['name']] = n
                remaining -= n

        print("Batch composition:")
        for name, size in sizes.items():
            print(f"  {name}: {size} samples "
                  f"({size/self.batch_size*100:.1f}%)")
        return sizes

# Example configuration with illustrative per-dataset weights and action
# statistics (RT-X-style mixing; the numbers here are not the published ratios)
configs = [
    {
        'name': 'bridge_dataset',
        'weight': 0.25,
        'action_stats': {
            'mean': np.zeros(7),
            'std': np.array([0.02, 0.02, 0.02, 0.05, 0.05, 0.05, 0.5])
        }
    },
    {
        'name': 'fractal20220817_data',  # Google Robot (RT-1)
        'weight': 0.40,
        'action_stats': {
            'mean': np.zeros(7),
            'std': np.array([0.05, 0.05, 0.05, 0.1, 0.1, 0.1, 0.5])
        }
    },
    {
        'name': 'kuka',
        'weight': 0.15,
        'action_stats': {
            'mean': np.zeros(7),
            'std': np.array([0.03, 0.03, 0.03, 0.08, 0.08, 0.08, 0.5])
        }
    },
    {
        'name': 'taco_play',
        'weight': 0.20,
        'action_stats': {
            'mean': np.zeros(7),
            'std': np.array([0.04, 0.04, 0.04, 0.06, 0.06, 0.06, 0.5])
        }
    },
]

loader = CrossEmbodimentLoader(configs, batch_size=256)
loader.sample_batch()

References

Seminal papers and key works referenced in this article.

  1. Padalkar et al. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA, 2024. arXiv
  2. Khazatsky et al. "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset." RSS, 2024. arXiv
  3. Walke et al. "BridgeData V2: A Dataset for Robot Learning at Scale." CoRL, 2023. arXiv
  4. Zhao et al. "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." RSS, 2023. arXiv
  5. Tobin et al. "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." IROS, 2017. arXiv
  6. Akkaya et al. "Solving Rubik's Cube with a Robot Hand." arXiv, 2019.