Introduction

The trajectory from research prototype to real-world deployment is among the hardest transitions in all of engineering. For VLA models, this transition is especially challenging because the system must operate in an open world — encountering objects, environments, and situations that were never present in its training data — while making physical contact with that world at every timestep.

A language model that produces a bad sentence wastes the user's time. A self-driving system that makes a bad decision can cause a crash. A manipulation policy that fails can break objects, damage the robot, or injure a nearby human. The stakes of embodied AI are categorically different from those of purely digital AI, and the field is grappling with this reality as VLA models move from carefully instrumented research labs toward messier, more consequential real-world settings.

This article surveys the full landscape: robustness challenges that emerge in unstructured environments, safety frameworks for ensuring learned policies do not cause harm, frontier research in dexterous manipulation and humanoid robotics, the push toward generalist policies that work across tasks and embodiments, multi-robot systems that learn collectively, and the industry ecosystem investing billions of dollars in the belief that foundation models for physical action are within reach.

ℹ What this article covers

Real-world robustness and failure recovery, safety standards (ISO 10218, ISO/TS 15066), dexterous manipulation with multi-fingered hands, humanoid platforms (Figure, Optimus, 1X Neo), generalist policies and scaling laws, multi-robot fleet learning, mobile manipulation, world models, the data flywheel, and the industry landscape from Google DeepMind to Physical Intelligence. With code examples for safety constraints, multi-robot task allocation, and deployment inference pipelines.

Real-World Robustness

Lab environments are designed for reproducibility. Lighting is consistent. Objects come from a curated set. Cameras are calibrated and fixed. The table surface is uniform. These controlled conditions inflate success rates — a policy that achieves 95% in the lab may drop to 50% or lower when deployed in a kitchen with afternoon sunlight, cluttered countertops, and a child's toys scattered among the target objects.

Novel objects and scenes

The central challenge of open-world deployment is out-of-distribution generalization. Training datasets for robot manipulation — even large ones like Open X-Embodiment's 1M+ episodes — cover a vanishingly small fraction of the objects and configurations a deployed robot will encounter.

VLA models have an advantage here over classical pipelines: the pretrained vision-language backbone has seen billions of image-text pairs covering an enormous visual vocabulary. RT-2 demonstrated that this semantic grounding enables meaningful zero-shot generalization. When asked to "pick up the extinct animal," the model correctly identified and grasped a plastic dinosaur — an object category never seen during robot training, but present in the VLM's pretraining data.

However, semantic recognition is necessary but not sufficient. The robot must also infer viable grasp strategies for novel objects. A pretrained VLM knows what a whisk looks like but has no prior about whether to grasp it by the handle, the wires, or the loop at the top. This grasp-relevant geometric reasoning must be learned from manipulation experience or inferred from shape priors, and it remains a major bottleneck for deployment.

Environmental variation

Real environments vary along multiple axes that interact in complex ways:

Variation Type | Lab Condition | Real-World Condition | Impact on Policy
Lighting | Controlled overhead | Windows, shadows, flickering | Feature extraction shifts, depth noise
Camera pose | Fixed, calibrated | Bumped, slightly moved | Spatial predictions misaligned
Background | Uniform table | Cluttered, patterned | Object segmentation failures
Object state | Clean, undamaged | Dirty, worn, wet | Appearance distribution shift
Human presence | No humans | Hands reaching in, occluding | Must detect and respect humans
Dynamics | Known surfaces | Varied friction, compliance | Action effects differ from training

Data augmentation during training can mitigate some of these shifts. Color jitter, random cropping, and camera pose randomization in simulation are standard. But the augmentation must be calibrated carefully: too aggressive and the policy loses precision on nominal conditions; too conservative and it fails on the first environmental deviation it encounters.
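To make the calibration trade-off concrete, here is a minimal augmentation sketch in NumPy (the function name and default strengths are illustrative, not from any cited system): brightness jitter plus random crop-and-resize, with the two knobs that trade robustness against nominal-condition precision.

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator,
            brightness: float = 0.2, crop_frac: float = 0.9) -> np.ndarray:
    """Brightness jitter + random crop, resized back by nearest-neighbour.

    `brightness` and `crop_frac` are the knobs to calibrate: stronger jitter
    and smaller crops improve robustness to lighting and camera shifts, but
    cost precision on nominal scenes.
    """
    # Brightness jitter: scale all pixel values by a random factor.
    scale = 1.0 + rng.uniform(-brightness, brightness)
    out = np.clip(image.astype(np.float32) * scale, 0, 255)

    # Random crop to crop_frac of each side, then resize back (nearest).
    h, w = out.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    y0 = rng.integers(0, h - ch + 1)
    x0 = rng.integers(0, w - cw + 1)
    crop = out[y0:y0 + ch, x0:x0 + cw]
    ys = (np.arange(h) * ch / h).astype(int)
    xs = (np.arange(w) * cw / w).astype(int)
    return crop[ys][:, xs].astype(image.dtype)
```

In practice these parameters are tuned per deployment: too aggressive and in-distribution precision degrades; too conservative and the first lighting change breaks the policy.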

💡 The deployment gap is multiplicative

Each source of variation may reduce success rates by 5–15% independently. But in real deployments, multiple shifts occur simultaneously — new lighting and a moved camera and a cluttered background. The compounding effect can be dramatic. Brohan et al. (2023) reported that RT-2 maintained 62% success in a novel kitchen vs. 97% in its training kitchen, and noted that most failures involved combinations of distribution shifts rather than any single one.

Failure modes and recovery

In lab evaluations, a failed grasp ends the trial. In deployment, the robot must detect its own failure and recover from it. This requires capabilities that most current VLA models lack:

  • Failure detection. Did the grasp succeed? Is the object in the gripper? Force/torque sensors, gripper encoder readings, and visual verification can signal failure, but integrating these signals into the policy's decision-making is non-trivial. Inner Monologue (Huang et al., 2022) demonstrated using VLM-based visual verification to detect execution failures and trigger replanning.
  • Recovery strategies. After a failed grasp, the robot must decide whether to retry from the current state, reposition, or abort. Simple retry logic (attempt the same action again) works surprisingly often — many failures are stochastic — but repeated failures on the same object require strategy adaptation: approach from a different angle, use a different grasp type, or ask a human for help.
  • Graceful degradation. When a subtask fails irrecoverably, the system should not halt entirely. A household robot that cannot open a stuck jar should set it aside and continue with other tasks, reporting the failure. This requires task-level planning that can reason about partial completion and alternative plans.

SayCan (Ahn et al., 2022) and subsequent work have shown that LLM-based planners can provide this outer loop of failure detection and replanning, calling VLA primitives as skills and monitoring their success. The key insight is that the VLA policy handles low-level execution while a higher-level system handles task-level reasoning about what to do when execution goes wrong.
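A minimal sketch of this outer loop, with hypothetical `skill` and `verify` callables standing in for a VLA execution primitive and a VLM-based success check:

```python
from typing import Callable

def run_with_recovery(skill: Callable[[], bool],
                      verify: Callable[[], bool],
                      max_retries: int = 2) -> str:
    """Execute a skill, verify the outcome, and retry on failure.

    `skill` runs low-level execution (e.g. a VLA grasp primitive) and
    `verify` is a success check (gripper encoders, force sensing, or VLM
    visual verification). Returns 'success', or 'aborted' so a task-level
    planner can replan rather than looping forever.
    """
    for attempt in range(max_retries + 1):
        skill()
        if verify():
            return "success"
        # Stochastic failures often succeed on a plain retry; repeated
        # failures should escalate to the planner for strategy adaptation.
    return "aborted"
```

The division of labor mirrors the section above: the VLA policy lives inside `skill`, while retry limits and the "aborted" escalation path belong to the task-level planner.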

Safety for Robot Learning

Learned policies are fundamentally different from programmed motion sequences. A programmed robot follows a deterministic trajectory that can be verified offline; a learned policy generates actions from neural network inference, making its behavior difficult to predict or formally verify. This creates genuine safety challenges that the field must address before widespread deployment.

Safe exploration

During training — especially reinforcement learning — the robot must explore novel actions to discover effective strategies. Unconstrained exploration is dangerous: a policy optimizing for a reward signal has no intrinsic understanding that slamming an arm into a table at maximum velocity is undesirable unless the reward function explicitly penalizes it.

Several approaches constrain exploration to a safe set:

  • Constrained policy optimization. Add hard constraints to the optimization problem: maximum joint velocities, workspace boundaries, force limits. The policy can explore freely within this constrained action space but cannot violate safety limits regardless of what the reward gradient suggests. Methods like Constrained Policy Optimization (CPO; Achiam et al., 2017) formalize this as a constrained MDP.
  • Safety layers. A separate safety module sits between the policy output and the robot actuators. If the proposed action would violate a constraint (enter a forbidden workspace region, exceed a force threshold), the safety layer projects the action to the nearest safe action. This is computationally simple and provides hard guarantees.
  • Sim-to-real with safety verification. Train in simulation where safety violations have no real-world consequences, then verify the policy's behavior in a constrained real-world test before full deployment. Domain randomization ensures the simulated policy has seen worse conditions than reality.
  • Human-in-the-loop. A human operator monitors the robot during early deployment and can intervene via an emergency stop. DAgger-style approaches use these interventions as corrective demonstrations, simultaneously improving safety and performance.

Safety-constrained policy optimization:

max_π  E_{τ∼π}[R(τ)]    subject to    C_i(π) ≤ d_i,    i = 1, …, k

where C_i are cost functions (e.g., collision force, workspace violation) and d_i are safety thresholds.
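A safety layer of the kind described above can be sketched in a few lines (the workspace box, velocity limit, and function name are illustrative, not from any cited system): it runs after the policy, so its constraints hold regardless of what the network outputs.

```python
import numpy as np

def safety_layer(action: np.ndarray, pos: np.ndarray,
                 ws_min: np.ndarray, ws_max: np.ndarray,
                 v_max: float = 0.25) -> np.ndarray:
    """Project a proposed Cartesian velocity command into the safe set.

    Clips speed to v_max (m/s) and zeroes any velocity component that
    would push the end-effector further outside the workspace box.
    """
    a = action.astype(float).copy()
    speed = np.linalg.norm(a)
    if speed > v_max:                      # hard speed limit
        a *= v_max / speed
    # Zero components pointing further outside the workspace box.
    a[(pos <= ws_min) & (a < 0)] = 0.0
    a[(pos >= ws_max) & (a > 0)] = 0.0
    return a
```

Because the projection is a few vector operations, it adds negligible latency to the control loop while providing a hard guarantee the learned policy alone cannot.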
[Interactive figure: Safety Envelope — workspace limits and force constraints. The safety layer constrains end-effector motion to the safe workspace and limits applied force; red zones mark forbidden regions.]

Standards and compliance

Industrial robot safety is governed by well-established standards, but learned policies challenge the assumptions underlying these frameworks:

  • ISO 10218-1/2 (2011). The primary international standard for industrial robot safety. Defines safety-rated monitored stop, hand guiding, speed and separation monitoring, and power and force limiting as collaborative operation modes. Originally designed for programmed robots with predictable trajectories, not learned policies with stochastic behavior.
  • ISO/TS 15066 (2016). Technical specification for collaborative robot systems. Defines maximum permissible force and pressure for different body regions during transient and quasi-static contact. For example, transient contact force on the hand is limited to 280 N, and quasi-static pressure to 360 N/cm². These limits apply regardless of whether actions come from a programmed trajectory or a neural network.
  • Collision detection. Modern collaborative robots (cobots) include joint torque sensors that detect unexpected contact. When external torque exceeds a threshold, the robot stops within milliseconds. For learned policies, the challenge is distinguishing intentional contact (grasping an object, pushing a button) from unintended collision (hitting a human's arm). Current approaches use a combination of expected vs. actual force monitoring and visual human detection.

Body Region | Max Force (Transient) | Max Force (Quasi-static) | Max Pressure (Quasi-static)
Skull / forehead | 130 N | 65 N | 110 N/cm²
Face | 65 N | 45 N | 110 N/cm²
Chest | 140 N | 70 N | 170 N/cm²
Abdomen | 110 N | 55 N | 140 N/cm²
Hand / fingers | 280 N | 140 N | 360 N/cm²
Lower legs | 210 N | 105 N | 250 N/cm²
Source: ISO/TS 15066:2016, Table A.2. Maximum permissible biomechanical limits for collaborative robot systems.
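As a sketch, the transient force limits from the table above can be encoded as a deployment-side check (the safety margin and function name are assumptions; a real system must also monitor pressure and quasi-static contact):

```python
# Transient-contact force limits (N) from ISO/TS 15066:2016, Table A.2.
TRANSIENT_FORCE_LIMIT_N = {
    "skull": 130.0, "face": 65.0, "chest": 140.0,
    "abdomen": 110.0, "hand": 280.0, "lower_leg": 210.0,
}

def contact_allowed(region: str, measured_force_n: float,
                    margin: float = 0.8) -> bool:
    """True if a transient contact force is within `margin` of the limit.

    A conservative deployment checks against margin * limit rather than
    the limit itself, leaving headroom for sensor error; pressure limits
    (N/cm²) would need a separate check against contact area.
    """
    limit = TRANSIENT_FORCE_LIMIT_N[region]
    return measured_force_n <= margin * limit
```

Encoding the limits as data rather than logic makes it straightforward to audit the deployed thresholds against the standard.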

Ethical considerations

Beyond physical safety, deploying learned robotic systems raises ethical questions that the field is only beginning to address:

  • Labor displacement. Robots performing household tasks, warehouse logistics, or food service directly compete with human workers. The economic benefits of automation are unevenly distributed, and deployment decisions should consider workforce impacts.
  • Bias in training data. If training demonstrations come primarily from certain environments (e.g., Western-style kitchens), deployed policies may perform poorly in other contexts. This is analogous to bias in language models but with physical consequences — a robot that cannot handle chopsticks because it was only trained with forks is not merely culturally insensitive; it is functionally deficient for a large fraction of the global population.
  • Autonomy and accountability. When a learned policy causes damage, who is responsible? The deployer, the model developer, the data collectors, or the robot itself? Current legal frameworks assume programmed, deterministic behavior. Learned policies that produce emergent, hard-to-predict actions challenge existing liability structures.
  • Surveillance potential. Robots with cameras and microphones in homes, hospitals, and workplaces create surveillance risks. Data collected for policy improvement (recording failures to learn from them) may capture sensitive information about occupants.

Dexterous Manipulation

Most VLA research uses parallel-jaw grippers — simple two-finger devices with a single degree of freedom (open/close). This dramatically simplifies the action space but limits the range of manipulations the robot can perform. You cannot turn a key, flip a pancake, or tie a knot with a parallel-jaw gripper.

Dexterous manipulation with multi-fingered hands is one of the most active and challenging frontiers in robotics. The action space explodes (a human hand has ~27 degrees of freedom), contact dynamics are extraordinarily complex, and the sim-to-real gap for contact-rich manipulation is particularly severe.

Multi-fingered hands

Several hardware platforms are driving dexterous manipulation research:

  • Allegro Hand (research platform). 16 DoF, 4 fingers, torque-controlled. The most widely used research hand, used in DexPoint, DexGraspNet, and numerous sim-to-real studies. Cost: ~$15k.
  • Shadow Dexterous Hand (research platform). 24 DoF, 5 fingers including an opposable thumb; tendon-driven actuation. Closest to human hand kinematics. Used in OpenAI's Rubik's cube work. Cost: ~$100k.
  • LEAP Hand (research platform). 16 DoF, 4 fingers, fully 3D-printable and open-source. Developed at CMU. Cost: ~$2k in parts, dramatically lowering the barrier to dexterous manipulation research.
  • Ability Hand (PSYONIC; production platform). 6 DoF, 5 fingers. Designed as a prosthetic but increasingly used in robotics. Compliant fingertips with touch sensing; robust to impacts.

The fundamental challenge of dexterous manipulation is the contact-rich dynamics. When a parallel-jaw gripper grasps an object, there are at most two contact points. A multi-fingered hand manipulating a pen may have 5–10 simultaneous contact points, each exerting forces and torques that interact through the object. Small changes in finger position produce discontinuous changes in contact state (a finger makes or breaks contact), making the dynamics non-smooth and difficult to model or simulate accurately.

Dexterous manipulation action space vs. parallel-jaw gripper:

Parallel jaw:  a ∈ ℝ⁷   (6-DoF pose + 1 gripper)
Allegro hand:  a ∈ ℝ²²  (6-DoF wrist + 16 finger joints)
Shadow hand:   a ∈ ℝ³⁰  (6-DoF wrist + 24 finger joints)

The number of possible contact configurations grows exponentially with the number of contacts, so planning and control complexity explodes with DoF.

Recent advances

Despite the difficulty, several recent works have demonstrated impressive dexterous manipulation:

  • DexPoint (Qin et al., 2023). Learns dexterous grasping from point cloud observations using RL in simulation. Transfers to real Allegro hand with minimal sim-to-real gap by operating on 3D point clouds rather than images, which are more invariant to visual domain shift.
  • DexGraspNet (Wang et al., 2023). A large-scale dataset of 1.32 million dexterous grasp poses across 5,355 objects, generated through analytical grasp optimization. Enables training grasp generation networks that produce diverse, stable grasps for novel objects.
  • General Flow (Yuan et al., 2024). Represents dexterous manipulation as flow fields — dense correspondences between current and goal configurations. Works across different hand morphologies (Allegro, LEAP, Shadow) because the flow representation is defined in 3D space rather than joint space.
  • AnyRotate (Chen et al., 2024). Achieves arbitrary axis in-hand rotation of diverse objects using RL with teacher-student distillation. Sim-to-real transfer with the Allegro hand, rotating objects including spheres, cylinders, and irregular shapes.
  • Rotate Without Seeing (Yin et al., 2023). In-hand rotation using only proprioceptive feedback (no vision), trained with massive domain randomization in simulation. Demonstrates that touch-based policies can be remarkably robust.
💡 The sim-to-real gap is worst for contact

Simulation engines (MuJoCo, Isaac Sim) model contact with simplified physics: point contacts, approximate friction models, rigid body assumptions. Real contact involves deformable surfaces, complex friction (stick-slip transitions, surface textures), and micro-geometry that simulators cannot capture. This makes the sim-to-real gap for dexterous manipulation significantly worse than for free-space motion. The most successful approaches either avoid relying on simulated contact accuracy (using domain randomization to make the policy robust to contact model errors) or operate on representations (point clouds, tactile features) that transfer more cleanly.

Humanoid Robots

The resurgence of humanoid robotics is driven by a simple argument: the world is designed for humans. Doorknobs, stairs, chairs, tools, vehicles — the built environment assumes a roughly human-shaped body with bipedal locomotion and two dexterous hands. A humanoid robot can operate in these environments without modification, while a wheeled robot with a single arm cannot climb stairs, and a quadruped cannot open doors.

This argument has gained force as VLA models demonstrate that learning-based control may finally make humanoids viable. The traditional approach — hand-engineering walking gaits, balance controllers, and manipulation planners — produced brittle systems that worked in controlled demos but failed in unstructured environments. Learning-based approaches promise more robust, adaptive behavior.

Current platforms

Platform | Company | DoF | Height | Hands | Status (2025)
Figure 02 | Figure AI | ~40 | 1.68 m | 16-DoF each | Pilot deployment at BMW
Optimus Gen 2 | Tesla | ~28 | 1.73 m | 11-DoF each | Factory testing at Tesla
Neo | 1X Technologies | ~32 | 1.65 m | Soft actuators | Pre-commercial pilots
Digit | Agility Robotics | ~30 | 1.75 m | Simple grippers | Deployed in Amazon warehouses
H1 / G1 | Unitree | 23–43 | 1.27–1.80 m | Dexterous (G1) | Commercial R&D platform
Atlas (Electric) | Boston Dynamics | ~28 | 1.50 m | Custom grippers | Hyundai factory pilots

VLA for full-body control

Applying VLA architectures to humanoid robots introduces challenges beyond tabletop manipulation:

  • Locomotion + manipulation. The policy must simultaneously control walking (maintaining balance, avoiding obstacles, planning footsteps) and manipulation (reaching, grasping, carrying). These were traditionally separate control problems solved by different subsystems. VLA approaches attempt to solve them jointly, conditioning whole-body behavior on language instructions like "walk to the kitchen and pick up the cup."
  • High-dimensional action spaces. A humanoid with two dexterous hands may have 50+ DoF. Generating 50-dimensional action vectors at 10–50Hz from a VLA model requires either very efficient inference or hierarchical decomposition (VLA for high-level commands, specialized controllers for low-level execution).
  • Balance as a constraint. Every manipulation action creates reaction forces that affect balance. Reaching for a heavy object shifts the center of mass. Pushing a door generates a reaction force. The policy must implicitly maintain the zero-moment-point (ZMP) within the support polygon while performing manipulation tasks.
  • Multi-camera perception. Humanoids typically have head-mounted cameras (for scene understanding and navigation) plus wrist cameras (for close-range manipulation). Fusing these views into a coherent spatial representation is an active research challenge.
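The quasi-static core of the balance constraint reduces to a geometric test: keep the ZMP (approximated here by the ground projection of the center of mass) inside the support polygon. A minimal sketch, assuming a convex polygon with counter-clockwise vertices:

```python
import numpy as np

def inside_support_polygon(zmp_xy: np.ndarray, polygon: np.ndarray) -> bool:
    """True if the zero-moment point lies inside a convex support polygon.

    `polygon` is an (N, 2) array of foot-contact vertices in counter-
    clockwise order; quasi-static balance requires the ZMP to stay inside.
    """
    n = len(polygon)
    for i in range(n):
        p, q = polygon[i], polygon[(i + 1) % n]
        # 2D cross product; negative means the point is right of a CCW edge.
        cross = (q[0] - p[0]) * (zmp_xy[1] - p[1]) - \
                (q[1] - p[1]) * (zmp_xy[0] - p[0])
        if cross < 0:
            return False
    return True
```

A whole-body controller evaluates a check like this (with the dynamic ZMP, not just the CoM projection) at every control tick, vetoing or re-targeting manipulation commands that would violate it.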

Figure AI has demonstrated some of the most advanced VLA-driven humanoid behavior, using a model trained on both language instructions and teleoperated demonstrations to perform multi-step tasks (sorting objects, handing items to humans) with natural-looking whole-body motion. Their approach uses a hierarchical architecture: a VLA backbone generates task-level action targets that are realized by a lower-level whole-body controller maintaining balance and joint limits.

ℹ Humanoid economics

The economic argument for humanoids hinges on versatility. A specialized robot arm costs $30–100k but only does one task. A humanoid at $30–50k (projected at scale by Tesla and 1X) that can perform dozens of tasks in human-designed environments has a much better cost-per-task ratio. The key question is whether VLA models can deliver the task breadth to justify the hardware complexity. Current humanoids can reliably perform 5–20 distinct tasks; economic viability likely requires 50–100+.

Generalist Policies

The ultimate vision of VLA research is a single model — a generalist policy — that can perform any manipulation task on any robot given a natural language instruction. This parallels the trajectory of language models: from task-specific models (one model per task) to general-purpose foundation models (one model for all language tasks).

We are early on this trajectory. Current VLA models generalize across tasks within an embodiment (RT-2 handles hundreds of tasks on the same robot) and are beginning to show cross-embodiment transfer (Octo works across multiple robot platforms). But we are far from a single model that can drive a humanoid, fly a drone, and operate a surgical robot.

[Interactive figure: Generalist Policy Capabilities — radar chart comparing VLA models across task categories; e.g., RT-2 shows strong semantic grounding but a single embodiment.]

Scaling laws for robot learning

Language models follow well-characterized scaling laws: performance improves predictably with compute, data, and parameters (Kaplan et al., 2020). Whether similar scaling laws hold for robot learning is an open and critical question.

Early evidence is encouraging but limited:

  • RT-2 scaling. Brohan et al. (2023) showed that scaling the VLM backbone from 12B to 55B parameters improved manipulation success rates by 5–15% across tasks, with the largest improvements on tasks requiring semantic reasoning (selecting the correct object from a description). Spatial precision improvements were smaller.
  • Data scaling. The Open X-Embodiment project showed that training on pooled data from 22 robot embodiments improved performance on most constituent datasets compared to training on each dataset individually. But the improvement was not uniform — some embodiments saw degradation, suggesting negative transfer from dissimilar platforms.
  • Emergent capabilities. Like language models, VLA models show emergent capabilities at scale that are absent in smaller models. RT-2-55B could perform rudimentary chain-of-thought reasoning about manipulation sequences ("to get the hidden object, first move the blocking object") that RT-2-12B could not.

Key scale figures: 55B — largest VLA (RT-2-X) · 1M+ — episodes (Open X-Embodiment) · 22 — robot embodiments pooled.

Zero-shot generalization

The most striking demonstrations of VLA generalization are zero-shot: performing tasks never seen during robot training, purely through the semantic understanding inherited from VLM pretraining.

RT-2 demonstrated several categories of zero-shot generalization:

  • Novel object categories. Grasping objects from categories absent in robot training data but present in web-scale pretraining (e.g., "pick up the Taylor Swift album").
  • Relational reasoning. Following instructions that require understanding relationships: "move the apple to the left of the orange," "stack the smaller block on the larger one." These spatial and comparative concepts were never explicitly supervised during robot training.
  • Abstract category transfer. Responding to prompts like "pick up something you can use to write" (selecting a pen from among distractors) or "pick up the healthiest food" (selecting a fruit over junk food). These require common-sense reasoning that only exists in the language model's knowledge.

However, zero-shot generalization has clear limits. The model can identify what to interact with but may not know how to interact with it. It can select the correct screwdriver but does not know how to use it to drive a screw — that manipulation strategy was never in its training data. Bridging from semantic understanding to motor skill remains the core challenge.

Multi-Robot Coordination

Scaling robot deployment from one arm to fleets of robots introduces coordination challenges but also a powerful advantage: collective experience. Every episode executed by any robot in the fleet generates data that can improve all robots.

Fleet learning

Google's pioneering work on fleet learning (Levine et al., 2018) demonstrated the concept with a fleet of 7–14 robot arms sharing a grasping policy. Key insights:

  • Linear data scaling. With N robots running in parallel, the fleet collects N× more experience per wall-clock hour. Google's grasping experiments collected 800,000 grasps over two months with 14 robots — equivalent to ~5 years of continuous single-robot operation.
  • Diversity through parallelism. Different robots encounter different failure modes (slightly different calibration, different object sets, different physical wear). This natural diversity acts as a form of data augmentation, producing more robust policies than single-robot collection.
  • Centralized training, distributed execution. Experience from all robots is aggregated centrally, a new policy is trained (or the existing one fine-tuned), and the updated policy is deployed to all robots. This train-deploy cycle can run continuously, with policies improving daily or even hourly.

Cooperative manipulation

Some tasks require multiple robots to work together: carrying a large object, assembling a structure, or performing a task that requires more than two hands. Multi-robot manipulation adds coordination challenges:

  • Shared object dynamics. When two robots grasp the same object, their actions are physically coupled through the object. Each robot's motion affects the forces experienced by the other. This requires either explicit coordination (a shared plan) or implicit coordination (reactive policies that respond to force feedback).
  • Communication. Robots must share information about their state, intentions, and observations. Low-latency communication is critical — a 100ms delay in force feedback during cooperative carrying can cause instability. Most current systems use wired Ethernet or dedicated wireless links.
  • Task allocation. In a fleet, which robot should do which task? This is a combinatorial optimization problem that must account for robot capabilities, current positions, battery levels, and task priorities. Centralized allocation is simpler but creates a single point of failure; distributed auction-based approaches are more robust.
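A minimal sketch of the auction idea (the cost model is a placeholder; real bids would combine travel distance, battery level, and capability match): tasks are awarded in order of their best available bid, so globally cheap assignments are locked in first.

```python
def auction_allocate(costs: dict[tuple[str, str], float],
                     robots: list[str], tasks: list[str]) -> dict[str, str]:
    """Greedy single-item auction: award each task to the cheapest free robot.

    `costs[(robot, task)]` is that robot's bid for the task. This is a
    centralized version; distributed variants have each robot compute and
    broadcast its own bids, removing the single point of failure.
    """
    assignment: dict[str, str] = {}
    free = set(robots)
    unassigned = set(tasks)
    while unassigned and free:
        # Lowest bid among all free robots and unassigned tasks wins.
        robot, task = min(
            ((r, t) for r in free for t in unassigned),
            key=lambda rt: costs[rt],
        )
        assignment[task] = robot
        free.discard(robot)
        unassigned.discard(task)
    return assignment
```

Greedy auctions are not optimal in general, but they are simple, fast, and degrade gracefully when robots drop out mid-auction.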
ℹ The fleet learning advantage compounds

Fleet learning creates a virtuous cycle: more robots → more data → better policy → higher success rates → more useful deployments → justification for more robots. This data flywheel is why companies like Google DeepMind, Physical Intelligence, and Covariant are investing heavily in fleet deployments, even when individual robots are not yet profitable. The long-term value is in the data and the compounding policy improvements.

Mobile Manipulation

Most VLA research assumes a stationary robot arm. Real-world deployment almost always requires mobile manipulation: a robot that can navigate to a location and then manipulate objects there. Cleaning a house, stocking shelves, delivering packages — these all require both locomotion and manipulation.

Mobile ALOHA (Fu et al., 2024) is the most prominent recent system combining learned mobile manipulation with VLA-style control. Built on a pair of ViperX arms mounted on a mobile base, it uses co-training — mixing a small number of teleoperated demonstrations of the target task with a large dataset of prior mobile manipulation data — to achieve impressive household tasks: wiping a stovetop, opening and closing cabinets, picking up fallen objects, and even cooking simple meals.

The key technical challenges in mobile manipulation are:

  • Coordinate frame management. As the base moves, the camera views and end-effector positions shift in the world frame. The policy must reason about goals in the world frame while executing in the robot frame, requiring accurate localization or learning to be robust to base position uncertainty.
  • Navigation-manipulation handoff. When should the robot stop navigating and start manipulating? This transition is often the most failure-prone phase. Approach too far and the arm cannot reach the target; too close and the arm is in a cramped configuration with limited dexterity.
  • Long-horizon planning. Household tasks involve sequences of navigation and manipulation: go to the fridge, open it, find the milk, grasp it, close the fridge, navigate to the counter, put the milk down. Each step has dependencies on the previous one and the robot must maintain context across minutes of execution.
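In the planar case, the coordinate-frame bookkeeping from the first bullet reduces to an SE(2) transform; a sketch (function name illustrative):

```python
import math

def world_to_robot(goal_xy: tuple[float, float],
                   base_xy: tuple[float, float],
                   base_yaw: float) -> tuple[float, float]:
    """Express a world-frame goal in the mobile base's frame (SE(2)).

    As the base moves, the same world-frame goal must be re-expressed in
    the robot frame before the arm can execute; localization error in
    (base_xy, base_yaw) propagates directly into manipulation error.
    """
    dx = goal_xy[0] - base_xy[0]
    dy = goal_xy[1] - base_xy[1]
    c, s = math.cos(base_yaw), math.sin(base_yaw)
    # Inverse rotation: the world-frame offset expressed in the base frame.
    return (c * dx + s * dy, -s * dx + c * dy)
```

A few centimeters of localization error or a degree of yaw drift shifts every transformed goal by a comparable amount at the gripper, which is why policies either need accurate localization or must be trained to tolerate base-pose uncertainty.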

The Path Forward

The field of VLA research is converging on a vision: foundation models that act — large, pretrained models that can be prompted with any task description and execute it physically, much as GPT-4 can be prompted with any text task and produce a reasonable response.

Several key developments are needed to realize this vision:

Integration with world models

Current VLA models are largely reactive: they map observations directly to actions without explicit internal models of how the world works. A robot pushing a cup has no internal prediction of what will happen if it pushes too hard (the cup falls) or at the wrong angle (the cup slides sideways). It simply acts and observes the result.

World models (Hafner et al., 2019; 2023) learn to predict the consequences of actions in a latent space. Integrating world models with VLA architectures would enable:

  • Mental simulation. Before executing an action, the robot could simulate its effects internally and choose the action with the best predicted outcome. This is model-predictive control with a learned model.
  • Planning without execution. The robot could plan multi-step sequences by rolling out its world model, only executing once it has a plausible plan. This is especially valuable for irreversible actions (pouring liquid, cutting food).
  • Efficient learning. A world model enables learning from imagined experience (Dreamer; Hafner et al., 2020), dramatically reducing the real-world data requirements that currently bottleneck VLA training.
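Mental simulation with a learned model is essentially model-predictive control; a random-shooting sketch, with stub `dynamics` and `reward` callables standing in for the learned world model and value estimate:

```python
import numpy as np

def mpc_random_shooting(dynamics, reward, state, horizon=5,
                        n_samples=64, act_dim=2, seed=0):
    """Choose an action by imagining rollouts in a learned world model.

    Samples n_samples random action sequences, rolls each out through
    `dynamics` entirely in imagination, scores predicted states with
    `reward`, and executes only the FIRST action of the best sequence,
    then replans at the next timestep.
    """
    rng = np.random.default_rng(seed)
    seqs = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, act_dim))
    returns = np.zeros(n_samples)
    for i, seq in enumerate(seqs):
        s = state
        for a in seq:          # rollout happens entirely in imagination
            s = dynamics(s, a)
            returns[i] += reward(s)
    return seqs[np.argmax(returns), 0]
```

Production planners replace random shooting with smarter samplers (e.g. the cross-entropy method), but the structure — imagine, score, execute one step, replan — is the same.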

The data flywheel

The most important near-term challenge for VLA research is data. Web-scale text and image data trained the language and vision backbones; but robot interaction data must be collected physically, one episode at a time. The entire Open X-Embodiment dataset (~1M episodes) represents roughly 1,000 hours of robot operation — a tiny fraction of the training data that makes language models powerful.

Three approaches are converging to address this:

  • Simulation at scale. Isaac Sim, MuJoCo MPC, and domain-randomized environments can generate millions of episodes per day. The sim-to-real gap remains, but it is narrowing as simulators improve and transfer techniques mature.
  • Teleoperation scaling. Companies like Physical Intelligence and 1X are building large teleoperation teams (100+ operators) collecting demonstrations across diverse tasks and environments. The cost is high ($50–200 per demonstration hour) but the data quality is excellent.
  • Autonomous collection. Deployed robots that can detect their own failures and request help create a self-improving data flywheel. Each deployment generates experience; experience improves the policy; improved policies succeed more often and generate higher-quality data. This is the endgame: a virtuous cycle where deployment itself solves the data problem.
Technology Readiness — From Lab Demo to Production

Most VLA capabilities currently sit at TRL 3–5, between proof of concept and validation in relevant but not yet operational environments.

Open problems that remain unsolved or only partially addressed:

  • Long-horizon reasoning. Current VLA models excel at short-horizon tasks (pick, place, push) but struggle with tasks requiring 50+ sequential decisions. Hierarchical approaches (VLA for primitives, LLM for sequencing) are promising but introduce new failure modes at the interface between levels.
  • Deformable object manipulation. Folding laundry, tying knots, handling food — deformable objects have infinite-dimensional state spaces that current representations cannot fully capture. SpeedFolding (Avigal et al., 2022) achieved 40 folds/hour for towels but remains far from human-level dexterity with arbitrary deformable objects.
  • Tool use. Using a spatula, a screwdriver, or a broom requires understanding that the tool extends the robot's reach and capabilities. Tool use involves contact through an intermediate object, making the control problem significantly harder than direct manipulation.
  • Social awareness. Operating around humans requires understanding social norms, personal space, turn-taking, and non-verbal communication. A robot that efficiently completes a task but makes humans uncomfortable is not successfully deployed.
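The hierarchical decomposition mentioned under long-horizon reasoning (an LLM sequencing primitives that a VLA executes) can be sketched minimally. The names and interfaces below are hypothetical; the point is that the interface itself is a failure surface: the sequencer can request a primitive the low-level policy does not implement.

python
class HierarchicalController:
    """Sketch: an LLM-style sequencer emits (primitive, goal) pairs;
    a VLA-style executor runs each primitive. All APIs illustrative."""

    def __init__(self, primitives: dict):
        # name -> callable(goal) -> bool (success/failure)
        self.primitives = primitives

    def run(self, plan: list) -> list:
        """plan: [(primitive_name, goal), ...] as produced by the sequencer."""
        log = []
        for name, goal in plan:
            skill = self.primitives.get(name)
            if skill is None:
                # Interface-level failure mode: sequencer asked for a
                # primitive the executor does not implement.
                log.append(f"UNKNOWN:{name}")
                break
            ok = skill(goal)
            log.append(f"{'OK' if ok else 'FAIL'}:{name}")
            if not ok:
                break  # a real system would replan here
        return log

Handling the `UNKNOWN` and `FAIL` branches (replanning, asking for help, falling back to teleoperation) is exactly where the "new failure modes at the interface between levels" arise.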

Industry Landscape

The commercial investment in VLA-adjacent robotics has accelerated dramatically since 2023. Several companies are pursuing the vision of general-purpose robot foundation models, with combined funding exceeding $10 billion.

Foundation Models

Google DeepMind

Pioneers of VLA (RT-1, RT-2, RT-X). Operate the largest multi-robot research fleet (~30 arms). Created Open X-Embodiment. Developing Gemini-based robotics foundation models. Deep integration of VLMs with robotic control.

Physical Intelligence (π)

Founded by Sergey Levine, Karol Hausman, Chelsea Finn, and others from Google Brain robotics. Raised $400M+ at a $2.4B valuation (2024). Building π0, a generalist VLA model. Focus on cross-embodiment transfer and large-scale data collection.

Industrial Manipulation

Covariant

Founded by Pieter Abbeel. Focus on warehouse automation and logistics. Deployed picking robots at major retailers. Built RFM-1, a robotics foundation model trained on years of real-world warehouse manipulation data. Raised $222M.

Humanoid

Figure AI

Building the Figure 02 humanoid. Partnership with OpenAI for VLA integration. BMW pilot deployment. Raised $675M at a $2.6B valuation (2024). Demonstrated multi-step manipulation with language interaction.

1X Technologies

Norwegian company building the Neo humanoid for household tasks. Backed by OpenAI. Raised a $125M Series B. Soft actuator design for intrinsic safety. Large-scale teleoperation data collection program (~1M demonstrations targeted).

Humanoid / Industrial

Tesla Bot (Optimus)

Leveraging Tesla's AI infrastructure (Dojo, FSD compute) and manufacturing scale. Internal deployment in Tesla factories. Targeting sub-$20k unit cost at scale. Vertical integration from chips to actuators to software.

💡 The funding landscape signals conviction

The scale of investment is notable: Physical Intelligence ($400M+), Figure AI ($675M+), 1X Technologies ($125M+), Covariant ($222M+), and several others have raised hundreds of millions. This is not seed-stage speculation — it reflects serious institutional conviction (Microsoft, OpenAI, Jeff Bezos, NVIDIA, Intel Capital) that foundation-model-driven robotics will become a massive industry within 5–10 years. The risk is that the technology matures more slowly than the capital deployment, creating a gap between expectations and capabilities.

Code Examples

Safety constraints implementation

A safety layer that sits between the VLA policy and the robot actuators, enforcing workspace limits and force constraints:

python
import numpy as np
from dataclasses import dataclass

@dataclass
class SafetyEnvelope:
    """Defines the safe operating region for a robot arm."""
    # Workspace limits (meters, relative to robot base)
    x_min: float = -0.8;  x_max: float = 0.8
    y_min: float = -0.8;  y_max: float = 0.8
    z_min: float = 0.02;  z_max: float = 0.9  # keep above table

    # Velocity limits (m/s and rad/s)
    max_linear_vel: float = 0.5    # tighten to 0.25 (ISO 10218 reduced speed) near humans
    max_angular_vel: float = 1.0

    # Force limits (Newtons) — ISO/TS 15066 Table A.2
    max_force_transient: float = 140.0   # chest contact threshold
    max_force_quasistatic: float = 65.0  # face contact threshold (conservative)

    # Keep-out zones: list of (center, radius) spheres
    keepout_zones: list = None

    def __post_init__(self):
        if self.keepout_zones is None:
            self.keepout_zones = []


class SafetyLayer:
    """Projects unsafe actions to the nearest safe action."""

    def __init__(self, envelope: SafetyEnvelope, dt: float = 0.05):
        self.env = envelope
        self.dt = dt
        self._violation_count = 0

    def check_action(self, current_pos: np.ndarray,
                     action: np.ndarray) -> tuple[np.ndarray, dict]:
        """
        Takes proposed action, returns safe action and diagnostics.

        Args:
            current_pos: (3,) current end-effector position [x, y, z]
            action: (7,) [dx, dy, dz, drx, dry, drz, gripper]

        Returns:
            safe_action: (7,) projected to safe space
            info: dict with violation details
        """
        safe = action.copy()
        info = {"violations": [], "original": action.copy()}

        # 1. Velocity clamping
        linear_vel = np.linalg.norm(safe[:3]) / self.dt
        if linear_vel > self.env.max_linear_vel:
            scale = self.env.max_linear_vel * self.dt / np.linalg.norm(safe[:3])
            safe[:3] *= scale
            info["violations"].append(f"linear_vel: {linear_vel:.3f} > {self.env.max_linear_vel}")

        angular_vel = np.linalg.norm(safe[3:6]) / self.dt
        if angular_vel > self.env.max_angular_vel:
            scale = self.env.max_angular_vel * self.dt / np.linalg.norm(safe[3:6])
            safe[3:6] *= scale
            info["violations"].append(f"angular_vel: {angular_vel:.3f} > {self.env.max_angular_vel}")

        # 2. Workspace boundary enforcement
        # Compare against the (possibly velocity-clamped) motion, not the raw
        # action, so velocity clamping alone is not misreported as a boundary hit.
        desired_pos = current_pos + safe[:3]
        next_pos = np.clip(desired_pos,
                           [self.env.x_min, self.env.y_min, self.env.z_min],
                           [self.env.x_max, self.env.y_max, self.env.z_max])
        if not np.allclose(next_pos, desired_pos):
            info["violations"].append("workspace_boundary")
        safe[:3] = next_pos - current_pos

        # 3. Keep-out zone avoidance
        for center, radius in self.env.keepout_zones:
            dist = np.linalg.norm(next_pos - center)
            if dist < radius:
                # Push to nearest point on sphere surface
                direction = (next_pos - center) / (dist + 1e-8)
                next_pos = center + direction * (radius + 0.01)
                safe[:3] = next_pos - current_pos
                info["violations"].append(f"keepout_zone: {center}")

        if info["violations"]:
            self._violation_count += 1

        info["safe"] = len(info["violations"]) == 0
        info["total_violations"] = self._violation_count
        return safe, info

    def check_force(self, measured_force: float,
                    contact_type: str = "transient") -> bool:
        """Returns True if force is within safe limits."""
        limit = (self.env.max_force_transient if contact_type == "transient"
                 else self.env.max_force_quasistatic)
        return measured_force <= limit


# ── Usage ────────────────────────────────────────────────
envelope = SafetyEnvelope(
    keepout_zones=[
        (np.array([0.3, 0.0, 0.5]), 0.15),  # human detected here
    ]
)
safety = SafetyLayer(envelope, dt=0.05)

# Policy proposes an action
policy_action = np.array([0.05, 0.02, -0.03, 0.0, 0.0, 0.1, 0.8])
current_pos = np.array([0.4, 0.1, 0.3])

safe_action, info = safety.check_action(current_pos, policy_action)
print(f"Safe: {info['safe']}, Violations: {info['violations']}")

Multi-robot task allocation

A simple auction-based task allocator for a fleet of robots with different capabilities:

python
import numpy as np
from dataclasses import dataclass

@dataclass
class Robot:
    id: str
    position: np.ndarray          # (3,) current position
    capabilities: set             # e.g., {"pick", "place", "pour", "navigate"}
    battery: float = 1.0          # 0..1
    current_task: str = None

@dataclass
class Task:
    id: str
    position: np.ndarray          # (3,) task location
    required_capabilities: set    # capabilities needed
    priority: float = 1.0         # higher = more urgent
    estimated_duration: float = 60.0  # seconds


class AuctionAllocator:
    """Auction-based multi-robot task allocation.

    Each robot "bids" on each task based on:
    - Distance to task (travel cost)
    - Capability match
    - Battery level
    - Task priority

    Tasks are assigned to the lowest-cost capable robot.
    """

    def __init__(self, distance_weight: float = 1.0,
                 battery_weight: float = 0.5):
        self.w_dist = distance_weight
        self.w_batt = battery_weight

    def compute_bid(self, robot: Robot, task: Task) -> float:
        """Lower bid = better match. Returns inf if robot cannot do task."""
        # Check capability match
        if not task.required_capabilities.issubset(robot.capabilities):
            return float('inf')

        # Check battery sufficiency (need enough for travel + task + return)
        travel_dist = np.linalg.norm(robot.position - task.position)
        estimated_battery_cost = (travel_dist * 0.01 +
                                  task.estimated_duration * 0.001)
        if estimated_battery_cost > robot.battery * 0.8:  # 20% reserve
            return float('inf')

        # Already busy
        if robot.current_task is not None:
            return float('inf')

        # Compute bid: weighted combination of costs
        cost = (self.w_dist * travel_dist +
                self.w_batt * (1.0 - robot.battery))

        # Discount by task priority (high priority = lower effective cost)
        cost /= task.priority
        return cost

    def allocate(self, robots: list[Robot],
                 tasks: list[Task]) -> dict[str, str]:
        """Assign tasks to robots via sequential auction.

        Returns: dict mapping task_id -> robot_id
        """
        assignments = {}
        available = set(r.id for r in robots if r.current_task is None)
        robot_map = {r.id: r for r in robots}

        # Sort tasks by priority (highest first)
        sorted_tasks = sorted(tasks, key=lambda t: t.priority, reverse=True)

        for task in sorted_tasks:
            best_bid = float('inf')
            best_robot = None

            for rid in available:
                bid = self.compute_bid(robot_map[rid], task)
                if bid < best_bid:
                    best_bid = bid
                    best_robot = rid

            if best_robot is not None:
                assignments[task.id] = best_robot
                available.discard(best_robot)
                robot_map[best_robot].current_task = task.id

        return assignments


# ── Usage ────────────────────────────────────────────────
robots = [
    Robot("arm_1", np.array([0, 0, 0]),    {"pick", "place", "pour"}),
    Robot("arm_2", np.array([2, 0, 0]),    {"pick", "place"}),
    Robot("mobile_1", np.array([1, 3, 0]), {"pick", "place", "navigate"}),
]

tasks = [
    Task("fetch_cup", np.array([1, 1, 0]),
         {"pick", "navigate"}, priority=2.0),
    Task("sort_items", np.array([0.5, 0, 0]),
         {"pick", "place"}, priority=1.0),
    Task("pour_water", np.array([0, 0.5, 0]),
         {"pick", "pour"}, priority=1.5),
]

allocator = AuctionAllocator()
assignments = allocator.allocate(robots, tasks)
for task_id, robot_id in assignments.items():
    print(f"  {task_id} -> {robot_id}")

Deployment inference pipeline

A production-oriented inference pipeline for deploying a VLA model on a real robot, with safety checks, latency monitoring, and graceful failure handling:

python
import time
import numpy as np
from collections import deque
from dataclasses import dataclass

@dataclass
class InferenceStats:
    """Tracks deployment inference performance."""
    latencies: deque = None    # recent inference times (ms)
    success_count: int = 0
    failure_count: int = 0
    safety_interventions: int = 0

    def __post_init__(self):
        if self.latencies is None:
            self.latencies = deque(maxlen=100)

    @property
    def avg_latency_ms(self) -> float:
        return np.mean(self.latencies) if self.latencies else 0.0

    @property
    def p99_latency_ms(self) -> float:
        return np.percentile(list(self.latencies), 99) if self.latencies else 0.0

    @property
    def success_rate(self) -> float:
        total = self.success_count + self.failure_count
        return self.success_count / total if total > 0 else 0.0


class VLADeploymentPipeline:
    """Production inference pipeline for VLA model deployment."""

    def __init__(self, model, safety_layer, control_freq_hz: float = 10.0,
                 max_latency_ms: float = 80.0, max_consecutive_failures: int = 5):
        self.model = model
        self.safety = safety_layer
        self.control_dt = 1.0 / control_freq_hz
        self.max_latency = max_latency_ms
        self.max_failures = max_consecutive_failures
        self.stats = InferenceStats()
        self._consecutive_failures = 0

    def preprocess_observation(self, obs: dict) -> dict:
        """Normalize and validate sensor inputs."""
        processed = {}

        # Validate image dimensions
        for cam_name in ["wrist_cam", "overhead_cam"]:
            if cam_name in obs:
                img = obs[cam_name]
                assert img.shape[-1] == 3, f"Expected RGB, got {img.shape}"
                # Normalize to [0, 1]
                processed[cam_name] = img.astype(np.float32) / 255.0

        # Validate proprioception
        if "joint_positions" in obs:
            jp = obs["joint_positions"]
            assert not np.any(np.isnan(jp)), "NaN in joint positions"
            processed["joint_positions"] = jp

        # Language instruction (tokenized upstream)
        if "instruction" in obs:
            processed["instruction"] = obs["instruction"]

        return processed

    def _register_failure(self, step_info: dict) -> None:
        """Count a failure; request human intervention past the threshold."""
        self._consecutive_failures += 1
        self.stats.failure_count += 1
        if self._consecutive_failures >= self.max_failures:
            step_info["emergency_stop"] = True
            raise RuntimeError(
                f"{self.max_failures} consecutive failures — requesting human intervention. "
                f"Stats: {self.stats.avg_latency_ms:.1f}ms avg latency, "
                f"{self.stats.success_rate:.1%} success rate"
            )

    def run_step(self, observation: dict,
                 current_ee_pos: np.ndarray) -> tuple[np.ndarray, dict]:
        """Execute one control step: observe -> infer -> safety check -> act.

        Returns (action, step_info); raises RuntimeError if unrecoverable.
        Failure paths all route through _register_failure so the emergency
        stop also triggers on repeated preprocessing or inference errors.
        """
        step_info = {"timestamp": time.time()}

        # 1. Preprocess
        try:
            processed_obs = self.preprocess_observation(observation)
        except (AssertionError, ValueError) as e:
            step_info["error"] = f"preprocess: {e}"
            self._register_failure(step_info)
            return np.zeros(7), step_info  # zero action = hold position

        # 2. Model inference with latency tracking
        t0 = time.perf_counter()
        try:
            raw_action = self.model.predict(processed_obs)
        except Exception as e:
            step_info["error"] = f"inference: {e}"
            self._register_failure(step_info)
            return np.zeros(7), step_info

        latency_ms = (time.perf_counter() - t0) * 1000
        self.stats.latencies.append(latency_ms)
        step_info["latency_ms"] = latency_ms

        # 3. Latency check — log a warning; a stricter pipeline would hold
        #    position or fall back to a cached action here
        if latency_ms > self.max_latency:
            step_info["warning"] = f"latency {latency_ms:.1f}ms > {self.max_latency}ms"

        # 4. Safety projection
        safe_action, safety_info = self.safety.check_action(
            current_ee_pos, raw_action
        )
        step_info["safety"] = safety_info

        if not safety_info["safe"]:
            self.stats.safety_interventions += 1

        # 5. Success/failure tracking (e-stop handled in _register_failure)
        if safety_info["safe"]:
            self._consecutive_failures = 0
            self.stats.success_count += 1
        else:
            self._register_failure(step_info)

        return safe_action, step_info

    def get_report(self) -> str:
        """Human-readable deployment performance summary."""
        return (
            f"VLA Deployment Report\n"
            f"  Avg latency:    {self.stats.avg_latency_ms:.1f} ms\n"
            f"  P99 latency:    {self.stats.p99_latency_ms:.1f} ms\n"
            f"  Success rate:   {self.stats.success_rate:.1%}\n"
            f"  Safety stops:   {self.stats.safety_interventions}\n"
            f"  Total steps:    {self.stats.success_count + self.stats.failure_count}"
        )

References

Seminal papers and key works referenced in this article.

  1. Zhao et al. "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." RSS, 2023. arXiv
  2. Fu et al. "Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation." CoRL, 2024. arXiv
  3. Chen et al. "DexPoint: Generalizable Point Cloud Reinforcement Learning for Sim-to-Real Dexterous Manipulation." CoRL, 2023. arXiv
  4. Black et al. "π0: A Vision-Language-Action Flow Model for General Robot Control." 2024. arXiv
  5. Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL, 2023. arXiv