Scale real-world robot data to 20,000 hours across 9 dual-arm embodiments. No saturation in sight. LingBot-VLA combines a VLM understanding expert with a flow-matching action expert, evaluated on 100 tasks across 3 platforms.
You want a robot that can fold towels, peel lemons, disassemble puzzle locks, and arrange flowers. Not in simulation — on a real physical table, with real grippers, handling real objects that slip, squish, and break.
Vision-Language-Action (VLA) models promise exactly this: feed in camera images and a language instruction, get back motor commands. Models like π0 and GR00T have made impressive progress. But a fundamental question remains unanswered:
The challenge is threefold. First, collecting real robot data is brutally expensive. Each hour requires a physical robot, a human teleoperator, and careful quality control. Second, training on massive datasets demands an efficient codebase — existing VLA frameworks bottleneck at data I/O and communication overhead. Third, evaluation is fragmented: most papers test on a handful of tasks with a single robot, making comparisons unreliable.
LingBot-VLA attacks all three problems simultaneously. The authors collected 20,000 hours of real teleoperation data from 9 different dual-arm robots, built a training codebase that processes 261 samples per second on 8 GPUs, and evaluated on 100 tasks across 3 platforms, with 130 post-training episodes per task per platform.
The result? Success rates climb steadily from 3,000 to 20,000 hours of pre-training data with no sign of saturation. More real data keeps helping.
Success rate as pre-training data grows from 3K to 20K hours. The curve keeps rising, with no plateau.
LingBot-VLA's philosophy is refreshingly pragmatic: scale real data, not architecture complexity.
Many recent VLA papers compete on clever architectural innovations — spatial representations, novel attention patterns, specialized pre-training objectives. LingBot-VLA bets on a simpler thesis: a clean architecture trained on a truly massive and diverse real-world dataset will outperform fancier models trained on less data.
The second key insight is about evaluation rigor. Most VLA papers evaluate on 5-20 tasks with a single robot. LingBot-VLA uses the GM-100 benchmark: 100 tasks on 3 platforms, with 130 post-training episodes per task per platform and 15 evaluation trials per task per robot for each model. Summed over all compared models, that is 22,500 evaluation trials, enough to draw statistically meaningful conclusions.
The third insight is data efficiency. With strong pre-training, LingBot-VLA needs fewer demonstrations to adapt to new tasks: post-trained on only 80 demonstrations per task, it already outperforms π0.5 post-trained on the full 130-demonstration budget. Good pre-training is a force multiplier.
Twenty thousand hours of robot data does not appear out of thin air. LingBot-VLA's dataset was collected across 9 different dual-arm robot embodiments, each with its own kinematic configuration, gripper design, and camera setup.
| Robot | Arms | Cameras | Teleoperation |
|---|---|---|---|
| AgiBot G1 | 2 × 7-DoF | 3 RGB-D | VR-based |
| AgileX | 2 × 6-DoF | 3 cameras | Isomorphic arms |
| Galaxea R1Lite | 2 × 6-DoF | 1 stereo + 2 wrist | Teleoperation |
| Galaxea R1Pro | 2 × 7-DoF | 1 stereo + 2 wrist | Teleoperation |
| Realman RS-02 | 2 × 7-DoF | 3 cameras | Teleoperation |
| Leju KUAVO 4 Pro | 2 × 7-DoF | 1 head + 2 wrist | Humanoid |
| Qinglong | 2 × 7-DoF | 1 head + 2 wrist | Humanoid |
| ARX Lift2 | 2 × 6-DoF | 3 cameras | Teleoperation |
| Bimanual Franka | 2 × 7-DoF | 3 cameras | Teleoperation |
Notice the diversity: arms have 6 or 7 degrees of freedom, camera rigs mix RGB-D, stereo, head, and wrist views, and teleoperation methods differ (VR, isomorphic arms, direct). This diversity is intentional: the model must learn manipulation skills that transfer across embodiment differences.
Raw teleoperation videos are not directly useful for training. They need language instructions that describe what the robot is doing. LingBot-VLA uses a two-stage annotation process:
The word cloud of atomic actions in the dataset reveals over 150 distinct verbs: move, pick, fold, pour, open, release, insert, arrange, stir, stack, rotate, peel, and many more. This action vocabulary is far richer than what most VLA datasets provide.
Here is a hard problem: the 9 robot platforms have different numbers of joints, different gripper types, and different control interfaces. A 6-DoF arm and a 7-DoF arm produce action vectors of different sizes. How do you train a single model on all of them?
LingBot-VLA uses continuous actions in a unified format. Each action step specifies joint positions (or velocities) for both arms plus gripper commands. For robots with fewer degrees of freedom, unused dimensions are masked or zero-padded. The key insight: despite different kinematic chains, all these dual-arm robots perform similar manipulation tasks (grasping, placing, pouring), so a shared action representation can capture the common structure.
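The padding-and-masking idea can be sketched in a few lines. This is a minimal illustration, not the project's actual data format: the 16-dimensional layout (7 joint slots per arm plus 2 gripper commands), the constant names, and the `to_unified_action` helper are all assumptions made for the example.

```python
import numpy as np

# Hypothetical unified layout: 2 arms x 7 joint slots + 2 gripper commands.
MAX_ARM_DOF = 7
UNIFIED_DIM = 2 * MAX_ARM_DOF + 2

def to_unified_action(left_arm, right_arm, grippers):
    """Zero-pad a dual-arm action into the unified vector and return a validity mask."""
    action = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for slot, arm in enumerate((left_arm, right_arm)):
        start = slot * MAX_ARM_DOF
        action[start:start + len(arm)] = arm
        mask[start:start + len(arm)] = True
    action[-2:] = grippers
    mask[-2:] = True
    return action, mask

# A 6-DoF robot leaves one slot per arm zero-padded and masked out.
action, mask = to_unified_action(np.ones(6), np.ones(6), np.array([0.5, 0.5]))
```

The mask lets a training loss ignore the padded slots, so a 6-DoF robot is never penalized for dimensions it does not have.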
Instead of predicting one action at a time, LingBot-VLA predicts an action chunk, a sequence of T consecutive actions: $A_t = [a_t, a_{t+1}, \dots, a_{t+T-1}]$.
During pre-training, T = 50. This means the model outputs the next 50 time steps of robot motion in one forward pass.
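Cutting a recorded trajectory into chunks might look like the sketch below. The `action_chunk` helper and the 16-dimensional action size are hypothetical, and repeating the final action when a chunk runs past the end of an episode is an assumption; the source does not say how episode boundaries are handled.

```python
import numpy as np

def action_chunk(trajectory, t, horizon=50):
    """Return actions t .. t+horizon-1, repeating the last action past the episode end."""
    idx = np.minimum(np.arange(t, t + horizon), len(trajectory) - 1)
    return trajectory[idx]

# Hypothetical episode: 120 time steps of 16-dimensional actions.
trajectory = np.random.default_rng(0).normal(size=(120, 16))
chunk = action_chunk(trajectory, t=100)  # shape (50, 16); the last 30 rows repeat step 119
```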
The action chunks are generated via Flow Matching, not autoregressive prediction. Starting from Gaussian noise, the model iteratively refines the action chunk toward a coherent trajectory. The training objective is:
Where $A_{t,s} = sA_t + (1-s)\epsilon$ is a linear interpolation between noise $\epsilon$ and the ground-truth action chunk $A_t$, and $v_\theta$ is the learned velocity field.
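For reference, these definitions pin down the standard conditional flow-matching loss. The form below is a reconstruction, not a quotation: conditioning the velocity field on the observation $O_t$ and the distribution over the interpolation time $s$ are assumptions.

$$
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{s,\,\epsilon}\left[\left\lVert v_\theta(A_{t,s}, O_t, s) - (A_t - \epsilon)\right\rVert^2\right]
$$

The regression target $A_t - \epsilon$ is the derivative of the interpolation with respect to $s$: $\frac{d}{ds}\left[sA_t + (1-s)\epsilon\right] = A_t - \epsilon$.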
Think of it this way: the model learns to "push" random noise toward real trajectories. At inference, you start with noise and follow the learned velocity field to produce a smooth action sequence.
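Both halves of that picture fit in a short NumPy sketch: building a training pair from the interpolation, and sampling by Euler integration from noise at s = 0 to data at s = 1. The function names, the 10-step integrator, and the `oracle` velocity field (the exact field for a single known chunk, standing in for a trained network) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def training_pair(chunk, s):
    """Build the noisy input A_{t,s} and the velocity target for one action chunk."""
    noise = rng.normal(size=chunk.shape)
    noisy = s * chunk + (1.0 - s) * noise  # s=0 is pure noise, s=1 is the data
    target_velocity = chunk - noise        # derivative of the interpolation in s
    return noisy, target_velocity

def sample(velocity_fn, shape, steps=10):
    """Integrate the velocity field from s=0 (noise) to s=1 with Euler steps."""
    x = rng.normal(size=shape)
    ds = 1.0 / steps
    for k in range(steps):
        x = x + ds * velocity_fn(x, k * ds)
    return x

# Stand-in for a trained network: the exact velocity field for one known chunk.
true_chunk = np.zeros((50, 16))
oracle = lambda x, s: (true_chunk - x) / (1.0 - s)
generated = sample(oracle, true_chunk.shape)  # converges to true_chunk
```

With a learned network in place of `oracle`, the same loop turns Gaussian noise into a 50-step action chunk in a handful of network evaluations.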
LingBot-VLA has a clean two-expert design: an understanding expert that processes visual and language inputs, and an action expert that generates motor commands. They communicate through shared self-attention.
The understanding expert is a pre-trained Vision-Language Model (Qwen2.5-VL). It takes three inputs:
The VLM encodes these into a rich multimodal representation. It understands what objects are present, what the task requires, and what the current robot configuration looks like.
The action expert is a separate transformer pathway that receives the robot's proprioceptive state and produces action chunks via flow matching. It has its own feed-forward layers and embeddings, but shares self-attention with the understanding expert.
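A toy version of "joint attention, separate feed-forward" is sketched below. It is deliberately stripped down and should not be read as the model's real layer: there is one attention head with no learned projections, no normalization, and no attention mask, and the hidden size `D` and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical hidden size

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(tokens):
    """Single-head self-attention over the concatenated token sequence."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def make_ffn():
    """Create an independent two-layer ReLU feed-forward block."""
    w1 = rng.normal(size=(D, 4 * D)) * 0.1
    w2 = rng.normal(size=(4 * D, D)) * 0.1
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

ffn_understanding, ffn_action = make_ffn(), make_ffn()

def mot_layer(obs_tokens, act_tokens):
    """Mix all tokens in one attention pass, then route each group to its own FFN."""
    n_obs = len(obs_tokens)
    mixed = joint_attention(np.concatenate([obs_tokens, act_tokens]))
    obs = obs_tokens + mixed[:n_obs]
    act = act_tokens + mixed[n_obs:]
    return obs + ffn_understanding(obs), act + ffn_action(act)
```

The point of the design is that action tokens can read the VLM's representations through attention while the VLM's feed-forward weights stay untouched by action-specific computation.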
The two experts are woven together using a Mixture-of-Transformers architecture (inspired by BAGEL). In each layer:
The joint sequence $[O_t, A_t]$ is split into three blocks:
This causal structure ensures the action expert has access to all observation context, but future actions cannot leak information backward into the observation representations.
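A blockwise causal mask of this kind can be built from block sizes alone. In the sketch below, tokens attend freely within their own block and to every earlier block, but never to later ones. The helper name and the example block sizes (4 observation tokens, 1 state token, 3 action tokens) are hypothetical.

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """Return mask where mask[i, j] is True when token i may attend to token j."""
    block_id = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_id[:, None] >= block_id[None, :]

# Hypothetical sizes: 4 observation tokens, 1 state token, 3 action tokens.
mask = blockwise_causal_mask([4, 1, 3])
```

Row 0 (an observation token) is False in every action column, which is exactly the "no backward leak" property, while the last row (an action token) is True everywhere.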
For tasks requiring precise spatial awareness, LingBot-VLA adds a depth distillation mechanism. Learnable queries $[Q_1, Q_2, Q_3]$ are processed through the VLM and then aligned with depth tokens from LingBot-Depth via a distillation loss:
This infuses geometric information without requiring depth at inference time — the VLM learns to internalize spatial reasoning during training.
The understanding expert (VLM) processes images and language. The action expert generates motor commands. They share self-attention but have separate feed-forward networks.
Training a VLA on 20,000 hours of data is a serious engineering challenge. Without an efficient pipeline, a single training run could take months. LingBot-VLA achieves 261 samples per second on 8 GPUs — 1.5 to 2.8 times faster than existing codebases.
LingBot-VLA uses FSDP (Fully Sharded Data Parallel), PyTorch's implementation of ZeRO-style sharding. It shards optimizer states, model parameters, and gradients across GPUs to minimize the per-GPU memory footprint.
But here is the clever part: the authors construct specific shard groups exclusively for the action expert modules, inspired by Hybrid Sharded Data Parallel (HSDP). The VLM backbone is large but its gradients are well-behaved. The action expert is smaller but its parameters are updated more aggressively. Sharding the two differently reduces communication overhead without sacrificing memory efficiency.
The multimodal fusion of vision, language, and action is fundamentally a sparse attention process — not all tokens need to attend to all other tokens. LingBot-VLA exploits this with two techniques:
Reductions (gradient aggregation across GPUs) run in float32 for numerical stability. Storage and communication use bfloat16. This halves memory traffic for the bulk of computation while keeping the critical accumulation steps precise.
| Codebase | 8 GPUs | 32 GPUs | 128 GPUs | 256 GPUs |
|---|---|---|---|---|
| LingBot-VLA | 261 | 975 | 3,700 | 7,356 |
| StarVLA | 159 | 506 | 1,307 | 2,644 |
| DexBotic | 142 | 443 | 1,079 | — |
| OpenPI | 90 | 349 | — | — |
Samples per second with Qwen2.5-VL-3B backbone. LingBot-VLA maintains near-linear scaling from 8 to 256 GPUs, closely tracking the theoretical maximum.
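To put a number on "near-linear", compare each measured throughput with perfect linear scaling from the 8-GPU baseline. The figures come from the LingBot-VLA row of the table; the `scaling_efficiency` helper is just illustrative arithmetic.

```python
def scaling_efficiency(throughput, base_gpus=8):
    """Measured throughput as a fraction of perfect linear scaling from the baseline."""
    base_rate = throughput[base_gpus]
    return {gpus: rate / (base_rate * gpus / base_gpus)
            for gpus, rate in throughput.items()}

# LingBot-VLA row of the table above, in samples per second.
throughput = {8: 261, 32: 975, 128: 3700, 256: 7356}

for gpus, eff in scaling_efficiency(throughput).items():
    print(f"{gpus:>3} GPUs: {eff:.1%} of linear scaling")
```

By this measure the codebase retains roughly 93% of ideal throughput at 32 GPUs and about 88% at 128 and 256 GPUs.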
Most VLA papers evaluate on a handful of tasks with vague protocols. LingBot-VLA uses the GM-100 benchmark — 100 carefully designed manipulation tasks, evaluated across 3 platforms, with a strict protocol that makes comparisons meaningful.
The benchmark runs on 25 physical robots across 3 platforms: AgileX, AgiBot G1, and Galaxea R1Pro. All are dual-arm with parallel-jaw grippers, and each has a head camera and two wrist cameras. All tasks are tabletop-based with the chassis fixed.
For each of the 100 tasks:
| Metric | Definition | Purpose |
|---|---|---|
| Success Rate (SR) | Fraction of trials completing all task steps within 3 minutes | Real-world deployment viability |
| Progress Score (PS) | Fraction of sequential sub-task checkpoints completed (e.g., 4/6 steps = 0.67) | Diagnose failure modes, reward partial success |
A trial ends early if 3 consecutive sub-task failures occur or a safety event (collision) happens. Each model gets 15 trials per task per robot for statistical robustness.
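The two metrics are easy to compute from per-trial checkpoint counts. The sketch below assumes each trial is logged as a pair (checkpoints completed, checkpoints total), that time-outs and early terminations are already reflected in those counts, and that scores are averaged uniformly over trials. The trial data is made up for illustration.

```python
def success_rate(trials):
    """Fraction of trials in which every checkpoint was completed."""
    return sum(done == total for done, total in trials) / len(trials)

def progress_score(trials):
    """Mean fraction of sequential checkpoints completed per trial."""
    return sum(done / total for done, total in trials) / len(trials)

# Hypothetical results for one task: (checkpoints completed, checkpoints total).
trials = [(6, 6), (4, 6), (0, 6), (6, 6), (3, 6)]

sr = success_rate(trials)    # 2 of 5 trials fully succeed
ps = progress_score(trials)  # partial progress still earns credit
```

Here SR is 0.40 while PS is about 0.63, which shows why the progress score is the more forgiving and more diagnostic of the two.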
The total evaluation across all models involves 22,500 trials with comprehensive recording (third-person video, robot states, model predictions) in rosbag format. All recordings will be open-sourced for reproducibility.
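The 22,500 figure can be reproduced from the protocol numbers. The sketch assumes the count covers the five model variants reported in the real-world comparison (WALL-OSS, GR00T N1.6, π0.5, and LingBot-VLA with and without depth), each run for 15 trials per task on each of the 3 platforms. That breakdown is an inference, not something the source states explicitly.

```python
tasks = 100
platforms = 3
trials_per_task_per_platform = 15
model_variants = 5  # assumption: the five rows of the real-world results table

total_trials = tasks * platforms * trials_per_task_per_platform * model_variants
print(total_trials)
```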
This is the paper's headline result. As pre-training data scales from 3,000 to 20,000 hours, both success rate and progress score climb steadily across all three platforms.
The experiment was run on a subset of 25 representative tasks. At each data volume (3K, 5K, 8K, 12K, 20K hours), the model was pre-trained from scratch, then post-trained on the evaluation tasks.
Equally important: the scaling behavior is consistent across all three platforms. AgileX, AgiBot G1, and Galaxea R1Pro all improve with more data, and their individual trends align with the aggregate curve. The scaling law is not specific to any single embodiment.
A well-pre-trained model needs fewer demonstrations to adapt. On 8 representative tasks with the AgiBot G1 platform:
Success rate and progress score across data volumes. Both metrics climb steadily with no saturation. The three platforms track each other closely.
LingBot-VLA was compared against three state-of-the-art VLA models under strict experimental controls. All models were fine-tuned from public pre-trained checkpoints using the same data, hyperparameters, and evaluation protocol. Here are the real-world results on the GM-100 benchmark:
| Model | Success Rate | Progress Score |
|---|---|---|
| WALL-OSS | 4.05% | 10.35% |
| GR00T N1.6 | 7.59% | 15.99% |
| π0.5 | 13.02% | 27.65% |
| LingBot-VLA (no depth) | 15.74% | 33.69% |
| LingBot-VLA (w/ depth) | 17.30% | 35.41% |
| Platform | π0.5 SR | LingBot-VLA (w/ depth) SR | Improvement (percentage points) |
|---|---|---|---|
| AgiBot G1 | 7.77% | 11.98% | +4.21 |
| AgileX | 17.20% | 18.93% | +1.73 |
| Galaxea R1Pro | 14.10% | 20.98% | +6.88 |
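As a consistency check, the per-platform gaps and the headline averages can be recomputed from this table. The snippet assumes the overall success rate is the unweighted mean over the three platforms, which matches the reported 13.02% and 17.30%.

```python
# platform: (pi0.5 SR, LingBot-VLA with depth SR), both in percent
sr = {
    "AgiBot G1": (7.77, 11.98),
    "AgileX": (17.20, 18.93),
    "Galaxea R1Pro": (14.10, 20.98),
}

# Absolute gains in percentage points, not relative percent.
gain_pp = {name: round(ours - base, 2) for name, (base, ours) in sr.items()}
mean_base = round(sum(base for base, _ in sr.values()) / len(sr), 2)
mean_ours = round(sum(ours for _, ours in sr.values()) / len(sr), 2)
```

The gains are absolute differences in success rate. In relative terms the AgiBot G1 improvement is the largest, since it starts from the lowest baseline.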
GR00T N1.6 performs well on Galaxea R1Pro (14.29% SR) — comparable to π0.5 — because its pre-training included extensive Galaxea data. This highlights an important point: pre-training data composition matters as much as volume.
On 50 simulation tasks, LingBot-VLA also leads:
| Model | Clean Scenes | Randomized Scenes |
|---|---|---|
| π0.5 | 82.74% | 76.76% |
| LingBot-VLA (no depth) | 86.50% | 85.34% |
| LingBot-VLA (w/ depth) | 88.56% | 86.68% |
The gap is especially large in randomized scenes (+9.92 percentage points over π0.5 with depth, +8.58 without), where diverse pre-training data helps most.
Average success rate across 100 tasks on the real-world GM-100 benchmark. Higher is better.
LingBot-VLA sits at the intersection of several important research threads. Let's map where it fits.
π0 pioneered the VLM + action expert design with flow matching. LingBot-VLA adopts the same high-level architecture (VLM backbone + flow-matching action head with blockwise causal attention) but demonstrates that the primary bottleneck is data scale, not architecture. LingBot-VLA's 20K-hour dataset dwarfs what π0 was trained on, and the scaling curves suggest even more data would help.
NVIDIA's GR00T N1.6 is a generalist humanoid foundation model. The GM-100 comparison reveals an interesting finding: GR00T performs well specifically on platforms whose data was heavily represented in its pre-training (Galaxea R1Pro). This underscores that pre-training data composition — not just architecture — drives cross-embodiment transfer.
LingBot-VLA provides the first evidence of scaling laws for VLA models on real-world data, analogous to the scaling laws discovered for LLMs (Kaplan et al., 2020). The implication is the same: invest in data collection infrastructure, because performance will keep improving with scale.
Diffusion Policy showed that diffusion-based action generation outperforms earlier behavioral cloning policies for robot manipulation. LingBot-VLA uses flow matching (a close relative of diffusion) for the same reason: it handles multimodal action distributions better than regression losses.
| Aspect | LingBot-VLA |
|---|---|
| Architecture | VLM (Qwen2.5-VL) + Action Expert, MoT with shared self-attention |
| Action generation | Flow Matching, action chunks of T=50 |
| Pre-training data | ~20,000 hours from 9 dual-arm robot configurations |
| Evaluation | GM-100: 100 tasks, 3 platforms, 130 post-training episodes per task per platform, 22,500 total evaluation trials |
| Throughput | 261 samples/sec on 8 GPUs (1.5-2.8x faster than baselines) |
| Depth integration | Optional via distillation from LingBot-Depth |
| Best SR (GM-100) | 17.30% average (w/ depth), vs 13.02% for π0.5 |
| Scaling | No saturation from 3K to 20K hours |
| Open source | Code, model, benchmark data |