Wu, Lu, Wang, Yang, Liu, et al. — Robbyant, 2026

A Pragmatic VLA Foundation Model

Scale real-world robot data to 20,000 hours across 9 dual-arm embodiments. No saturation in sight. LingBot-VLA combines a VLM understanding expert with a flow-matching action expert, evaluated on 100 tasks across 3 platforms.

Prerequisites: Vision-Language Models + Diffusion / Flow Matching + Imitation Learning basics
10 chapters · 4 simulations

Chapter 0: The Problem

You want a robot that can fold towels, peel lemons, disassemble puzzle locks, and arrange flowers. Not in simulation — on a real physical table, with real grippers, handling real objects that slip, squish, and break.

Vision-Language-Action (VLA) models promise exactly this: feed in camera images and a language instruction, get back motor commands. Models like π0 and GR00T have made impressive progress. But a fundamental question remains unanswered:

The scaling question: How do VLA models actually scale with real-world robot data? Does performance plateau after a few thousand hours? Or does it keep climbing? Nobody had run this experiment at scale — until now.

The challenge is threefold. First, collecting real robot data is brutally expensive. Each hour requires a physical robot, a human teleoperator, and careful quality control. Second, training on massive datasets demands an efficient codebase — existing VLA frameworks bottleneck at data I/O and communication overhead. Third, evaluation is fragmented: most papers test on a handful of tasks with a single robot, making comparisons unreliable.

LingBot-VLA attacks all three problems simultaneously. The authors collected 20,000 hours of real teleoperation data from 9 different dual-arm robots, built a training codebase that processes 261 samples per second on 8 GPUs, and evaluated on 100 tasks across 3 platforms, with 130 post-training episodes per task on each platform.

The result? Success rates climb steadily from 3,000 to 20,000 hours of pre-training data with no sign of saturation. More real data keeps helping.

Figure: Data Scaling Curve. Success rate as pre-training data grows from 3K to 20K hours; the curve keeps rising with no plateau.

Why has no one previously studied how VLA models scale with real-world data at this magnitude?

Chapter 1: The Key Insight

LingBot-VLA's philosophy is refreshingly pragmatic: scale real data, not architecture complexity.

Many recent VLA papers compete on clever architectural innovations — spatial representations, novel attention patterns, specialized pre-training objectives. LingBot-VLA bets on a simpler thesis: a clean architecture trained on a truly massive and diverse real-world dataset will outperform fancier models trained on less data.

Thesis 1: Data Volume
Scale pre-training from 3K to 20K hours. Success rates improve consistently and substantially. No saturation even at 20K hours.
Thesis 2: Data Diversity
Collect from 9 different dual-arm robot configurations. The model learns manipulation primitives that transfer across embodiments.
Thesis 3: Efficiency
Build a codebase fast enough to actually train at this scale: 261 samples/sec on 8 GPUs, 1.5-2.8x faster than existing frameworks.
Why this matters: If VLA scaling shows no saturation at 20K hours, then the path to better robot policies is clear: collect more data. This is the same insight that drove the LLM revolution — scaling laws that reward data collection over architectural novelty.

The second key insight is about evaluation rigor. Most VLA papers evaluate on 5-20 tasks with a single robot. LingBot-VLA uses the GM-100 benchmark: 100 tasks, 3 platforms, 130 post-training episodes per task per platform, and 15 evaluation trials per task per platform for each model. Across all compared models this adds up to 22,500 evaluation trials, enough to draw statistically meaningful conclusions.

The third insight is data efficiency. With strong pre-training, LingBot-VLA needs fewer demonstrations to adapt to new tasks. With only 80 demonstrations per task, it already outperforms π0.5 using the full 130-demonstration budget. Good pre-training is a force multiplier.

What is the central bet that distinguishes LingBot-VLA from other recent VLA models?

Chapter 2: Data Collection

Twenty thousand hours of robot data does not appear out of thin air. LingBot-VLA's dataset was collected across 9 different dual-arm robot embodiments, each with its own kinematic configuration, gripper design, and camera setup.

The Nine Robot Platforms

| Robot | Arms | Cameras | Teleoperation |
|---|---|---|---|
| AgiBot G1 | 2 × 7-DoF | 3 RGB-D | VR-based |
| AgileX | 2 × 6-DoF | 3 cameras | Isomorphic arms |
| Galaxea R1Lite | 2 × 6-DoF | 1 stereo + 2 wrist | Teleoperation |
| Galaxea R1Pro | 2 × 7-DoF | 1 stereo + 2 wrist | Teleoperation |
| Realman RS-02 | 2 × 7-DoF | 3 cameras | Teleoperation |
| Leju KUAVO 4 Pro | 2 × 7-DoF | 1 head + 2 wrist | Humanoid |
| Qinglong | 2 × 7-DoF | 1 head + 2 wrist | Humanoid |
| ARX Lift2 | 2 × 6-DoF | 3 cameras | Teleoperation |
| Bimanual Franka | 2 × 7-DoF | 3 cameras | Teleoperation |

Notice the diversity: arms have 6 or 7 degrees of freedom, camera setups mix RGB-D, stereo, head, and wrist views, and teleoperation methods differ (VR, isomorphic arms, direct). This diversity is intentional: the model must learn manipulation skills that transfer across embodiment differences.

Automatic Annotation Pipeline

Raw teleoperation videos are not directly useful for training. They need language instructions that describe what the robot is doing. LingBot-VLA uses a three-stage annotation process:

Stage 1: Video Segmentation
Human annotators decompose multi-view videos into atomic action clips based on predefined action categories. Static frames at start/end are trimmed to remove dead time.
Stage 2: Automatic Instruction
Qwen3-VL-235B-A22B generates precise language instructions for both full tasks ("Toast bread, add lettuce and sauce to make a sandwich") and sub-tasks ("Take the bread out of the toaster").
Stage 3: Human Refinement
Human reviewers verify and correct the auto-generated instructions, ensuring accuracy of sub-task boundaries and language descriptions.
Why sub-task decomposition? A complex task like "make a sandwich" involves dozens of atomic actions: reach for bread, grasp bread, place bread in toaster, press lever, wait, extract bread, etc. Training on sub-task annotations lets the model learn reusable manipulation primitives that compose into longer sequences.
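
To make the annotation output concrete, here is a minimal Python sketch of how one annotated episode might be stored and how static frames at the start and end could be trimmed. The names `SubTask`, `Episode`, and `trim_static`, the frame indices, and the motion threshold are all illustrative assumptions, not the dataset's actual schema or tooling.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SubTask:
    """One atomic action clip with its language instruction (hypothetical schema)."""
    instruction: str
    start_frame: int  # inclusive
    end_frame: int    # exclusive


@dataclass
class Episode:
    """A full teleoperation episode: task instruction plus ordered sub-tasks."""
    task_instruction: str
    subtasks: List[SubTask] = field(default_factory=list)


def trim_static(motion: List[float], threshold: float = 0.01) -> Tuple[int, int]:
    """Return (start, end) frame indices after dropping static frames at both ends.

    `motion` holds one motion magnitude per frame (for example a joint-velocity
    norm). Frames below `threshold` at the start or end count as dead time.
    """
    start = 0
    while start < len(motion) and motion[start] < threshold:
        start += 1
    end = len(motion)
    while end > start and motion[end - 1] < threshold:
        end -= 1
    return start, end


episode = Episode(
    task_instruction="Toast bread, add lettuce and sauce to make a sandwich",
    subtasks=[SubTask("Take the bread out of the toaster", start_frame=120, end_frame=210)],
)

# Two static frames at the start and one at the end are removed;
# the pause in the middle of the clip is kept.
motion = [0.0, 0.0, 0.3, 0.5, 0.0, 0.4, 0.0]
start, end = trim_static(motion)
print(start, end)
```

Only the ends are trimmed here, which matches the description of removing dead time at the start and end of a clip; pauses inside a clip are left alone.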

The word cloud of atomic actions in the dataset reveals over 150 distinct verbs: move, pick, fold, pour, open, release, insert, arrange, stir, stack, rotate, peel, and many more. This action vocabulary is far richer than what most VLA datasets provide.

Why does LingBot-VLA use both automatic (VLM-based) and human annotation for language instructions?

Chapter 3: Unified Action Space

Here is a hard problem: the 9 robot platforms have different numbers of joints, different gripper types, and different control interfaces. A 6-DoF arm and a 7-DoF arm produce action vectors of different sizes. How do you train a single model on all of them?

The Action Representation

LingBot-VLA uses continuous actions in a unified format. Each action step specifies joint positions (or velocities) for both arms plus gripper commands. For robots with fewer degrees of freedom, unused dimensions are masked or zero-padded. The key insight: despite different kinematic chains, all these dual-arm robots perform similar manipulation tasks (grasping, placing, pouring), so a shared action representation can capture the common structure.
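
A small sketch can show what zero-padding with a mask looks like in practice. The layout below (7 joint slots plus 1 gripper slot per arm, 16 dimensions in total) and the function name `to_unified` are assumptions made for illustration; the text only says that unused dimensions are masked or zero-padded.

```python
from typing import List, Sequence, Tuple

MAX_JOINTS_PER_ARM = 7               # assumption: the largest arm is 7-DoF
ARM_SLOT = MAX_JOINTS_PER_ARM + 1    # joint slots plus one gripper command
UNIFIED_DIM = 2 * ARM_SLOT           # two arms -> 16 dimensions


def to_unified(
    left_joints: Sequence[float],
    left_grip: float,
    right_joints: Sequence[float],
    right_grip: float,
) -> Tuple[List[float], List[int]]:
    """Pack a dual-arm action into a fixed-size vector plus a validity mask.

    Unused joint slots stay at 0.0 and get mask value 0, so a loss can skip them.
    """
    action = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    arms = [(left_joints, left_grip), (right_joints, right_grip)]
    for arm_index, (joints, grip) in enumerate(arms):
        if len(joints) > MAX_JOINTS_PER_ARM:
            raise ValueError("arm has more joints than the unified layout allows")
        base = arm_index * ARM_SLOT
        for j, value in enumerate(joints):
            action[base + j] = value
            mask[base + j] = 1
        action[base + MAX_JOINTS_PER_ARM] = grip   # gripper always sits in the last slot
        mask[base + MAX_JOINTS_PER_ARM] = 1
    return action, mask


# A 6-DoF robot leaves the seventh joint slot of each arm empty.
action6, mask6 = to_unified([0.1] * 6, 1.0, [0.2] * 6, 0.0)
# A 7-DoF robot fills every slot.
action7, mask7 = to_unified([0.1] * 7, 1.0, [0.2] * 7, 0.0)
print(sum(mask6), sum(mask7))
```

Keeping the gripper in a fixed slot means the same index always carries the same meaning across robots, which is what lets a single output head serve all embodiments in this sketch.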

Action Chunks

Instead of predicting one action at a time, LingBot-VLA predicts an action chunk — a sequence of T consecutive actions:

$$A_t = [a_t, a_{t+1}, \ldots, a_{t+T-1}]$$

During pre-training, T = 50. This means the model outputs the next 50 time steps of robot motion in one forward pass.

Why chunks, not single steps? Predicting one action at a time creates a compounding error problem: small mistakes accumulate over hundreds of steps. Action chunks provide temporal coherence — the model plans a smooth trajectory rather than reacting step-by-step. This is the same insight behind Diffusion Policy and π0.
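
As a minimal sketch of how chunk targets could be cut from a recorded trajectory, the helper below slides a window of length T over the action sequence. The function name `make_chunks` is hypothetical, and dropping windows that run past the end of the episode is an assumption; the text does not say how episode ends are handled.

```python
from typing import List


def make_chunks(trajectory: List[List[float]], horizon: int) -> List[List[List[float]]]:
    """Return one chunk [a_t, ..., a_{t+T-1}] for every t with a full horizon left."""
    return [trajectory[t:t + horizon] for t in range(len(trajectory) - horizon + 1)]


# Toy trajectory: 60 time steps of a 1-D action whose value equals the step index.
trajectory = [[float(t)] for t in range(60)]
chunks = make_chunks(trajectory, horizon=50)

print(len(chunks))        # number of training targets from this episode
print(chunks[0][0])       # first action of the first chunk
print(chunks[-1][-1])     # last action of the last chunk
```

Each chunk starts one step later than the previous one, so consecutive training targets overlap heavily.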

Flow Matching for Action Generation

The action chunks are generated via Flow Matching, not autoregressive prediction. Starting from Gaussian noise, the model iteratively refines the action chunk toward a coherent trajectory. The training objective is:

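$$\mathcal{L}_{FM} = \mathbb{E}_{s \sim U[0,1],\, A_t,\, \epsilon} \left\| v_\theta(A_{t,s}, O_t, s) - (A_t - \epsilon) \right\|^2$$

where $A_{t,s} = s A_t + (1-s)\epsilon$ is a linear interpolation between noise $\epsilon$ and the ground-truth action chunk $A_t$, and $v_\theta$ is the learned velocity field.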

Think of it this way: the model learns to "push" random noise toward real trajectories. At inference, you start with noise and follow the learned velocity field to produce a smooth action sequence.
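
The objective and the sampling loop can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the action chunk is flattened into a single list, the observation input is omitted, the Euler integrator and its step count are illustrative choices, and an oracle velocity function stands in for the trained network. All function names are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]


def interpolate(action: Vector, noise: Vector, s: float) -> Vector:
    """A_{t,s} = s * A_t + (1 - s) * eps."""
    return [s * a + (1.0 - s) * e for a, e in zip(action, noise)]


def target_velocity(action: Vector, noise: Vector) -> Vector:
    """Regression target A_t - eps, the derivative of A_{t,s} with respect to s."""
    return [a - e for a, e in zip(action, noise)]


def fm_loss(velocity_fn: Callable[[Vector, float], Vector],
            action: Vector, noise: Vector, s: float) -> float:
    """Squared error between the predicted and target velocity at one sampled s."""
    predicted = velocity_fn(interpolate(action, noise, s), s)
    target = target_velocity(action, noise)
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def euler_sample(velocity_fn: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/ds = v(x, s) from s = 0 (pure noise) to s = 1 (action chunk)."""
    x = list(noise)
    step = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * step)
        x = [xi + step * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
action = [0.1, -0.4, 0.7]                     # flattened toy "action chunk"
noise = [rng.gauss(0.0, 1.0) for _ in action]


def oracle(x: Vector, s: float) -> Vector:
    """Stand-in for a perfectly trained v_theta on this single example."""
    return target_velocity(action, noise)


loss = fm_loss(oracle, action, noise, s=0.3)
sample = euler_sample(oracle, noise)
print(loss, sample)
```

With the oracle the loss is zero and integration lands on the ground-truth chunk, which is the behavior the training objective pushes a real network toward.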

Why does LingBot-VLA predict action chunks of 50 steps rather than single actions?

Chapter 4: The Architecture

LingBot-VLA has a clean two-expert design: an understanding expert that processes visual and language inputs, and an action expert that generates motor commands. They communicate through shared self-attention.

Understanding Expert: The VLM Backbone

The understanding expert is a pre-trained Vision-Language Model (Qwen2.5-VL). It takes three inputs:

The VLM encodes these into a rich multimodal representation. It understands what objects are present, what the task requires, and what the current robot configuration looks like.

Action Expert: The Motion Generator

The action expert is a separate transformer pathway that receives the robot's proprioceptive state and produces action chunks via flow matching. It has its own feed-forward layers and embeddings, but shares self-attention with the understanding expert.

Mixture-of-Transformers (MoT)

The two experts are woven together using a Mixture-of-Transformers architecture (inspired by BAGEL). In each layer:

Why shared attention, separate FFNs? The self-attention lets action generation be continuously guided by visual understanding — the action expert can "ask" the VLM about spatial relationships, object locations, and task progress at every layer. But separate FFNs prevent cross-modal interference: the vision-language processing stays clean, and action-specific computations don't corrupt semantic representations.
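
The shared-attention, separate-FFN idea can be sketched with a toy layer. This is heavily simplified and should be read as an illustration only: attention has a single head with no learned projections, no attention mask is applied, and the two "FFNs" are trivial functions chosen so that routing is easy to see. The names `shared_attention` and `mot_layer` are hypothetical.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(tokens: List[Vector]) -> List[Vector]:
    """Dot-product attention over ALL tokens, regardless of which expert owns them."""
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        mixed.append([sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)])
    return mixed


def mot_layer(tokens: List[Vector], expert_ids: List[str],
              ffns: Dict[str, Callable[[Vector], Vector]]) -> List[Vector]:
    """One layer: shared attention with a residual, then a per-expert feed-forward."""
    mixed = shared_attention(tokens)
    hidden = [[t + m for t, m in zip(token, mix)] for token, mix in zip(tokens, mixed)]
    return [ffns[expert](h) for h, expert in zip(hidden, expert_ids)]


ffns = {
    "understanding": lambda h: [2.0 * x for x in h],   # toy stand-in for the VLM FFN
    "action": lambda h: [x + 1.0 for x in h],          # toy stand-in for the action FFN
}
expert_ids = ["understanding", "understanding", "action"]

out = mot_layer([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], expert_ids, ffns)
# Changing only an understanding token also changes the action token's output,
# because both experts meet inside the shared attention.
out_changed = mot_layer([[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]], expert_ids, ffns)
print(out[2], out_changed[2])
```

The design point the sketch isolates is that information crosses between experts only in attention, while each token's feed-forward weights belong to exactly one expert.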

Blockwise Causal Attention

The joint sequence $[O_t, A_t]$ is split into three blocks:

This causal structure ensures the action expert has access to all observation context, but future actions cannot leak information backward into the observation representations.
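
A blockwise causal mask is easy to build once every token carries a block index. In the sketch below the assignment of tokens to blocks (image and language tokens, then a state token, then action tokens) is an assumption made for illustration, as is the function name; only the rule itself, that a token sees its own block and earlier blocks but never later ones, follows from the description above.

```python
from typing import List


def blockwise_causal_mask(block_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    A token sees every token in its own block and in all earlier blocks,
    but nothing in later blocks.
    """
    size = len(block_ids)
    return [[block_ids[j] <= block_ids[i] for j in range(size)] for i in range(size)]


# Assumed layout: 3 image/language tokens, 1 state token, 2 action tokens.
block_ids = [0, 0, 0, 1, 2, 2]
mask = blockwise_causal_mask(block_ids)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Within a block attention is bidirectional, so the two action tokens see each other, while the first three rows never gain access to the state or action columns.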

Optional Depth Integration

For tasks requiring precise spatial awareness, LingBot-VLA adds a depth distillation mechanism. Learnable queries $[Q_1, Q_2, Q_3]$ are processed through the VLM and then aligned with depth tokens from LingBot-Depth via a distillation loss:

$$\mathcal{L}_{\text{distill}} = \mathbb{E}_{Q}\, \left| \mathrm{Proj}(Q) - D \right|$$

This infuses geometric information without requiring depth at inference time — the VLM learns to internalize spatial reasoning during training.
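
Read literally, the distillation term is an L1 distance between projected queries and depth tokens. The sketch below assumes a plain linear map for Proj and averages the absolute error over all elements; the weights, dimensions, and function names are hypothetical.

```python
from typing import List

Vector = List[float]


def project(query: Vector, weight: List[Vector]) -> Vector:
    """Linear projection with one output per weight row (stand-in for Proj)."""
    return [sum(w * q for w, q in zip(row, query)) for row in weight]


def distill_loss(queries: List[Vector], depth_tokens: List[Vector],
                 weight: List[Vector]) -> float:
    """Mean absolute error between Proj(Q_i) and the matching depth token D_i."""
    total, count = 0.0, 0
    for query, depth in zip(queries, depth_tokens):
        for predicted, target in zip(project(query, weight), depth):
            total += abs(predicted - target)
            count += 1
    return total / count


# Toy setup: three 2-D queries projected to 3-D depth-token space.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 2.0], [0.0, 1.0], [2.0, 2.0]]
depth_tokens = [[1.0, 2.0, 3.0], [0.0, 1.0, 1.0], [2.0, 2.0, 5.0]]

loss = distill_loss(queries, depth_tokens, weight)
print(loss)
```

Only the last element of the third query misses its target (4.0 against 5.0), so the toy loss is one unit of error spread over nine elements.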

LingBot-VLA Architecture

The understanding expert (VLM) processes images and language. The action expert generates motor commands. They share self-attention but have separate feed-forward networks.

Why do the understanding expert and action expert share self-attention but have separate feed-forward networks?

Chapter 5: The Training Pipeline

Training a VLA on 20,000 hours of data is a serious engineering challenge. Without an efficient pipeline, a single training run could take months. LingBot-VLA achieves 261 samples per second on 8 GPUs — 1.5 to 2.8 times faster than existing codebases.

Distributed Strategy

LingBot-VLA uses FSDP (Fully Sharded Data Parallel), PyTorch's implementation of ZeRO-style sharding. It shards optimizer states, model parameters, and gradients across GPUs to minimize the per-GPU memory footprint.

But here is the clever part: they construct specific shard groups exclusively for the action expert modules, inspired by Hybrid Sharded Data Parallel (HSDP). The VLM backbone is large but its gradients are well-behaved. The action expert is smaller but its parameters are updated more aggressively. By sharding them differently, they reduce communication overhead without sacrificing memory efficiency.

Operator-Level Optimization

The multimodal fusion of vision, language, and action is fundamentally a sparse attention process — not all tokens need to attend to all other tokens. LingBot-VLA exploits this with two techniques:

Mixed Precision

Reductions (gradient aggregation across GPUs) run in float32 for numerical stability. Storage and communication use bfloat16. This halves memory traffic for the bulk of computation while keeping the critical accumulation steps precise.
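
A small experiment shows why the accumulation step is the one place where precision is kept high. The sketch emulates bfloat16 by truncating a float32 to its top 16 bits, which is a simplification (real hardware usually rounds to nearest, which changes the exact value but not the qualitative effect). The helper names are hypothetical.

```python
import struct
from typing import Callable, List


def to_fp32(x: float) -> float:
    """Round a Python float to float32 precision."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def accumulate(values: List[float], cast: Callable[[float], float]) -> float:
    """Sum values while rounding the running total to the given precision."""
    total = 0.0
    for value in values:
        total = cast(total + value)
    return total


# 1000 small gradient contributions, stored in bfloat16 as they would be on the wire.
grads = [to_bf16(0.001)] * 1000
exact = 1000 * grads[0]

fp32_total = accumulate(grads, to_fp32)   # float32 accumulator
bf16_total = accumulate(grads, to_bf16)   # bfloat16 accumulator

print(f"exact={exact:.6f} fp32={fp32_total:.6f} bf16={bf16_total:.6f}")
```

The float32 accumulator stays close to the exact sum, while the bfloat16 accumulator stalls well below it once each new term is smaller than the spacing between representable values. Storing and sending values in bfloat16 is cheap; summing many of them in bfloat16 is where the damage happens.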

Throughput Scaling

| Codebase | 8 GPUs | 32 GPUs | 128 GPUs | 256 GPUs |
|---|---|---|---|---|
| LingBot-VLA | 261 | 975 | 3,700 | 7,356 |
| StarVLA | 159 | 506 | 1,307 | 2,644 |
| DexBotic | 142 | 443 | 1,079 | |
| OpenPI | 90 | 349 | | |

Samples per second with Qwen2.5-VL-3B backbone. LingBot-VLA maintains near-linear scaling from 8 to 256 GPUs, closely tracking the theoretical maximum.
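
The "near-linear" claim can be quantified from the table itself. The sketch below reads the LingBot-VLA row as 261, 975, 3,700, and 7,356 samples per second and compares per-GPU throughput at each scale with the 8-GPU baseline; the function name is hypothetical.

```python
from typing import Dict

# LingBot-VLA row of the throughput table (samples per second).
THROUGHPUT: Dict[int, float] = {8: 261, 32: 975, 128: 3700, 256: 7356}


def scaling_efficiency(throughput: Dict[int, float], base_gpus: int = 8) -> Dict[int, float]:
    """Per-GPU throughput at each scale, relative to the smallest configuration."""
    base_per_gpu = throughput[base_gpus] / base_gpus
    return {gpus: (rate / gpus) / base_per_gpu for gpus, rate in sorted(throughput.items())}


efficiency = scaling_efficiency(THROUGHPUT)
for gpus, value in efficiency.items():
    print(f"{gpus:>3} GPUs: {value:.1%} of ideal linear scaling")
```

On these numbers, per-GPU throughput at 256 GPUs stays within roughly 12% of the 8-GPU figure, which is what "near-linear" amounts to here.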

Training consists of two stages: (1) Pre-training on the full 20K-hour dataset to learn general manipulation skills, then (2) Post-training on task-specific demonstrations (130 per task) to adapt to specific evaluation tasks.
What two optimizations let LingBot-VLA achieve 1.5-2.8x higher throughput than existing VLA codebases?

Chapter 6: The GM-100 Benchmark

Most VLA papers evaluate on a handful of tasks with vague protocols. LingBot-VLA uses the GM-100 benchmark — 100 carefully designed manipulation tasks, evaluated across 3 platforms, with a strict protocol that makes comparisons meaningful.

Hardware

25 physical robots across 3 platforms: AgileX, AgiBot G1, and Galaxea R1Pro. All dual-arm with parallel-jaw grippers. Each has a head camera and two wrist cameras. All tasks are tabletop-based with the chassis fixed.

Data Collection Protocol

For each of the 100 tasks:

Evaluation Metrics

| Metric | Definition | Purpose |
|---|---|---|
| Success Rate (SR) | Fraction of trials completing all task steps within 3 minutes | Real-world deployment viability |
| Progress Score (PS) | Fraction of sequential sub-task checkpoints completed (e.g., 4/6 steps = 0.67) | Diagnose failure modes, reward partial success |

A trial ends early if 3 consecutive sub-task failures occur or a safety event (collision) happens. Each model gets 15 trials per task per robot for statistical robustness.
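
Both metrics are simple to compute from per-trial records. The sketch below is one plausible reading of the definitions: a trial counts as a success only if every step is completed within the 3-minute limit, and the Progress Score averages the fraction of checkpoints completed. The `Trial` record and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

TIME_LIMIT_S = 180.0  # 3 minutes, from the Success Rate definition


@dataclass
class Trial:
    steps_completed: int
    steps_total: int
    seconds: float


def success_rate(trials: List[Trial]) -> float:
    """Fraction of trials that finish every step within the time limit."""
    wins = sum(
        1 for t in trials
        if t.steps_completed == t.steps_total and t.seconds <= TIME_LIMIT_S
    )
    return wins / len(trials)


def progress_score(trials: List[Trial]) -> float:
    """Mean fraction of sequential sub-task checkpoints completed."""
    return sum(t.steps_completed / t.steps_total for t in trials) / len(trials)


trials = [
    Trial(steps_completed=6, steps_total=6, seconds=95.0),   # full success
    Trial(steps_completed=4, steps_total=6, seconds=180.0),  # partial: ran out of time
    Trial(steps_completed=0, steps_total=6, seconds=40.0),   # failed early
]
sr = success_rate(trials)
ps = progress_score(trials)
print(f"SR={sr:.2f} PS={ps:.2f}")
```

The second trial contributes nothing to SR but 4/6 to PS, which is exactly the partial credit the Progress Score is meant to capture.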

Action diversity as a stress test: About 50% of the atomic actions in the test set are absent from the top 100 most frequent training actions. The benchmark deliberately includes novel action compositions to test whether models truly generalize or merely memorize common patterns.

The total evaluation across all models involves 22,500 trials with comprehensive recording (third-person video, robot states, model predictions) in rosbag format. All recordings will be open-sourced for reproducibility.
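
The 22,500 figure can be reproduced from the protocol numbers. The count of five evaluated models is an assumption: it corresponds to three baselines plus LingBot-VLA with and without depth.

```python
TASKS = 100
PLATFORMS = 3
TRIALS_PER_TASK_PER_PLATFORM = 15
MODELS = 5  # assumption: three baselines plus LingBot-VLA with and without depth

trials_per_model = TASKS * PLATFORMS * TRIALS_PER_TASK_PER_PLATFORM
total_trials = trials_per_model * MODELS

print(trials_per_model)  # 4500
print(total_trials)      # 22500
```

Under that assumption each model is evaluated in 4,500 real-world trials, and the five models together account for the stated total.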

Why does the GM-100 benchmark use both Success Rate and Progress Score?

Chapter 7: Scaling Results

This is the paper's headline result. As pre-training data scales from 3,000 to 20,000 hours, both success rate and progress score climb steadily across all three platforms.

The Scaling Curves

The experiment was run on a subset of 25 representative tasks. At each data volume (3K, 5K, 8K, 12K, 20K hours), the model was pre-trained from scratch, then post-trained on the evaluation tasks.

The key finding: At 20,000 hours, the scaling curves show no sign of saturation. Success rates keep climbing at a consistent rate. This means the community has not yet found the ceiling for real-world VLA scaling — more data will likely yield further improvements.

Equally important: the scaling behavior is consistent across all three platforms. AgileX, AgiBot G1, and Galaxea R1Pro all improve with more data, and their individual trends align with the aggregate curve. The scaling law is not specific to any single embodiment.

Data Efficiency

A well-pre-trained model needs fewer demonstrations to adapt. On 8 representative tasks with the AgiBot G1 platform:

Scaling Behavior

Success rate and progress score across data volumes. Both metrics climb steadily with no saturation. The three platforms track each other closely.

What does the absence of saturation at 20,000 hours imply for the VLA research community?

Chapter 8: Comparison

LingBot-VLA was compared against three state-of-the-art VLA models under strict experimental controls. All models were fine-tuned from public pre-trained checkpoints using the same data, hyperparameters, and evaluation protocol. Here are the real-world results on the GM-100 benchmark:

Real-World GM-100 Results (Average across 100 tasks)

| Model | Success Rate | Progress Score |
|---|---|---|
| WALL-OSS | 4.05% | 10.35% |
| GR00T N1.6 | 7.59% | 15.99% |
| π0.5 | 13.02% | 27.65% |
| LingBot-VLA (no depth) | 15.74% | 33.69% |
| LingBot-VLA (w/ depth) | 17.30% | 35.41% |

Reading these numbers: These success rates may seem low; even the best model succeeds in only about 17% of trials. But GM-100 is an exceptionally hard benchmark: 100 diverse tasks including multi-step sequences like "Toast bread, add lettuce and sauce to make a sandwich." A 17% average over such tasks, across 3 platforms, is a strong result. And the relative ordering is clear: LingBot-VLA w/ depth beats π0.5 by 4.28 percentage points in SR and 7.76 percentage points in PS.

Per-Platform Breakdown

| Platform | π0.5 SR | LingBot-VLA (w/ depth) SR | Improvement (percentage points) |
|---|---|---|---|
| AgiBot G1 | 7.77% | 11.98% | +4.21 |
| AgileX | 17.20% | 18.93% | +1.73 |
| Galaxea R1Pro | 14.10% | 20.98% | +6.88 |
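
The per-platform numbers can be cross-checked against the headline averages. The sketch below assumes the reported averages are unweighted means over the three platforms, and it separates the absolute gain (in percentage points) from the relative gain; variable names are illustrative.

```python
from typing import Dict, Iterable

PI05_SR: Dict[str, float] = {"AgiBot G1": 7.77, "AgileX": 17.20, "Galaxea R1Pro": 14.10}
OURS_SR: Dict[str, float] = {"AgiBot G1": 11.98, "AgileX": 18.93, "Galaxea R1Pro": 20.98}


def mean(values: Iterable[float]) -> float:
    items = list(values)
    return sum(items) / len(items)


# Per-platform gain in percentage points.
gain_pp = {name: round(OURS_SR[name] - PI05_SR[name], 2) for name in PI05_SR}

avg_pi05 = mean(PI05_SR.values())
avg_ours = mean(OURS_SR.values())
absolute_gain_pp = avg_ours - avg_pi05
relative_gain = absolute_gain_pp / avg_pi05

print(gain_pp)
print(f"averages: pi0.5={avg_pi05:.2f}%, LingBot-VLA={avg_ours:.2f}%")
print(f"gain: {absolute_gain_pp:.2f} points absolute, {relative_gain:.0%} relative")
```

The unweighted means round to the reported 13.02% and 17.30%, and a gain of about 4.3 points on a 13% baseline is roughly a one-third relative improvement, which is the more informative way to read a small absolute gap.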

GR00T N1.6 performs well on Galaxea R1Pro (14.29% SR) — comparable to π0.5 — because its pre-training included extensive Galaxea data. This highlights an important point: pre-training data composition matters as much as volume.

Simulation Benchmark (RoboTwin 2.0)

On 50 simulation tasks, LingBot-VLA also leads:

| Model | Clean Scenes | Randomized Scenes |
|---|---|---|
| π0.5 | 82.74% | 76.76% |
| LingBot-VLA (no depth) | 86.50% | 85.34% |
| LingBot-VLA (w/ depth) | 88.56% | 86.68% |

The gap is especially large in randomized scenes (+9.92 percentage points over π0.5, versus +5.82 in clean scenes), where diverse pre-training data helps most.

GM-100 Success Rate Comparison

Average success rate across 100 tasks on the real-world GM-100 benchmark. Higher is better.

By how much does LingBot-VLA (with depth) outperform π0.5 in average success rate on GM-100?

Chapter 9: Connections

LingBot-VLA sits at the intersection of several important research threads. Let's map where it fits.

Relation to π0 and π0.5

π0 pioneered the VLM + action expert design with flow matching. LingBot-VLA adopts the same high-level architecture (VLM backbone + flow-matching action head with blockwise causal attention) but demonstrates that the primary bottleneck is data scale, not architecture. LingBot-VLA's 20K-hour dataset dwarfs what π0 was trained on, and the scaling curves suggest even more data would help.

Relation to GR00T N1.6

NVIDIA's GR00T N1.6 is a generalist humanoid foundation model. The GM-100 comparison reveals an interesting finding: GR00T performs well specifically on platforms whose data was heavily represented in its pre-training (Galaxea R1Pro). This underscores that pre-training data composition — not just architecture — drives cross-embodiment transfer.

Relation to Scaling Laws in LLMs

LingBot-VLA provides evidence of scaling behavior for VLA models on real-world data at a scale not previously studied, analogous to the scaling laws discovered for LLMs (Kaplan et al., 2020). The evidence is an empirical trend across five data volumes from 3K to 20K hours, and the implication is similar: invest in data collection infrastructure, because performance is likely to keep improving with scale.

Relation to Diffusion Policy

Diffusion Policy showed that diffusion-based action generation outperforms earlier behavioral cloning approaches for robot manipulation. LingBot-VLA uses flow matching (a close relative of diffusion) for the same reason: it handles multimodal action distributions better than regression losses.

Cheat Sheet

| Aspect | LingBot-VLA |
|---|---|
| Architecture | VLM (Qwen2.5-VL) + Action Expert, MoT with shared self-attention |
| Action generation | Flow Matching, action chunks of T=50 |
| Pre-training data | ~20,000 hours from 9 dual-arm robot configurations |
| Evaluation | GM-100: 100 tasks, 3 platforms, 130 post-training episodes per task, 15 trials per task per platform, 22,500 trials in total across all models |
| Throughput | 261 samples/sec on 8 GPUs (1.5-2.8x faster than baselines) |
| Depth integration | Optional via distillation from LingBot-Depth |
| Best SR (GM-100) | 17.30% average (w/ depth), vs 13.02% for π0.5 |
| Scaling | No saturation from 3K to 20K hours |
| Open source | Code, model, benchmark data |

The broader lesson: In the VLA domain, just as in language modeling, data scale is a reliable lever for improving performance. A pragmatic, well-engineered system trained on massive diverse real-world data outperforms more architecturally complex models trained on less data. The scaling curves have not plateaued — the race for real-world robot data has just begun.
What parallel does LingBot-VLA's scaling study draw with the history of large language models?