LingBot-VLA — Veanors

Chapter 0: The Problem

You want a robot that can fold towels, peel lemons, disassemble puzzle locks, and arrange flowers. Not in simulation — on a real physical table, with real grippers, handling real objects that slip, squish, and break.

Vision-Language-Action (VLA) models promise exactly this: feed in camera images and a language instruction, get back motor commands. Models like π0 and GR00T have made impressive progress. But a fundamental question remains unanswered:

The scaling question: How do VLA models actually scale with real-world robot data? Does performance plateau after a few thousand hours? Or does it keep climbing? Nobody had run this experiment at scale — until now.

The challenge is threefold. First, collecting real robot data is brutally expensive. Each hour requires a physical robot, a human teleoperator, and careful quality control. Second, training on massive datasets demands an efficient codebase — existing VLA frameworks bottleneck at data I/O and communication overhead. Third, evaluation is fragmented: most papers test on a handful of tasks with a single robot, making comparisons unreliable.

LingBot-VLA attacks all three problems simultaneously. The authors collected 20,000 hours of real teleoperation data from 9 different dual-arm robots, built a training codebase that processes 261 samples per second on 8 GPUs, and evaluated on 100 tasks across 3 platforms with 130 post-training episodes per task per platform.

The result? Success rates climb steadily from 3,000 to 20,000 hours of pre-training data with no sign of saturation. More real data keeps helping.

Data Scaling Curve

Success rate as pre-training data grows from 3K to 20K hours. The curve keeps rising, with no plateau.

Why has no one previously studied how VLA models scale with real-world data at this magnitude?

Collecting 20,000 hours of real robot teleoperation data requires massive hardware and human effort, and existing codebases cannot efficiently train on such volumes
Real-world data is less useful than simulation data for VLA training
VLA models were only invented in 2026

Chapter 1: The Key Insight

LingBot-VLA's philosophy is refreshingly pragmatic: scale real data, not architecture complexity.

Many recent VLA papers compete on clever architectural innovations — spatial representations, novel attention patterns, specialized pre-training objectives. LingBot-VLA bets on a simpler thesis: a clean architecture trained on a truly massive and diverse real-world dataset will outperform fancier models trained on less data.

Thesis 1: Data Volume

Scale pre-training from 3K to 20K hours. Success rates improve consistently and substantially. No saturation even at 20K hours.

↓

Thesis 2: Data Diversity

Collect from 9 different dual-arm robot configurations. The model learns manipulation primitives that transfer across embodiments.

↓

Thesis 3: Efficiency

Build a codebase fast enough to actually train at this scale: 261 samples/sec on 8 GPUs, 1.5-2.8x faster than existing frameworks.

Why this matters: If VLA scaling shows no saturation at 20K hours, then the path to better robot policies is clear: collect more data. This is the same insight that drove the LLM revolution — scaling laws that reward data collection over architectural novelty.

The second key insight is about evaluation rigor. Most VLA papers evaluate on 5-20 tasks with a single robot. LingBot-VLA uses the GM-100 benchmark: 100 tasks on 3 platforms, with 130 post-training episodes per task per platform and 15 evaluation trials per task per platform for each model. Across all compared models this adds up to 22,500 evaluation trials, enough to draw statistically meaningful conclusions.

The third insight is data efficiency. With strong pre-training, LingBot-VLA needs fewer demonstrations to adapt to new tasks. With only 80 demonstrations per task, it already outperforms π0.5 using the full 130-demonstration budget. Good pre-training is a force multiplier.

What is the central bet that distinguishes LingBot-VLA from other recent VLA models?

That massive, diverse real-world data, not architectural novelty, is the primary driver of VLA performance, and that scaling shows no saturation
That simulation data is sufficient for real-world deployment
That a single robot configuration can generalize to all tasks

Chapter 2: Data Collection

Twenty thousand hours of robot data does not appear out of thin air. LingBot-VLA's dataset was collected across 9 different dual-arm robot embodiments, each with its own kinematic configuration, gripper design, and camera setup.

The Nine Robot Platforms

Robot	Arms	Cameras	Teleoperation
AgiBot G1	2 × 7-DoF	3 RGB-D	VR-based
AgileX	2 × 6-DoF	3 cameras	Isomorphic arms
Galaxea R1Lite	2 × 6-DoF	1 stereo + 2 wrist	Teleoperation
Galaxea R1Pro	2 × 7-DoF	1 stereo + 2 wrist	Teleoperation
Realman RS-02	2 × 7-DoF	3 cameras	Teleoperation
Leju KUAVO 4 Pro	2 × 7-DoF	1 head + 2 wrist	Humanoid
Qinglong	2 × 7-DoF	1 head + 2 wrist	Humanoid
ARX Lift2	2 × 6-DoF	3 cameras	Teleoperation
Bimanual Franka	2 × 7-DoF	3 cameras	Teleoperation

Notice the diversity: arms have 6 or 7 degrees of freedom, camera rigs differ in sensor type and placement (RGB-D, stereo, head-mounted, wrist-mounted), and teleoperation methods differ (VR, isomorphic arms, direct). This diversity is intentional: the model must learn manipulation skills that transfer across embodiment differences.

Automatic Annotation Pipeline

Raw teleoperation videos are not directly useful for training. They need language instructions that describe what the robot is doing. LingBot-VLA uses a three-stage annotation process:

Stage 1: Video Segmentation

Human annotators decompose multi-view videos into atomic action clips based on predefined action categories. Static frames at start/end are trimmed to remove dead time.

↓

Stage 2: Automatic Instruction

Qwen3-VL-235B-A22B generates precise language instructions for both full tasks ("Toast bread, add lettuce and sauce to make a sandwich") and sub-tasks ("Take the bread out of the toaster").

↓

Stage 3: Human Refinement

Human reviewers verify and correct the auto-generated instructions, ensuring accuracy of sub-task boundaries and language descriptions.

Why sub-task decomposition? A complex task like "make a sandwich" involves dozens of atomic actions: reach for bread, grasp bread, place bread in toaster, press lever, wait, extract bread, etc. Training on sub-task annotations lets the model learn reusable manipulation primitives that compose into longer sequences.
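To make the annotation output concrete, here is a minimal sketch of how one annotated episode might be stored. The class names, field names, and frame indices are hypothetical and chosen only for illustration; the source does not describe the actual storage schema.

```python
from dataclasses import dataclass, field


@dataclass
class SubTask:
    instruction: str   # atomic instruction for this clip
    start_frame: int   # inclusive (hypothetical index)
    end_frame: int     # exclusive (hypothetical index)


@dataclass
class Episode:
    task_instruction: str
    sub_tasks: list = field(default_factory=list)

    def instruction_at(self, frame: int) -> str:
        """Return the sub-task instruction covering a frame, else the full task."""
        for st in self.sub_tasks:
            if st.start_frame <= frame < st.end_frame:
                return st.instruction
        return self.task_instruction


episode = Episode(
    task_instruction="Toast bread, add lettuce and sauce to make a sandwich",
    sub_tasks=[SubTask("Take the bread out of the toaster", 120, 210)],
)
print(episode.instruction_at(150))
```

Keeping both levels in one record lets a data loader condition on either the full task or the current sub-task for the same frames.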

The word cloud of atomic actions in the dataset reveals over 150 distinct verbs: move, pick, fold, pour, open, release, insert, arrange, stir, stack, rotate, peel, and many more. This action vocabulary is far richer than what most VLA datasets provide.

Why does LingBot-VLA use both automatic (VLM-based) and human annotation for language instructions?

Automatic annotation with a VLM scales to 20K hours, but human refinement catches errors in sub-task boundaries and instruction accuracy that the VLM misses
Human annotation is faster than VLM-based annotation
The VLM cannot process video data

Chapter 3: Unified Action Space

Here is a hard problem: the 9 robot platforms have different numbers of joints, different gripper types, and different control interfaces. A 6-DoF arm and a 7-DoF arm produce action vectors of different sizes. How do you train a single model on all of them?

The Action Representation

LingBot-VLA uses continuous actions in a unified format. Each action step specifies joint positions (or velocities) for both arms plus gripper commands. For robots with fewer degrees of freedom, unused dimensions are masked or zero-padded. The key insight: despite different kinematic chains, all these dual-arm robots perform similar manipulation tasks (grasping, placing, pouring), so a shared action representation can capture the common structure.
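The padding-and-masking idea can be sketched as follows. The layout used here (7 slots per arm followed by two gripper values, 16 dimensions in total) is an assumption made for illustration; the source only says that unused dimensions are masked or zero-padded.

```python
MAX_ARM_DOF = 7  # assumed: the largest arm in the platform table has 7 joints


def to_unified(left_joints, right_joints, left_grip, right_grip):
    """Zero-pad each arm to MAX_ARM_DOF and return (action, mask)."""
    action, mask = [], []
    for joints in (left_joints, right_joints):
        pad = MAX_ARM_DOF - len(joints)
        if pad < 0:
            raise ValueError("arm has more joints than the unified layout allows")
        action += list(joints) + [0.0] * pad
        mask += [1] * len(joints) + [0] * pad   # 0 marks a padded slot
    action += [left_grip, right_grip]
    mask += [1, 1]
    return action, mask


six_dof = to_unified([0.1] * 6, [0.2] * 6, 1.0, 0.0)
seven_dof = to_unified([0.1] * 7, [0.2] * 7, 1.0, 0.0)
print(len(six_dof[0]), len(seven_dof[0]))
```

Both robots end up with vectors of the same length, and the mask lets a loss function ignore the padded slots of the 6-DoF arms.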

Action Chunks

Instead of predicting one action at a time, LingBot-VLA predicts an action chunk — a sequence of T consecutive actions:

A_t = [a_t, a_{t+1}, \dots, a_{t+T-1}]

During pre-training, T = 50. This means the model outputs the next 50 time steps of robot motion in one forward pass.

Why chunks, not single steps? Predicting one action at a time creates a compounding error problem: small mistakes accumulate over hundreds of steps. Action chunks provide temporal coherence — the model plans a smooth trajectory rather than reacting step-by-step. This is the same insight behind Diffusion Policy and π0.
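A minimal sketch of how a chunk could be cut from a recorded trajectory is shown below. The function name is hypothetical, and repeating the final action near the end of an episode is an assumed padding policy, not something the source specifies.

```python
def action_chunk(trajectory, t, horizon=50):
    """Return A_t = [a_t, ..., a_{t+horizon-1}] from a list of per-step actions."""
    chunk = trajectory[t:t + horizon]
    # Assumption: near the end of an episode, repeat the final action as padding.
    chunk += [trajectory[-1]] * (horizon - len(chunk))
    return chunk


trajectory = list(range(120))          # stand-in: each integer is one action
first = action_chunk(trajectory, 0)
last = action_chunk(trajectory, 100)   # only 20 real steps remain
print(len(first), len(last))
```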

Flow Matching for Action Generation

The action chunks are generated via Flow Matching, not autoregressive prediction. Starting from Gaussian noise, the model iteratively refines the action chunk toward a coherent trajectory. The training objective is:

\mathcal{L}_{FM} = \mathbb{E}_{s \sim \mathcal{U}[0,1],\, A_t,\, \epsilon} \left\| v_\theta(A_{t,s}, O_t, s) - (A_t - \epsilon) \right\|^2

Where A_{t,s} = s A_t + (1 - s)\epsilon is a linear interpolation between noise \epsilon and the ground-truth action chunk A_t, and v_\theta is the learned velocity field.

Think of it this way: the model learns to "push" random noise toward real trajectories. At inference, you start with noise and follow the learned velocity field to produce a smooth action sequence.
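The objective and the sampling loop can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the "network" is replaced by an oracle that returns the exact velocity toward a known target, the observation O_t is omitted, and the step count of 10 is an arbitrary assumption.

```python
import random


def interpolate(action, noise, s):
    """A_{t,s} = s * A_t + (1 - s) * eps, element-wise."""
    return [s * a + (1.0 - s) * e for a, e in zip(action, noise)]


def fm_loss(predicted_velocity, action, noise):
    """Squared error against the target velocity A_t - eps."""
    return sum((v - (a - e)) ** 2
               for v, a, e in zip(predicted_velocity, action, noise))


def sample(velocity_fn, noise, num_steps=10):
    """Euler integration of dx/ds = v(x, s) from s = 0 (noise) to s = 1."""
    x, h = list(noise), 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * h)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
target = [0.5, -0.2, 0.8]                       # flattened toy action chunk
eps = [random.gauss(0.0, 1.0) for _ in target]


def oracle(x, s):
    """Stand-in for a trained v_theta: exact velocity toward `target`."""
    return [(a - xi) / (1.0 - s) for a, xi in zip(target, x)]


result = sample(oracle, eps)
print(result)
```

Because the interpolation path is a straight line, Euler steps with the exact velocity land on the target; a trained network only approximates this field, so real samples are approximate.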

Why does LingBot-VLA predict action chunks of 50 steps rather than single actions?

Action chunks provide temporal coherence and smooth trajectories, avoiding the compounding error problem of step-by-step prediction
Single actions require too much GPU memory
The robot hardware only accepts 50 commands at a time

Chapter 4: The Architecture

LingBot-VLA has a clean two-expert design: an understanding expert that processes visual and language inputs, and an action expert that generates motor commands. They communicate through shared self-attention.

Understanding Expert: The VLM Backbone

The understanding expert is a pre-trained Vision-Language Model (Qwen2.5-VL). It takes three inputs:

Three-view images (I₁, I₂, I₃): typically a head camera and two wrist cameras, providing the robot's view of the workspace
Task instruction T: the language command ("Arrange flowers in the vase")
Robot state s: the current joint positions and gripper state

The VLM encodes these into a rich multimodal representation. It understands what objects are present, what the task requires, and what the current robot configuration looks like.

Action Expert: The Motion Generator

The action expert is a separate transformer pathway that receives the robot's proprioceptive state and produces action chunks via flow matching. It has its own feed-forward layers and embeddings, but shares self-attention with the understanding expert.

Mixture-of-Transformers (MoT)

The two experts are woven together using a Mixture-of-Transformers architecture (inspired by BAGEL). In each layer:

The understanding expert processes vision-language tokens through its own FFN
The action expert processes action tokens through its own FFN
Both share a single self-attention mechanism, so action tokens can attend to visual/language tokens and vice versa

Why shared attention, separate FFNs? The self-attention lets action generation be continuously guided by visual understanding — the action expert can "ask" the VLM about spatial relationships, object locations, and task progress at every layer. But separate FFNs prevent cross-modal interference: the vision-language processing stays clean, and action-specific computations don't corrupt semantic representations.
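The routing described above can be sketched with NumPy. This is a deliberately simplified single-head layer under several assumptions: layer norms, multi-head splitting, and the attention mask are omitted, the query/key/value weights are shared by both token types, and all sizes and parameter names are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2   # two-layer ReLU MLP


def mot_layer(tokens, is_action, params):
    """One simplified layer: joint attention, then a per-expert FFN."""
    q, k, v = (tokens @ params[n] for n in ("wq", "wk", "wv"))
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v   # all tokens interact here
    h = tokens + attn
    out = np.empty_like(h)
    out[~is_action] = h[~is_action] + ffn(h[~is_action], *params["vl_ffn"])
    out[is_action] = h[is_action] + ffn(h[is_action], *params["act_ffn"])
    return out


rng = np.random.default_rng(0)
d = 8
params = {
    "wq": rng.normal(size=(d, d)),
    "wk": rng.normal(size=(d, d)),
    "wv": rng.normal(size=(d, d)),
    "vl_ffn": (rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))),
    "act_ffn": (rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))),
}
tokens = rng.normal(size=(6, d))                 # 4 vision-language + 2 action tokens
is_action = np.array([False] * 4 + [True] * 2)
out = mot_layer(tokens, is_action, params)
print(out.shape)
```

Within a layer, changing the action FFN weights alters only the action rows of the output, which is the isolation property the paragraph above describes.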

Blockwise Causal Attention

The joint sequence [O_t, A_t] is split into three blocks:

Block 1: image and text tokens [I₁, I₂, I₃, T], with bidirectional attention within the block
Block 2: robot state [s], which can attend to Block 1 and itself
Block 3: action chunk [a_t, ..., a_{t+T-1}], which can attend to all preceding blocks

This causal structure ensures the action expert has access to all observation context, but future actions cannot leak information backward into the observation representations.
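The mask implied by these three blocks can be built from a per-token block index, as in the sketch below. Letting action tokens attend to each other within Block 3 is an assumption; the source only states that they attend to all preceding blocks. The token counts are toy values.

```python
def blockwise_causal_mask(block_ids):
    """mask[i][j] is True when token i may attend to token j."""
    return [[bj <= bi for bj in block_ids] for bi in block_ids]


# Toy sequence: 3 image/text tokens (block 0), 1 state token (block 1),
# 2 action tokens (block 2).
block_ids = [0, 0, 0, 1, 2, 2]
mask = blockwise_causal_mask(block_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row shows which positions a token can see: observation rows never contain an "x" in the state or action columns, so nothing leaks backward.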

Optional Depth Integration

For tasks requiring precise spatial awareness, LingBot-VLA adds a depth distillation mechanism. Learnable queries [Q₁, Q₂, Q₃] are processed through the VLM and then aligned with depth tokens from LingBot-Depth via a distillation loss:

\mathcal{L}_{distill} = \mathbb{E}_{Q} \left| \mathrm{Proj}(Q) - D \right|

This infuses geometric information without requiring depth at inference time — the VLM learns to internalize spatial reasoning during training.
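As a rough illustration of the distillation term, the sketch below computes an L1 distance between linearly projected queries and depth tokens. Treating Proj as a single linear map and averaging over all elements are assumptions; the numbers are made up for the example.

```python
def project(query, weight):
    """Linear projection: one output per row of `weight`."""
    return [sum(w * q for w, q in zip(row, query)) for row in weight]


def distill_loss(queries, depth_tokens, weight):
    """Mean absolute error between Proj(Q) and the depth tokens D."""
    total, count = 0.0, 0
    for q, d in zip(queries, depth_tokens):
        for p, target in zip(project(q, weight), d):
            total += abs(p - target)
            count += 1
    return total / count


identity = [[1.0, 0.0], [0.0, 1.0]]    # toy projection weights
queries = [[1.0, 2.0]]                 # one toy query vector
depth_tokens = [[1.5, 2.0]]            # matching toy depth token
loss = distill_loss(queries, depth_tokens, identity)
print(loss)
```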

LingBot-VLA Architecture

The understanding expert (VLM) processes images and language. The action expert generates motor commands. They share self-attention but have separate feed-forward networks.

Why do the understanding expert and action expert share self-attention but have separate feed-forward networks?

Shared attention lets action generation be guided by visual understanding at every layer, while separate FFNs prevent cross-modal interference between vision-language and action processing
It reduces the total number of parameters
Separate FFNs allow training on different GPUs

Chapter 5: The Training Pipeline

Training a VLA on 20,000 hours of data is a serious engineering challenge. Without an efficient pipeline, a single training run could take months. LingBot-VLA achieves 261 samples per second on 8 GPUs — 1.5 to 2.8 times faster than existing codebases.

Distributed Strategy

LingBot-VLA uses FSDP (Fully Sharded Data Parallel), PyTorch's implementation of ZeRO-style sharding (Zero Redundancy Optimizer). It shards optimizer states, model parameters, and gradients across GPUs to minimize the per-GPU memory footprint.

But here is the clever part: they construct specific shard groups exclusively for the action expert modules, inspired by Hybrid Sharded Data Parallel (HSDP). The VLM backbone is large but its gradients are well-behaved. The action expert is smaller but its parameters are updated more aggressively. By sharding them differently, they reduce communication overhead without sacrificing memory efficiency.
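The idea of hybrid sharding can be illustrated without any distributed runtime by listing which ranks would share parameter shards. The helper below is hypothetical, and the sizes (16 GPUs, action expert sharded in groups of 8) are assumptions for illustration; the source does not state the actual group sizes.

```python
def hybrid_groups(world_size, shard_size):
    """Split ranks into shard groups (parameters sharded inside each group)
    and replica groups (ranks holding the same shard index)."""
    if world_size % shard_size:
        raise ValueError("world_size must be a multiple of shard_size")
    shard_groups = [list(range(s, s + shard_size))
                    for s in range(0, world_size, shard_size)]
    replica_groups = [list(range(i, world_size, shard_size))
                      for i in range(shard_size)]
    return shard_groups, replica_groups


# Assumed sizes: backbone sharded across all 16 ranks, action expert
# sharded within groups of 8 and replicated across the two groups.
backbone_shards, _ = hybrid_groups(16, 16)
expert_shards, expert_replicas = hybrid_groups(16, 8)
print(expert_shards)
print(expert_replicas[0])
```

With smaller shard groups, gathering the action expert's parameters involves fewer ranks, at the cost of storing one copy of those parameters per group.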

Operator-Level Optimization

The multimodal fusion of vision, language, and action is fundamentally a sparse attention process — not all tokens need to attend to all other tokens. LingBot-VLA exploits this with two techniques:

FlexAttention: computes only the attention entries that matter (following the blockwise causal mask), avoiding wasted FLOPs on masked-out token pairs
torch.compile: fuses multiple small operations into single GPU kernels, reducing kernel launch overhead and redundant reads and writes to GPU memory

Mixed Precision

Reductions (gradient aggregation across GPUs) run in float32 for numerical stability. Storage and communication use bfloat16. This halves memory traffic for the bulk of computation while keeping the critical accumulation steps precise.
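The reason reductions stay in float32 can be shown with a small emulation. The sketch below imitates bfloat16 by truncating a float32 to its top 16 bits (real hardware rounds to nearest, so this is a simplification) and accumulates the same small value 1,000 times in both precisions. The values are arbitrary.

```python
import struct


def to_f32(x):
    """Round a Python float to float32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.
    Real hardware rounds to nearest; truncation keeps the sketch short."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


grad = 0.001
acc_bf16, acc_f32 = 1.0, 1.0
for _ in range(1000):
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(grad))   # low-precision accumulation
    acc_f32 = to_f32(acc_f32 + to_bf16(grad))      # bf16 values, float32 sum
print(acc_bf16, acc_f32)
```

The bfloat16 accumulator never moves away from 1.0 because each increment is smaller than its spacing near 1.0, while the float32 accumulator ends close to 2.0.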

Throughput Scaling

Codebase	8 GPUs	32 GPUs	128 GPUs	256 GPUs
LingBot-VLA	261	975	3,700	7,356
StarVLA	159	506	1,307	2,644
DexBotic	142	443	1,079	—
OpenPI	90	349	—	—

Samples per second with Qwen2.5-VL-3B backbone. LingBot-VLA maintains near-linear scaling from 8 to 256 GPUs, closely tracking the theoretical maximum.
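One way to quantify "near-linear" is to divide each measured throughput by what perfect linear scaling from the 8-GPU figure would give. The sketch below does this for the table values; the efficiency definition is a common convention chosen here, not one stated in the source.

```python
throughput = {  # samples/sec, copied from the table above
    "LingBot-VLA": {8: 261, 32: 975, 128: 3700, 256: 7356},
    "StarVLA": {8: 159, 32: 506, 128: 1307, 256: 2644},
    "DexBotic": {8: 142, 32: 443, 128: 1079},
    "OpenPI": {8: 90, 32: 349},
}


def scaling_efficiency(rates, base=8):
    """Measured throughput divided by perfect linear scaling from `base` GPUs."""
    return {n: r / (rates[base] * n / base) for n, r in rates.items()}


efficiency = {name: scaling_efficiency(r) for name, r in throughput.items()}
for name, eff in efficiency.items():
    print(name, {n: round(e, 2) for n, e in eff.items()})
```

Under this definition LingBot-VLA retains roughly 88% of ideal linear throughput at 256 GPUs, compared with roughly 52% for StarVLA.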

Training consists of two stages: (1) Pre-training on the full 20K-hour dataset to learn general manipulation skills, then (2) Post-training on task-specific demonstrations (130 per task) to adapt to specific evaluation tasks.

What two optimizations let LingBot-VLA achieve 1.5-2.8x higher throughput than existing VLA codebases?

Using smaller models and less data
Training in simulation instead of on real data
FlexAttention for sparse attention computation and HSDP-style shard groups that reduce communication overhead for the action expert

Chapter 6: The GM-100 Benchmark

Most VLA papers evaluate on a handful of tasks with vague protocols. LingBot-VLA uses the GM-100 benchmark — 100 carefully designed manipulation tasks, evaluated across 3 platforms, with a strict protocol that makes comparisons meaningful.

Hardware

25 physical robots across 3 platforms: AgileX, AgiBot G1, and Galaxea R1Pro. All dual-arm with parallel-jaw grippers. Each has a head camera and two wrist cameras. All tasks are tabletop-based with the chassis fixed.

Data Collection Protocol

For each of the 100 tasks:

150 raw trajectories are collected via teleoperation per platform
The top 130 (ranked by execution quality, motion smoothness, protocol adherence) are retained
Object poses are randomized for each trajectory to prevent spatial memorization
Automated filtering removes technical anomalies, followed by manual review

Evaluation Metrics

Metric	Definition	Purpose
Success Rate (SR)	Fraction of trials completing all task steps within 3 minutes	Real-world deployment viability
Progress Score (PS)	Fraction of sequential sub-task checkpoints completed (e.g., 4/6 steps = 0.67)	Diagnose failure modes, reward partial success

A trial ends early if 3 consecutive sub-task failures occur or a safety event (collision) happens. Each model gets 15 trials per task per platform for statistical robustness.
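The two metrics can be computed from per-trial checkpoint counts, as in this sketch. The trial outcomes are hypothetical, and averaging the per-trial progress fractions is an assumed aggregation; the source defines the per-trial score but not how trials are combined.

```python
def success_rate(trials):
    """Fraction of trials in which every checkpoint was completed."""
    return sum(done == total for done, total in trials) / len(trials)


def progress_score(trials):
    """Mean fraction of sequential checkpoints completed per trial."""
    return sum(done / total for done, total in trials) / len(trials)


# Hypothetical outcomes for a 6-step task: (checkpoints completed, total).
trials = [(6, 6), (4, 6), (0, 6)]
print(round(success_rate(trials), 2), round(progress_score(trials), 2))
```

Only the first trial counts toward SR, while the second still contributes 4/6 to PS, which is how partial progress shows up in the score.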

Action diversity as a stress test: About 50% of the atomic actions in the test set are absent from the top 100 most frequent training actions. The benchmark deliberately includes novel action compositions to test whether models truly generalize or merely memorize common patterns.

The total evaluation across all models involves 22,500 trials with comprehensive recording (third-person video, robot states, model predictions) in rosbag format. All recordings will be open-sourced for reproducibility.
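The 22,500 figure is consistent with the protocol numbers above, assuming the five model variants in the GM-100 comparison table are the ones counted. That assumption is an inference from the totals, not something the source states explicitly.

```python
tasks, platforms, trials_per_task_platform = 100, 3, 15
models = ["WALL-OSS", "GR00T N1.6", "π0.5",
          "LingBot-VLA (no depth)", "LingBot-VLA (w/ depth)"]  # assumed set

per_model = tasks * platforms * trials_per_task_platform
total = per_model * len(models)
print(per_model, total)
```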

Why does the GM-100 benchmark use both Success Rate and Progress Score?

Success Rate measures deployment viability (did the task fully complete?), while Progress Score diagnoses failure modes and rewards partial progress through multi-step tasks
Progress Score is faster to compute than Success Rate
Success Rate only works for single-step tasks

Chapter 7: Scaling Results

This is the paper's headline result. As pre-training data scales from 3,000 to 20,000 hours, both success rate and progress score climb steadily across all three platforms.

The Scaling Curves

The experiment was run on a subset of 25 representative tasks. At each data volume (3K, 5K, 8K, 12K, 20K hours), the model was pre-trained from scratch, then post-trained on the evaluation tasks.

The key finding: At 20,000 hours, the scaling curves show no sign of saturation. Success rates keep climbing at a consistent rate. This means the community has not yet found the ceiling for real-world VLA scaling — more data will likely yield further improvements.

Equally important: the scaling behavior is consistent across all three platforms. AgileX, AgiBot G1, and Galaxea R1Pro all improve with more data, and their individual trends align with the aggregate curve. The scaling law is not specific to any single embodiment.

Data Efficiency

A well-pre-trained model needs fewer demonstrations to adapt. On 8 representative tasks with the AgiBot G1 platform:

LingBot-VLA with 80 demonstrations outperforms π0.5 with 130 demonstrations in both SR and PS
The gap widens as more post-training data is added, showing the pre-training provides a better foundation for learning

Scaling Behavior

Success rate and progress score across data volumes. Both metrics climb steadily with no saturation. The three platforms track each other closely.


What does the absence of saturation at 20,000 hours imply for the VLA research community?

That collecting more real-world data will likely continue to improve VLA performance; the community has not yet found the data scaling ceiling
That 20,000 hours is the optimal dataset size
That simulation data is no longer needed

Chapter 8: Comparison

LingBot-VLA was compared against three state-of-the-art VLA models under strict experimental controls. All models were fine-tuned from public pre-trained checkpoints using the same data, hyperparameters, and evaluation protocol. Here are the real-world results on the GM-100 benchmark:

Real-World GM-100 Results (Average across 100 tasks)

Model	Success Rate	Progress Score
WALL-OSS	4.05%	10.35%
GR00T N1.6	7.59%	15.99%
π0.5	13.02%	27.65%
LingBot-VLA (no depth)	15.74%	33.69%
LingBot-VLA (w/ depth)	17.30%	35.41%

Reading these numbers: These success rates may seem low; even the best model succeeds in only about 17% of trials. But GM-100 is an exceptionally hard benchmark: 100 diverse tasks including multi-step sequences like "Toast bread, add lettuce and sauce to make a sandwich." A 17% average over such tasks, across 3 platforms, is a strong result. And the relative ordering is clear: LingBot-VLA w/ depth beats π0.5 by 4.28 percentage points in SR and 7.76 percentage points in PS.

Per-Platform Breakdown

Platform	π0.5 SR	Ours (depth) SR	Improvement
AgiBot G1	7.77%	11.98%	+4.21%
AgileX	17.20%	18.93%	+1.73%
Galaxea R1Pro	14.10%	20.98%	+6.88%
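The improvements quoted for these results are differences in percentage points. A short sketch makes the distinction from relative improvement explicit, using only the success rates reported above; the helper name is hypothetical.

```python
def gains(baseline, ours):
    """Return (percentage-point gain, relative gain in percent)."""
    return ours - baseline, 100.0 * (ours - baseline) / baseline


average_sr = gains(13.02, 17.30)        # π0.5 vs LingBot-VLA (w/ depth)
per_platform = {
    "AgiBot G1": gains(7.77, 11.98),
    "AgileX": gains(17.20, 18.93),
    "Galaxea R1Pro": gains(14.10, 20.98),
}
print(round(average_sr[0], 2), round(average_sr[1], 1))
```

On the 100-task average, the 4.28-point gap corresponds to a relative improvement of about 32.9% over π0.5.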

GR00T N1.6 performs well on Galaxea R1Pro (14.29% SR) — comparable to π0.5 — because its pre-training included extensive Galaxea data. This highlights an important point: pre-training data composition matters as much as volume.

Simulation Benchmark (RoboTwin 2.0)

On 50 simulation tasks, LingBot-VLA also leads:

Model	Clean Scenes	Randomized Scenes
π0.5	82.74%	76.76%
LingBot-VLA (no depth)	86.50%	85.34%
LingBot-VLA (w/ depth)	88.56%	86.68%

The gap is especially large in randomized scenes (+9.92 percentage points over π0.5 for the depth variant), where diverse pre-training data helps most.

GM-100 Success Rate Comparison

Average success rate across 100 tasks on the real-world GM-100 benchmark. Higher is better.

By how much does LingBot-VLA (with depth) outperform π0.5 in average success rate on GM-100?

+4.28% absolute (17.30% vs 13.02%), with even larger gains in progress score (+7.76%)
+50% relative improvement
The models perform the same within error bars

Chapter 9: Connections

LingBot-VLA sits at the intersection of several important research threads. Let's map where it fits.

Relation to π0 and π0.5

π0 pioneered the VLM + action expert design with flow matching. LingBot-VLA adopts the same high-level architecture (VLM backbone + flow-matching action head with blockwise causal attention) but demonstrates that the primary bottleneck is data scale, not architecture. LingBot-VLA's 20K-hour dataset dwarfs what π0 was trained on, and the scaling curves suggest even more data would help.

Relation to GR00T N1.6

NVIDIA's GR00T N1.6 is a generalist humanoid foundation model. The GM-100 comparison reveals an interesting finding: GR00T performs well specifically on platforms whose data was heavily represented in its pre-training (Galaxea R1Pro). This underscores that pre-training data composition — not just architecture — drives cross-embodiment transfer.

Relation to Scaling Laws in LLMs

LingBot-VLA provides the first evidence of scaling laws for VLA models on real-world data, analogous to the scaling laws discovered for LLMs (Kaplan et al., 2020). The implication is the same: invest in data collection infrastructure, because performance will keep improving with scale.

Relation to Diffusion Policy

Diffusion Policy, itself a behavior-cloning method, showed that diffusion-based action generation outperforms earlier imitation-learning baselines for robot manipulation. LingBot-VLA uses flow matching (a close relative of diffusion) for the same reason: it handles multimodal action distributions better than regression losses.

Cheat Sheet

Aspect	LingBot-VLA
Architecture	VLM (Qwen2.5-VL) + Action Expert, MoT with shared self-attention
Action generation	Flow Matching, action chunks of T=50
Pre-training data	~20,000 hours from 9 dual-arm robot configurations
Evaluation	GM-100: 100 tasks, 3 platforms, 130 post-training episodes and 15 evaluation trials per task per platform, 22,500 total trials across all models
Throughput	261 samples/sec on 8 GPUs (1.5-2.8x faster than baselines)
Depth integration	Optional via distillation from LingBot-Depth
Best SR (GM-100)	17.30% average (w/ depth), vs 13.02% for π0.5
Scaling	No saturation from 3K to 20K hours
Open source	Code, model, benchmark data

The broader lesson: In the VLA domain, just as in language modeling, data scale is a reliable lever for improving performance. A pragmatic, well-engineered system trained on massive diverse real-world data outperforms more architecturally complex models trained on less data. The scaling curves have not plateaued — the race for real-world robot data has just begun.

What parallel does LingBot-VLA's scaling study draw with the history of large language models?

Just as LLMs showed that more text data reliably improves performance (scaling laws), LingBot-VLA provides the first evidence that VLA models also scale predictably with real-world robot data volume
Both LLMs and VLAs use the same transformer architecture
Both require simulation data for pre-training

A Pragmatic VLA Foundation Model