VLA-0: Zero Modification VLAs

Chapter 0: The Problem

You want to build a robot that follows language instructions. You have a powerful Vision-Language Model — it can describe images, answer questions, write code. Now you want it to also output robot actions: move the arm here, close the gripper, rotate the wrist.

The obvious approach: take your VLM and add an action output. But how you add that output matters enormously. And every existing approach requires modifying the VLM in some way.

What people have tried

New vocabulary tokens: RT-2 and OpenVLA add 256 special tokens to the VLM's vocabulary, one per discretized action bin. This corrupts the original embedding space and limits action resolution to 256 levels.
Diffusion action heads: π0 and SmolVLA attach a flow-matching or diffusion network to the VLM's hidden states. This adds millions of untrained parameters and risks degrading the VLM's language grounding.
Custom tokenizers: π0-FAST builds a dedicated action tokenizer with a learned codebook. Effective, but complex to implement and train.

Every one of these modifications introduces engineering complexity, training instability, and the risk of catastrophic forgetting — where the VLM loses its original language and vision capabilities while learning to predict actions.

The uncomfortable question: What if all these modifications are unnecessary? What if a VLM can already predict robot actions — as plain text — without changing a single thing about its architecture?

That is exactly what VLA-0 shows. Take a VLM. Change nothing. Fine-tune it to output actions as space-separated integers. On LIBERO, it outperforms every compared method that is trained without large-scale action pretraining.

What is the core risk of modifying a VLM's architecture to add action outputs?

It makes inference slower
Modifications can cause catastrophic forgetting: the VLM loses language/vision capabilities while learning actions
The action outputs are always lower quality

Chapter 1: The Key Insight

Here is the insight that makes VLA-0 work: VLMs already know how to output numbers.

Think about it. A model like Qwen-VL-2.5 has been trained on billions of text tokens. It has seen coordinates, measurements, prices, dates, temperatures — numbers in every conceivable context. It knows the token "423" and the token "17" and how to produce sequences of space-separated integers.

A robot action is just a vector of numbers. A 7-DoF arm has 7 continuous values per timestep: x, y, z position, roll, pitch, yaw orientation, and gripper state. If you normalize each value to an integer in [0, 1000] and write them as text, you get something like:

423 17 891 500 334 672 1000

That is a perfectly valid text string. The VLM can generate it with standard autoregressive decoding. No new tokens. No action heads. No vocabulary modifications.

Why integers, not floats? Floating-point numbers like "0.4231" require multiple tokens per value and introduce ambiguity in tokenization. The integer "423" is a single clean token in most vocabularies. By mapping continuous actions to the integer range [0, 1000], VLA-0 gets high resolution (0.1% of the range per step) with maximally efficient tokenization.

The system prompt

VLA-0 uses a carefully designed system prompt that tells the VLM exactly what to output:

"Analyze the input image and predict robot actions
for the next H timesteps. Each action has D dimensions.
Output a single sequence of H x D integers (0-B each),
representing the H timesteps sequentially.
Provide only space-separated numbers. Nothing else."

For a 7-DoF arm predicting 4 timesteps ahead, the model outputs 28 integers: 423 17 891 500 334 672 1000 418 22 887 .... The first 7 are timestep 1, the next 7 are timestep 2, and so on.
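
As a rough illustration of how the prompt template above could be filled in, here is a minimal sketch. The function name `build_system_prompt` and the choice to substitute the product H×D as a single number are assumptions made for this example; the source only shows the template with the placeholders H, D, and B.

```python
# Hypothetical helper: fills the placeholders H, D, and B in the prompt template.
# Substituting the product H*D as one number is an assumption of this sketch.
PROMPT_TEMPLATE = (
    "Analyze the input image and predict robot actions "
    "for the next {H} timesteps. Each action has {D} dimensions. "
    "Output a single sequence of {N} integers (0-{B} each), "
    "representing the {H} timesteps sequentially. "
    "Provide only space-separated numbers. Nothing else."
)


def build_system_prompt(horizon: int, action_dim: int, bins: int = 1000) -> str:
    """Return the system prompt for a given horizon, action dimension, and resolution."""
    return PROMPT_TEMPLATE.format(
        H=horizon, D=action_dim, N=horizon * action_dim, B=bins
    )


prompt = build_system_prompt(horizon=4, action_dim=7)
print(prompt)
```

For the 7-DoF, 4-timestep case this asks the model for 28 integers, matching the count described above.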

The data flow: Image (224×224×3) + system prompt (text) → Qwen-VL-2.5 3B (frozen architecture, fine-tuned weights) → text string of H×D space-separated integers → parse to integers → denormalize to continuous values in original action range → send to robot.

Why does VLA-0 use integers instead of floating-point numbers for action representation?

Integers like "423" are single clean tokens in most vocabularies, while floats like "0.4231" require multiple tokens and introduce tokenization ambiguity
Integers are more mathematically precise
The robot hardware only accepts integer commands

Chapter 2: The VLA Zoo

Before we go deeper into VLA-0's method, let's understand where it fits in the landscape. The paper identifies four families of Vision-Language-Action models, each with a different strategy for getting actions out of a VLM.

Family 1: Discrete Token VLAs

Examples: RT-2, OpenVLA

These models add new tokens to the VLM's vocabulary, typically 256 action-bin tokens that are shared across action dimensions. Action prediction becomes next-token prediction over these special tokens. The problem: 256 bins is coarse resolution. And injecting new tokens into a pre-trained vocabulary corrupts the embedding space: the model has to learn what these tokens mean from scratch, which can interfere with existing language understanding.

Family 2: Generative Action Head VLAs

Examples: π0, SmolVLA

These keep the VLM's vocabulary untouched but attach a new neural network (a diffusion or flow-matching head) that reads the VLM's hidden states and generates continuous actions. The VLM produces a latent representation; the action head decodes it into motor commands. The problem: the action head is randomly initialized and must be trained from scratch. It introduces millions of new parameters that can degrade the VLM's language grounding during joint fine-tuning.

Family 3: Custom Architecture VLAs

Examples: OpenVLA-OFT, π0-FAST

These design specialized modules — learned action tokenizers, custom codebooks, separate action decoders. They are often very effective but require significant engineering effort. Each new design choice (codebook size, tokenizer architecture, decoder depth) adds hyperparameters and potential failure modes.

Family 4: VLA-0 (Zero Modification)

The approach: Change nothing. Output actions as plain text integers. The VLM's existing vocabulary, architecture, and text generation pipeline are used as-is. Fine-tuning teaches the model what to output, not how to output.

Why does complexity seem necessary?

The assumption behind Families 1-3 is that robot actions are fundamentally different from text — they are continuous, high-frequency, and multi-dimensional. Therefore, the reasoning goes, you need specialized output mechanisms to handle them. Discrete token VLAs quantize actions into vocabulary tokens. Generative heads produce continuous outputs via learned denoising. Custom architectures build purpose-built decoders.

VLA-0 challenges this assumption: actions are just sequences of numbers, and VLMs already excel at generating sequences of numbers. The perceived gap between "text" and "actions" is an engineering choice, not a fundamental barrier.

The taxonomy principle: Families 1-3 modify the VLM's output mechanism to handle actions. VLA-0 modifies nothing — it reframes actions as text, which the VLM already knows how to produce. The simplicity is the innovation.

What distinguishes VLA-0 from all other VLA families?

It uses a larger VLM
It makes zero modifications to the VLM architecture: actions are output as plain text using existing vocabulary
It uses reinforcement learning instead of supervised learning

Chapter 3: Action as Text

Let's walk through the exact encoding pipeline that turns continuous robot actions into text and back.

Step 1: Normalize to [0, 1]

Each action dimension has a different physical range. The x-position might span [-0.5, 0.5] meters, while the gripper might be [0, 1]. First, VLA-0 normalizes every value to [0, 1] using the min and max observed in the training data:

a_norm = (a − a_min) / (a_max − a_min)

Step 2: Quantize to integers [0, B]

Multiply by B (default 1000) and round to the nearest integer:

a_int = round(a_norm × B)

This gives 1001 possible values per dimension — far more resolution than the 256 bins used by RT-2 or OpenVLA.
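
To make the resolution difference concrete, the sketch below converts the number of quantization levels into a physical step size for the x-position range [-0.5, 0.5] m used as an example in Step 1. Treating the 256-bin scheme as 256 evenly spaced points over the same range is a simplifying assumption of this example; actual binning schemes in other models may differ.

```python
def quantization_step(a_min: float, a_max: float, levels: int) -> float:
    """Spacing between adjacent values when [a_min, a_max] is covered by
    `levels` evenly spaced points (a simplifying assumption of this sketch)."""
    if levels < 2:
        raise ValueError("need at least two levels")
    return (a_max - a_min) / (levels - 1)


x_min, x_max = -0.5, 0.5  # metres, the example range from Step 1

step_b1000 = quantization_step(x_min, x_max, levels=1001)  # B = 1000
step_256 = quantization_step(x_min, x_max, levels=256)

# The worst-case rounding error is half a step.
print(f"B=1000:   step {step_b1000 * 1000:.2f} mm, max error {step_b1000 * 500:.2f} mm")
print(f"256 bins: step {step_256 * 1000:.2f} mm, max error {step_256 * 500:.2f} mm")
```

Under these assumptions a 1 m range is resolved to about 1 mm with B=1000, versus roughly 4 mm with 256 levels.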

Step 3: Serialize to text

Concatenate all integers across all timesteps as a single space-separated string. For H=4 timesteps and D=7 dimensions:

# Continuous actions (4 timesteps x 7 dims)
[0.423, 0.017, 0.891, 0.500, 0.334, 0.672, 1.000,
 0.418, 0.022, 0.887, 0.503, 0.330, 0.668, 1.000,
 0.412, 0.028, 0.882, 0.507, 0.325, 0.664, 1.000,
 0.405, 0.035, 0.876, 0.512, 0.319, 0.659, 0.000]

# After quantization (B=1000)
"423 17 891 500 334 672 1000 418 22 887 503 330 668 1000 412 28 882 507 325 664 1000 405 35 876 512 319 659 0"

Step 4: Decode back

At inference, parse the output string, split on spaces, convert to integers, divide by B, then denormalize back to physical units.
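
The four steps can be sketched as a pair of functions. This is a minimal illustration in pure Python, not the paper's implementation: the names `encode_actions` and `decode_actions`, the clamping of out-of-range values, and the strict length check are all assumptions added for this example.

```python
from typing import List, Sequence


def encode_actions(actions: Sequence[Sequence[float]],
                   a_min: Sequence[float],
                   a_max: Sequence[float],
                   bins: int = 1000) -> str:
    """Steps 1-3: normalize to [0, 1], quantize to [0, bins], serialize."""
    out = []
    for row in actions:
        for value, lo, hi in zip(row, a_min, a_max):
            if hi <= lo:
                raise ValueError("a_max must exceed a_min in every dimension")
            norm = (value - lo) / (hi - lo)
            norm = min(max(norm, 0.0), 1.0)  # clamp: a safeguard added here
            # Note: Python's round() rounds exact .5 cases to the even integer.
            out.append(str(round(norm * bins)))
    return " ".join(out)


def decode_actions(text: str,
                   a_min: Sequence[float],
                   a_max: Sequence[float],
                   horizon: int,
                   bins: int = 1000) -> List[List[float]]:
    """Step 4: parse, divide by bins, denormalize to physical units."""
    dim = len(a_min)
    ints = [int(p) for p in text.split()]  # raises ValueError on non-numeric text
    if len(ints) != horizon * dim:
        raise ValueError(f"expected {horizon * dim} integers, got {len(ints)}")
    rows = []
    for t in range(horizon):
        chunk = ints[t * dim:(t + 1) * dim]
        rows.append([lo + (min(max(q, 0), bins) / bins) * (hi - lo)
                     for q, lo, hi in zip(chunk, a_min, a_max)])
    return rows


# Round trip on a toy 2-dimensional action with made-up ranges.
a_min, a_max = [-0.5, 0.0], [0.5, 1.0]
text = encode_actions([[0.0, 1.0], [0.25, 0.0]], a_min, a_max)
decoded = decode_actions(text, a_min, a_max, horizon=2)
print(text)
print(decoded)
```

The round trip is lossy in general: each decoded value can differ from the original by up to half a quantization step. Rejecting outputs with the wrong number of integers is one possible policy; a real controller might instead fall back to a previous prediction.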

Tensor shapes through the pipeline: Raw actions from dataset: (H, D) float32, e.g., (4, 7). Normalized: (H, D) float32 in [0, 1]. Quantized: (H, D) int in [0, 1000]. Serialized: a single string of H×D space-separated integers, 28 for H=4, D=7. The VLM predicts this string autoregressively, token by token; the exact number of output tokens depends on how the tokenizer splits digits and spaces.

Why B=1000? Ablations show diminishing returns above B=1000. At B=250, accuracy drops 1.5pp (too coarse). At B=4000, accuracy drops 0.5pp (the tokenizer splits large numbers into multiple tokens, hurting efficiency). B=1000 hits the sweet spot: high resolution, single-token integers for most values.

Why is B=1000 the default resolution rather than B=4000 or higher?

Higher B values cause numerical overflow
At B=4000, large integers get split into multiple tokens by the tokenizer, reducing efficiency, while B=1000 gives high resolution with mostly single-token values
The robot hardware only supports 1000 discrete positions

Chapter 4: The Recipe

Encoding actions as text is necessary but not sufficient. VLA-0 has two additional techniques that push it from good to state-of-the-art: ensemble prediction and masked action augmentation.

Ensemble Prediction (from ACT)

Imagine you are at timestep t=10. The model predicts H=4 timesteps ahead. So at t=10, you get predictions for t=10, 11, 12, 13. But at t=9, the model already predicted t=9, 10, 11, 12. At t=8, it predicted t=8, 9, 10, 11. And at t=7, it predicted t=7, 8, 9, 10.

For the action at t=10, you now have up to four predictions: one made at t=10 (fresh) and one each from t=9, t=8, and t=7 (1, 2, and 3 steps old). Ensemble averaging means taking the mean of the available predictions for each timestep, optionally keeping only the most recent few.

a_t = (1/n) · Σ_{k=0}^{n−1} â_{t|t−k}

Where â_{t|t−k} is the prediction for timestep t made at timestep t−k, and n is the number of available predictions (capped at the ensemble window size).
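
The averaging rule above can be sketched with a small buffer of overlapping chunks. The class name `TemporalEnsembler`, its methods, and the decision to average in decoded (continuous) action space are assumptions of this example rather than details confirmed by the source.

```python
from typing import Dict, List, Sequence


class TemporalEnsembler:
    """Averages the overlapping predictions available for each timestep."""

    def __init__(self, window: int):
        self.window = window  # maximum number of predictions averaged (n)
        self._preds: Dict[int, List[List[float]]] = {}

    def add_chunk(self, t: int, chunk: Sequence[Sequence[float]]) -> None:
        """Store an H-step chunk predicted at timestep t; chunk[k] targets t + k."""
        for k, action in enumerate(chunk):
            self._preds.setdefault(t + k, []).append(list(action))

    def action_for(self, t: int) -> List[float]:
        """Mean of the most recent `window` predictions made for timestep t."""
        preds = self._preds.pop(t, [])
        if not preds:
            raise KeyError(f"no predictions stored for timestep {t}")
        recent = preds[-self.window:]  # chunks arrive in time order
        n = len(recent)
        return [sum(column) / n for column in zip(*recent)]


# Toy example with 1-dimensional actions and H = 4.
ens = TemporalEnsembler(window=3)
ens.add_chunk(7, [[0.0]] * 4)   # targets t = 7..10
ens.add_chunk(8, [[1.0]] * 4)   # targets t = 8..11
ens.add_chunk(9, [[2.0]] * 4)   # targets t = 9..12
ens.add_chunk(10, [[3.0]] * 4)  # targets t = 10..13
a10 = ens.action_for(10)
print(a10)
```

With a window of 3, the prediction made at t=7 is dropped and the remaining three values (1.0, 2.0, 3.0) are averaged. At the start of an episode fewer predictions exist, so n is simply smaller.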

Why ensembling helps (+2.7pp): Individual predictions are noisy. A single prediction might jitter the gripper or overshoot a position. By averaging across multiple predictions, random noise cancels out while the consistent signal (the correct action) is reinforced. This is the same principle behind ensemble methods everywhere in ML, applied to temporal action predictions.

Masked Action Augmentation (novel)

During training, VLA-0 randomly masks characters in the target action string. A target like "423 17 891" might become "4_3 _7 89_" where underscores are masked (replaced with a special mask token).

Why? Because without masking, the VLM can learn a shortcut: instead of looking at the image to determine the next action, it can simply pattern-match from the beginning of the action sequence. "If the first few numbers are 423 17, then the next is probably 891" — without ever looking at the image.

Masked augmentation forces visual grounding (+1.2pp): By randomly corrupting parts of the target sequence, the model cannot rely on auto-completing numerical patterns. It must attend to the visual observation to predict each action dimension correctly. This is a form of regularization that prevents the sequence-completion shortcut.

How masking works in practice

The masking operates at the character level, not the token level. This is a subtle but important distinction. Consider the target string "423 17 891". Token-level masking would replace entire numbers, leaving the model with no partial information. Character-level masking replaces individual digits: "4_3 _7 89_". The model still sees that the first number starts with 4 and ends with 3, but must infer the middle digit from the visual context.

The masking probability is tuned to corrupt enough of the sequence to prevent shortcuts while preserving enough structure for efficient learning. Too much masking slows convergence; too little lets the shortcut persist.
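
A minimal sketch of character-level masking follows. The underscore placeholder, the choice to mask only digits (leaving the separating spaces intact, as in the "4_3 _7 89_" example), and the function name `mask_action_string` are assumptions for illustration; the source does not state the mask token or the masking probability.

```python
import random


def mask_action_string(target: str,
                       mask_prob: float,
                       rng: random.Random,
                       mask_char: str = "_") -> str:
    """Replace each digit with `mask_char` independently with probability
    `mask_prob`. Spaces are kept so the number boundaries stay visible
    (an assumption of this sketch)."""
    masked = []
    for ch in target:
        if ch.isdigit() and rng.random() < mask_prob:
            masked.append(mask_char)
        else:
            masked.append(ch)
    return "".join(masked)


rng = random.Random(0)  # seeded only so the example is repeatable
example = mask_action_string("423 17 891", mask_prob=0.3, rng=rng)
print(example)
```

Because masking is applied per character, the string length and the positions of the spaces never change, so a partially masked number still tells the model how many digits it has.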

Which technique contributes more to VLA-0's performance?

Ensemble prediction (+2.7pp) is more impactful than masked augmentation (+1.2pp)
Masked augmentation is more impactful
They contribute equally

Chapter 5: Training

VLA-0's training setup is remarkably straightforward because it inherits everything from standard VLM fine-tuning. No special losses, no multi-stage training, no curriculum.

Base model

Qwen-VL-2.5 3B, a 3-billion parameter Vision-Language Model from Alibaba. It encodes images with a ViT encoder, projects visual tokens into the LLM's embedding space, and generates text autoregressively. VLA-0 uses it exactly as released, with no architectural changes.

Fine-tuning details

Method: Full fine-tuning (all parameters, not LoRA or adapter-based)
Loss: Standard cross-entropy on the action text tokens. The model predicts "423" token by token, just like predicting any other text.
Optimizer: Adam
Batch size: 192
Learning rate: 5×10^-6
Epochs: 64
Hardware: 8×A100 GPUs
Wall time: ~32 hours

The training data flow: Each training sample is (image, system_prompt, action_string). The image passes through the ViT encoder → visual tokens. The system prompt + action string are tokenized normally. Cross-entropy loss is computed only on the action tokens (the model should predict the action string given the image and prompt). Gradients flow through the entire model including the vision encoder.
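
Computing the loss only on action tokens is usually done by masking the labels for the prompt positions. The sketch below shows that bookkeeping with plain lists. The token IDs are made up, the function name `build_labels` is hypothetical, and the use of -100 follows the default `ignore_index` of PyTorch's `CrossEntropyLoss`; whether VLA-0's training code does it exactly this way is an assumption.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids: Sequence[int],
                 action_ids: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and action tokens; mask prompt positions in the
    labels so that only action tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(action_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(action_ids)
    return input_ids, labels


# Made-up token IDs standing in for (image + system prompt) and "423 17 891".
prompt_ids = [101, 102, 103, 104]
action_ids = [7, 8, 9]
input_ids, labels = build_labels(prompt_ids, action_ids)
print(input_ids)
print(labels)
```

The model still sees the full sequence as input; only the positions whose label is not the ignore value are scored.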

What makes this simple

Compare to π0's training: it needs a separate flow-matching loss for the action head, careful balancing between language and action losses, and a warm-up stage for the randomly initialized head. Compare to π0-FAST: it needs a custom tokenizer training phase. VLA-0 needs none of this. It is standard supervised fine-tuning with cross-entropy loss. Any VLM training framework (Hugging Face, LLaMA-Factory, etc.) works out of the box.

No pre-training required: Most high-performing VLAs use large-scale action pre-training on datasets like Open X-Embodiment (OXE) before fine-tuning on specific tasks. VLA-0 achieves SOTA results with only task-specific fine-tuning — no pre-training on any external action dataset. This dramatically reduces the compute budget.

Inference

At inference time, VLA-0 runs standard autoregressive decoding. Given an image and the system prompt, it generates tokens one at a time until it has produced H×D integers. On an NVIDIA RTX 5090 GPU, this runs at 4 Hz, which is adequate for many manipulation tasks but slower than dedicated action heads that can run at 10+ Hz.

Why full fine-tuning beats LoRA

The paper uses full fine-tuning rather than parameter-efficient methods like LoRA. This is deliberate: the model needs to learn a fundamentally new task (mapping visual observations to motor commands), not just adapt its style. The visual encoder must learn to attend to task-relevant features (object positions, gripper state) rather than the high-level semantics it was pre-trained for. Full fine-tuning allows every layer to adjust.

Compute comparison: VLA-0 trains for 32 hours on 8×A100 (~256 GPU-hours). π0 uses large-scale pre-training on OXE (thousands of GPU-hours) before task-specific fine-tuning. OpenVLA-OFT similarly requires pre-training. VLA-0's total compute is 10-100x less than alternatives that use action pre-training, making it accessible to academic labs with modest GPU budgets.

What loss function does VLA-0 use for training?

Flow-matching loss on continuous action vectors
Standard cross-entropy on action text tokens, the same loss used for any text generation task
Mean squared error on predicted vs actual actions

Chapter 6: LIBERO Results

LIBERO is the primary benchmark — a simulation suite for robotic manipulation with four task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long. Each suite tests different capabilities. Average success rate across all four suites is the headline metric.

Without large-scale action pretraining

This is the fairest comparison — all methods trained only on LIBERO task data:

VLA-0 achieves 94.7% average success rate with rank 1.0 across all suites — beating every method including those with far more complex architectures. The closest competitor, π0.5-KI, reaches 93.3% (rank 2.3).

The full leaderboard:

Diffusion Policy: 72.4% (rank 6.5)
π0-FAST (Paligemma): 71.8% (rank 6.0)
SmolVLA 0.24B: 82.8% (rank 5.3)
SmolVLA 2.25B: 88.8% (rank 4.0)
OpenVLA-OFT: 91.9% (rank 2.8)
π0.5-KI: 93.3% (rank 2.3)
VLA-0: 94.7% (rank 1.0)

With large-scale action pretraining

Some methods use massive pre-training on Open X-Embodiment before fine-tuning on LIBERO. VLA-0 has no such pretraining, yet remains competitive:

π0: 94.2% (rank 3.3)
π0.5-KI: 94.3% (rank 3.0)
OpenVLA-OFT: 97.1% (rank 1.5)
VLA-0: 94.7% (rank 2.8) — no pretraining!

The remarkable finding: VLA-0 without any action pretraining (94.7%) outperforms π0 with large-scale pretraining (94.2%). A zero-modification 3B VLM fine-tuned for 32 hours beats a custom architecture pre-trained on orders of magnitude more data.

Breaking down the suites

LIBERO has four task suites, each testing different capabilities:

LIBERO-Spatial: Tasks requiring spatial reasoning — "put the bowl to the left of the plate." VLA-0 excels here because the preserved VLM understands spatial language naturally.
LIBERO-Object: Tasks requiring object identification — "pick up the red cup." The VLM's visual grounding transfers directly.
LIBERO-Goal: Tasks with specific goal states — "close the drawer." Requires understanding goal semantics from language.
LIBERO-Long: Multi-step tasks requiring sequential reasoning — "open the cabinet, pick up the bowl, and put it on the shelf." The most challenging suite; VLA-0's language understanding helps plan multi-step actions.

Among methods trained without large-scale action pretraining, VLA-0 achieves rank 1.0, meaning it is the best method on every single suite. No other method in that comparison achieves this clean sweep.

How does VLA-0 (no pretraining) compare to π0 (with large-scale pretraining)?

π0 is significantly better due to its pretraining advantage
VLA-0 without pretraining (94.7%) outperforms π0 with pretraining (94.2%)
They are exactly equal

Chapter 7: Real-World Experiments

Simulation benchmarks are useful but not conclusive. Real robots introduce noise, latency, and physical dynamics that simulators cannot perfectly model. VLA-0 was tested on the SO-100 robot arm across four manipulation tasks.

The tasks

Pick and place: Grasp an object and move it to a target location
Stacking: Stack one object on top of another
Insertion: Insert a peg into a hole
Wiping: Wipe a surface with a cloth

VLA-0 vs SmolVLA

SmolVLA is the primary comparison because it was specifically pre-trained on SO-100 data — giving it a significant data advantage. Despite this:

VLA-0 outperforms SmolVLA by 12.5 percentage points on average (60% vs 47.5% success rate) across all four tasks. SmolVLA had SO-100-specific pretraining. VLA-0 had none.

This result matters because it shows the simplicity of VLA-0 does not come at the cost of real-world performance. The plain-text action representation works not only in simulation but also on physical hardware.

The sim-to-real gap

Many VLA methods that perform well in simulation underperform on real hardware. The main culprits: visual distribution shift (real lighting, backgrounds, and textures differ from simulation), action noise (real actuators have backlash, friction, and latency), and observation noise (real cameras have blur, reflections, and varying exposure). VLA-0's strong real-world performance suggests that preserving the VLM's visual reasoning helps bridge this gap.

Why does VLA-0 win?

The paper hypothesizes two reasons:

Preserved VLM reasoning: Because the architecture is unchanged, the VLM retains its full visual understanding capabilities. It can reason about object shapes, spatial relationships, and task semantics more effectively than models whose vision capabilities were degraded during action head training.
Ensemble smoothing: The temporal ensemble produces smoother trajectories on real hardware, which is more important in the physical world where jerky movements cause failures (e.g., dropping objects, missing insertion targets).

By how much does VLA-0 outperform SmolVLA on the real SO-100 robot, despite SmolVLA having SO-100-specific pretraining?

They perform about the same on real hardware
VLA-0 outperforms SmolVLA by 12.5 percentage points (60% vs 47.5%)
SmolVLA is slightly better due to its SO-100-specific pretraining

Chapter 8: Ablations

What actually matters in VLA-0's recipe? The paper systematically removes or changes each component and measures the impact on LIBERO average success rate (baseline: 94.7%).

The ablation results

Remove ensemble prediction: 92.0% (−2.7pp) — most impactful
Remove masked augmentation: 93.5% (−1.2pp)
Resolution B=250: 93.2% (−1.5pp) — too coarse
Resolution B=4000: 94.2% (−0.5pp) — slightly worse due to multi-token integers
Separate images vs tiled: 94.5% (−0.2pp) — negligible

The hierarchy of importance: Ensemble prediction >> Action resolution >> Masked augmentation >> Image format. The ensemble is the single most critical technique, contributing roughly twice the improvement of masking. Resolution matters in both directions — too coarse (250) is worse than too fine (4000).
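
The drops quoted in the ablation list can be recomputed from the reported success rates and the 94.7% baseline. The short script below does that arithmetic and sorts the ablations by impact; the numbers are copied from the list above, and rounding to one decimal place is only there to hide floating-point noise.

```python
BASELINE = 94.7  # LIBERO average success rate of the full recipe (%)

# Success rates reported for each ablation (%), copied from the list above.
ablations = {
    "Remove ensemble prediction": 92.0,
    "Remove masked augmentation": 93.5,
    "Resolution B=250": 93.2,
    "Resolution B=4000": 94.2,
    "Separate images vs tiled": 94.5,
}

# Drop in percentage points relative to the baseline.
drops = {name: round(BASELINE - score, 1) for name, score in ablations.items()}

# Largest drop first.
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
for name, drop in ranked:
    print(f"{name}: -{drop:.1f}pp")
```

Sorting by drop puts ensemble prediction first, followed by the coarse B=250 resolution and then masked augmentation, which matches the ordering described in the text.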

What this tells us

Ensembling is king. Averaging multiple temporal predictions smooths noise and corrects individual errors. This +2.7pp gain is free at inference (you already have the predictions from previous timesteps).

Masking is cheap insurance. +1.2pp for randomly corrupting training targets is an excellent trade-off. It requires no additional compute, just a data augmentation step.

Resolution has a sweet spot. B=1000 is optimal. Going lower hurts precision; going higher hurts tokenization efficiency. This is a clean engineering insight: match your quantization resolution to your tokenizer's integer vocabulary.

Which ablation causes the largest drop in performance?

Removing ensemble prediction (−2.7pp)
Reducing resolution to B=250 (−1.5pp)
Removing masked augmentation (−1.2pp)

Chapter 9: Connections

VLA-0 sits at a fascinating intersection: it proves that the simplest approach can beat complex engineered alternatives, but it also raises questions about when simplicity will hit its limits.

Cheat sheet

VLA-0 in 30 seconds:

What: VLA with zero VLM modifications — actions as space-separated integers
Base model: Qwen-VL-2.5 3B, full fine-tuning
Key techniques: Ensemble prediction (+2.7pp), masked action augmentation (+1.2pp)
Resolution: B=1000 (normalize to [0,1000] integers)
Loss: Standard cross-entropy on text tokens
LIBERO: 94.7% (rank 1 without pretraining)
Real-world: Beats SmolVLA by 12.5pp on SO-100
Training: 32hr on 8×A100, Adam, lr=5e-6, batch 192
Inference: 4 Hz on 5090

Related work

π0 — Flow matching action heads. The generative action head approach that VLA-0 outperforms without any such head.
π0.5 — Scaling VLAs with massive action pretraining. VLA-0 matches its performance without pretraining.
VLA Foundry — Unified framework for training and comparing VLA architectures.
Diffusion Policy — The non-VLA baseline that VLA-0 far exceeds (94.7% vs 72.4%).
FAST — Custom action tokenization. A complex alternative in the "custom architecture" family.

Open questions

Scaling: VLA-0 uses a 3B model. Will the text-as-action approach continue to work at 7B, 13B, 72B? Or will larger models "overthink" the numerical generation?
Speed: 4 Hz is adequate for pick-and-place but too slow for dynamic tasks (catching, pouring). Can speculative decoding or parallel generation help?
Multimodal actions: The current approach handles joint positions. Can it extend to force/torque control, bimanual coordination, or locomotion?
Composability: Since VLA-0 preserves the VLM's language abilities, can it chain natural language reasoning with action generation in a single forward pass?
Multi-image context: Can VLA-0 use a history of images (not just the current frame) to improve temporal reasoning? The current setup uses a single image per timestep.

The bigger picture

VLA-0 fits into a broader trend in AI: foundation model reuse without modification. Just as RAG lets LLMs access knowledge without retraining, and in-context learning lets them perform tasks without fine-tuning, VLA-0 shows that VLMs can control robots without architectural surgery. The key insight is not that text-as-action is a clever hack — it is that VLMs are more capable than we assumed, and the engineering complexity of previous approaches was solving a problem that did not exist.

The meta-lesson: In ML engineering, the simplest approach that can possibly work often does work — and better than you expect. VLA-0 is a reminder that before designing complex architectures, you should try the obvious thing first. The VLM already speaks numbers. Let it.