Goyal, Hadfield, Yang, Blukis & Ramos — NVIDIA, 2025

VLA-0: Zero Modification VLAs

The simplest VLA design — predict robot actions as plain text with no architectural changes — is surprisingly state-of-the-art.

Prerequisites: VLM basics + Robot manipulation concepts
10 Chapters · 4+ Simulations

Chapter 0: The Problem

You want to build a robot that follows language instructions. You have a powerful Vision-Language Model — it can describe images, answer questions, write code. Now you want it to also output robot actions: move the arm here, close the gripper, rotate the wrist.

The obvious approach: take your VLM and add an action output. But how you add that output matters enormously. And every existing approach requires modifying the VLM in some way.

What people have tried

Each of these modifications, whether new action tokens, an attached action head, or a custom tokenizer or decoder, introduces engineering complexity, training instability, and the risk of catastrophic forgetting, where the VLM loses its original language and vision capabilities while learning to predict actions.

The uncomfortable question: What if all these modifications are unnecessary? What if a VLM can already predict robot actions — as plain text — without changing a single thing about its architecture?

That is exactly what VLA-0 shows. Take a VLM. Change nothing. Fine-tune it to output actions as space-separated integers. It outperforms every alternative.

Chapter 1: The Key Insight

Here is the insight that makes VLA-0 work: VLMs already know how to output numbers.

Think about it. A model like Qwen-VL-2.5 has been trained on billions of text tokens. It has seen coordinates, measurements, prices, dates, temperatures — numbers in every conceivable context. It knows the token "423" and the token "17" and how to produce sequences of space-separated integers.

A robot action is just a vector of numbers. A 7-DoF arm has 7 continuous values per timestep: x, y, z position, roll, pitch, yaw orientation, and gripper state. If you normalize each value to an integer in [0, 1000] and write them as text, you get something like:

423 17 891 500 334 672 1000

That is a perfectly valid text string. The VLM can generate it with standard autoregressive decoding. No new tokens. No action heads. No vocabulary modifications.

Why integers, not floats? A floating-point number like "0.4231" is split into several tokens (digits plus a decimal point) and can be tokenized inconsistently. An integer like "423" is short and tokenizes cleanly; in many vocabularies it is a single token. By mapping continuous actions to the integer range [0, 1000], VLA-0 gets high resolution (0.1% of the action range per step) with compact tokenization.

The system prompt

VLA-0 uses a carefully designed system prompt that tells the VLM exactly what to output:

"Analyze the input image and predict robot actions
for the next H timesteps. Each action has D dimensions.
Output a single sequence of H x D integers (0-B each),
representing the H timesteps sequentially.
Provide only space-separated numbers. Nothing else."

For a 7-DoF arm predicting 4 timesteps ahead, the model outputs 28 integers: 423 17 891 500 334 672 1000 418 22 887 .... The first 7 are timestep 1, the next 7 are timestep 2, and so on.

The data flow: Image (224×224×3) + system prompt (text) → Qwen-VL-2.5 3B (frozen architecture, fine-tuned weights) → text string of H×D space-separated integers → parse to integers → denormalize to continuous values in original action range → send to robot.
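
The text side of this data flow can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, filling H, D and B into the template by string formatting is an assumption about how the prompt is instantiated, and clamping out-of-range integers is an added safety choice that the source does not describe.

```python
# Template copied from the system prompt above; {H}, {D}, {B} are filled per task.
SYSTEM_PROMPT = (
    "Analyze the input image and predict robot actions "
    "for the next {H} timesteps. Each action has {D} dimensions. "
    "Output a single sequence of {H} x {D} integers (0-{B} each), "
    "representing the {H} timesteps sequentially. "
    "Provide only space-separated numbers. Nothing else."
)


def build_prompt(horizon: int, dims: int, bins: int) -> str:
    """Fill the prompt template (hypothetical helper)."""
    return SYSTEM_PROMPT.format(H=horizon, D=dims, B=bins)


def parse_actions(text: str, horizon: int, dims: int, bins: int) -> list:
    """Turn the generated string into a horizon x dims grid of integers."""
    values = [int(tok) for tok in text.split()]  # raises ValueError on non-integers
    if len(values) != horizon * dims:
        raise ValueError(f"expected {horizon * dims} integers, got {len(values)}")
    # Assumption: clamp stray values into [0, bins] instead of rejecting them.
    values = [min(max(v, 0), bins) for v in values]
    return [values[i * dims:(i + 1) * dims] for i in range(horizon)]


if __name__ == "__main__":
    print(build_prompt(4, 7, 1000))
    print(parse_actions("423 17 891 500 334 672 1000", horizon=1, dims=7, bins=1000))
```

Rejecting malformed outputs outright (instead of clamping) is an equally reasonable design; which one is safer depends on the robot and the task.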

Chapter 2: The VLA Zoo

Before we go deeper into VLA-0's method, let's understand where it fits in the landscape. The paper identifies four families of Vision-Language-Action models, each with a different strategy for getting actions out of a VLM.

Family 1: Discrete Token VLAs

Examples: RT-2, OpenVLA

These models dedicate special tokens in the VLM's vocabulary to actions, typically 256 bins per action dimension. Action prediction becomes next-token prediction over these special tokens. The problem: 256 bins is coarse resolution. And injecting or repurposing tokens in a pre-trained vocabulary disturbs the embedding space: the model has to learn what these tokens mean from scratch, which can interfere with existing language understanding.

Family 2: Generative Action Head VLAs

Examples: π0, SmolVLA

These keep the VLM's vocabulary untouched but attach a new neural network (a diffusion or flow-matching head) that reads the VLM's hidden states and generates continuous actions. The VLM produces a latent representation; the action head decodes it into motor commands. The problem: the action head is randomly initialized and must be trained from scratch. It introduces millions of new parameters that can degrade the VLM's language grounding during joint fine-tuning.

Family 3: Custom Architecture VLAs

Examples: OpenVLA-OFT, π0-FAST

These design specialized modules — learned action tokenizers, custom codebooks, separate action decoders. They are often very effective but require significant engineering effort. Each new design choice (codebook size, tokenizer architecture, decoder depth) adds hyperparameters and potential failure modes.

Family 4: VLA-0 (Zero Modification)

The approach: Change nothing. Output actions as plain text integers. The VLM's existing vocabulary, architecture, and text generation pipeline are used as-is. Fine-tuning teaches the model what to output, not how to output it.

Why does complexity seem necessary?

The assumption behind Families 1-3 is that robot actions are fundamentally different from text — they are continuous, high-frequency, and multi-dimensional. Therefore, the reasoning goes, you need specialized output mechanisms to handle them. Discrete token VLAs quantize actions into vocabulary tokens. Generative heads produce continuous outputs via learned denoising. Custom architectures build purpose-built decoders.

VLA-0 challenges this assumption: actions are just sequences of numbers, and VLMs already excel at generating sequences of numbers. The perceived gap between "text" and "actions" is an engineering choice, not a fundamental barrier.

The taxonomy principle: Families 1-3 modify the VLM's output mechanism to handle actions. VLA-0 modifies nothing — it reframes actions as text, which the VLM already knows how to produce. The simplicity is the innovation.

Chapter 3: Action as Text

Let's walk through the exact encoding pipeline that turns continuous robot actions into text and back.

Step 1: Normalize to [0, 1]

Each action dimension has a different physical range. The x-position might span [-0.5, 0.5] meters, while the gripper might be [0, 1]. First, VLA-0 normalizes every value to [0, 1] using the min and max observed in the training data:

a_norm = (a − a_min) / (a_max − a_min)

Step 2: Quantize to integers [0, B]

Multiply by B (default 1000) and round to the nearest integer:

a_int = round(a_norm × B)

This gives 1001 possible values per dimension — far more resolution than the 256 bins used by RT-2 or OpenVLA.
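
To put numbers on the resolution claim, here is a worked example for the x-position range [−0.5, 0.5] m quoted in Step 1. It assumes uniform round-to-nearest quantization, and it treats a 256-bin scheme as 256 equal-width bins decoded to their centers, which is a simplification and not a description of any specific model.

```latex
\Delta = \frac{a_{\max} - a_{\min}}{\text{number of steps}}, \qquad
\varepsilon_{\max} = \frac{\Delta}{2}

B = 1000:\quad \Delta = \frac{1\,\text{m}}{1000} = 1\,\text{mm}, \qquad
\varepsilon_{\max} = 0.5\,\text{mm}

256\ \text{bins}:\quad \Delta = \frac{1\,\text{m}}{256} \approx 3.9\,\text{mm}, \qquad
\varepsilon_{\max} \approx 1.95\,\text{mm}
```

Under these assumptions the worst-case rounding error shrinks by roughly a factor of four when moving from 256 bins to B=1000.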

Step 3: Serialize to text

Concatenate all integers across all timesteps as a single space-separated string. For H=4 timesteps and D=7 dimensions:

# Continuous actions (4 timesteps x 7 dims)
[0.423, 0.017, 0.891, 0.500, 0.334, 0.672, 1.000,
 0.418, 0.022, 0.887, 0.503, 0.330, 0.668, 1.000,
 0.412, 0.028, 0.882, 0.507, 0.325, 0.664, 1.000,
 0.405, 0.035, 0.876, 0.512, 0.319, 0.659, 0.000]

# After quantization (B=1000)
"423 17 891 500 334 672 1000 418 22 887 503 330 668 1000 412 28 882 507 325 664 1000 405 35 876 512 319 659 0"

Step 4: Decode back

At inference, parse the output string, split on spaces, convert to integers, divide by B, then denormalize back to physical units.
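
Steps 1 to 4 can be sketched as a pair of functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, a_min and a_max are per-dimension lists taken from the training data as described in Step 1, and clipping normalized values to [0, 1] (for test-time values outside the training range) is an added assumption.

```python
def encode_actions(actions, a_min, a_max, B=1000):
    """Normalize, quantize, and serialize an (H, D) list of floats to text."""
    ints = []
    for step in actions:
        for a, lo, hi in zip(step, a_min, a_max):
            a_norm = (a - lo) / (hi - lo)          # Step 1: normalize to [0, 1]
            a_norm = min(max(a_norm, 0.0), 1.0)    # assumption: clip outliers
            ints.append(round(a_norm * B))         # Step 2: quantize to [0, B]
    return " ".join(str(i) for i in ints)          # Step 3: serialize


def decode_actions(text, a_min, a_max, dims, B=1000):
    """Step 4: parse the string and map integers back to physical units."""
    values = [int(tok) for tok in text.split()]
    if len(values) % dims != 0:
        raise ValueError("output length is not a multiple of the action dimension")
    steps = [values[i:i + dims] for i in range(0, len(values), dims)]
    return [
        [lo + (q / B) * (hi - lo) for q, lo, hi in zip(step, a_min, a_max)]
        for step in steps
    ]


if __name__ == "__main__":
    a_min, a_max = [-0.5, 0.0], [0.5, 1.0]
    text = encode_actions([[0.1234, 0.3], [-0.45, 0.99]], a_min, a_max)
    print(text)
    print(decode_actions(text, a_min, a_max, dims=2))
```

The round trip is lossy by design: each decoded value can differ from the original by at most half a quantization step, (a_max − a_min) / (2B).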

Tensor shapes through the pipeline: Raw actions from dataset: (H, D) float32, e.g., (4, 7). Normalized: (H, D) float32 in [0, 1]. Quantized: (H, D) int in [0, 1000]. Serialized: a single string of H×D space-separated integers. The VLM predicts this string autoregressively, token by token. For H=4 and D=7 that is 28 integers; the exact token count depends on how the tokenizer splits digits and spaces.

Why B=1000? Ablations show no benefit from going above B=1000. At B=250, accuracy drops 1.5pp (too coarse). At B=4000, accuracy drops 0.5pp (the tokenizer splits large numbers into multiple tokens, hurting efficiency). B=1000 hits the sweet spot: high resolution without longer token sequences.

Chapter 4: The Recipe

Encoding actions as text is necessary but not sufficient. VLA-0 has two additional techniques that push it from good to state-of-the-art: ensemble prediction and masked action augmentation.

Ensemble Prediction (from ACT)

Imagine you are at timestep t=10. The model predicts H=4 timesteps ahead. So at t=10, you get predictions for t=10, 11, 12, 13. But at t=9, the model already predicted t=9, 10, 11, 12. At t=8, it predicted t=8, 9, 10, 11. And at t=7, it predicted t=7, 8, 9, 10.

For the action at t=10, you now have up to four predictions: one made at t=10 (fresh), and one each from t=9, t=8, and t=7 (1, 2, and 3 steps old). If the ensemble window is smaller than H, only the predictions inside the window are used. Ensemble averaging means taking the mean of all available predictions for each timestep.

a_t = (1/n) ∑_{k=0}^{n−1} â_{t|t−k}

where â_{t|t−k} is the prediction for timestep t made at timestep t−k, and n is the number of available predictions (capped at the ensemble window size).

Why ensembling helps (+2.0pp): Individual predictions are noisy. A single prediction might jitter the gripper or overshoot a position. By averaging across multiple predictions, random noise cancels out while the consistent signal (the correct action) is reinforced. This is the same principle behind ensemble methods everywhere in ML — applied to temporal action predictions.
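
The averaging described above can be sketched as a small buffer keyed by target timestep. This is an illustrative sketch, not the authors' implementation: the class and method names are hypothetical, it uses the plain mean given in the text (no weighting), and keeping the most recent predictions when the window is exceeded is an assumption.

```python
from collections import defaultdict


class TemporalEnsemble:
    """Average overlapping action-chunk predictions for each timestep."""

    def __init__(self, window: int):
        self.window = window
        # target timestep -> predictions for it, oldest first
        self.pending = defaultdict(list)

    def add_chunk(self, t: int, chunk):
        """Store a chunk predicted at time t; chunk[k] targets timestep t + k."""
        for k, action in enumerate(chunk):
            self.pending[t + k].append(list(action))

    def action_for(self, t: int):
        """Mean of the (at most `window`) most recent predictions for timestep t."""
        preds = self.pending.pop(t)[-self.window:]
        n = len(preds)
        return [sum(vals) / n for vals in zip(*preds)]


if __name__ == "__main__":
    ens = TemporalEnsemble(window=3)
    ens.add_chunk(8, [[0.0], [0.0], [3.0], [0.0]])   # 3.0 targets t=10
    ens.add_chunk(9, [[0.0], [6.0], [0.0], [0.0]])   # 6.0 targets t=10
    ens.add_chunk(10, [[9.0], [0.0], [0.0], [0.0]])  # 9.0 targets t=10
    print(ens.action_for(10))
```

Note that overlapping predictions only exist if the model is queried at every timestep, so ensembling trades extra forward passes for smoother actions.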

Masked Action Augmentation (novel)

During training, VLA-0 randomly masks characters in the target action string. A target like "423 17 891" might become "4_3 _7 89_" where underscores are masked (replaced with a special mask token).

Why? Because without masking, the VLM can learn a shortcut: instead of looking at the image to determine the next action, it can simply pattern-match from the beginning of the action sequence. "If the first few numbers are 423 17, then the next is probably 891" — without ever looking at the image.

Masked augmentation forces visual grounding (+1.2pp): By randomly corrupting parts of the target sequence, the model cannot rely on auto-completing numerical patterns. It must attend to the visual observation to predict each action dimension correctly. This is a form of regularization that prevents the sequence-completion shortcut.

How masking works in practice

The masking operates at the character level, not the token level. This is a subtle but important distinction. Consider the target string "423 17 891". Token-level masking would replace entire numbers, leaving the model with no partial information. Character-level masking replaces individual digits: "4_3 _7 89_". The model still sees that the first number starts with 4 and ends with 3, but must infer the middle digit from the visual context.

The masking probability is tuned to corrupt enough of the sequence to prevent shortcuts while preserving enough structure for efficient learning. Too much masking slows convergence; too little lets the shortcut persist.
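
Character-level masking can be sketched as follows. This is a hedged illustration: the function name and the mask_prob value are hypothetical (the source does not state the probability used), the underscore stands in for whatever mask token the model actually uses, and leaving the separating spaces untouched is an assumption based on the "4_3 _7 89_" example.

```python
import random


def mask_action_string(target: str, mask_prob: float, rng: random.Random,
                       mask_char: str = "_") -> str:
    """Replace each digit with mask_char independently with probability mask_prob."""
    out = []
    for ch in target:
        if ch.isdigit() and rng.random() < mask_prob:
            out.append(mask_char)   # masked digit
        else:
            out.append(ch)          # kept digit, or a separating space
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(mask_action_string("423 17 891", mask_prob=0.3, rng=rng))
```

Because spaces are never masked, the number boundaries stay visible and only the digit content has to be recovered from the image.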

Chapter 5: Training

VLA-0's training setup is remarkably straightforward because it inherits everything from standard VLM fine-tuning. No special losses, no multi-stage training, no curriculum.

Base model

Qwen-VL-2.5 3B — a 3-billion parameter Vision-Language Model from Alibaba. It processes images at 448×448 resolution via a ViT encoder, projects visual tokens into the LLM's embedding space, and generates text autoregressively. VLA-0 uses it exactly as released, with no architectural changes.

Fine-tuning details

VLA-0 is fully fine-tuned with Adam at a learning rate of 5e-6 and a batch size of 192, for 32 hours on 8×A100 GPUs.

The training data flow: Each training sample is (image, system_prompt, action_string). The image passes through the ViT encoder to produce visual tokens. The system prompt and action string are tokenized normally. Cross-entropy loss is computed only on the action tokens, so the model learns to predict the action string given the image and prompt. Gradients flow through the entire model, including the vision encoder.
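
Computing the loss only on action tokens is usually done by giving prompt positions a label that the loss ignores. The sketch below shows the idea in plain Python. The helper names are hypothetical, the value -100 follows the default ignore_index of PyTorch's cross-entropy loss, and the logits are assumed to be already aligned with the labels (frameworks normally apply the one-token shift internally).

```python
import math

IGNORE_INDEX = -100  # matches the default ignore_index of PyTorch's cross_entropy


def build_labels(prompt_ids, action_ids):
    """Concatenate prompt and action tokens; hide the prompt from the loss."""
    input_ids = list(prompt_ids) + list(action_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(action_ids)
    return input_ids, labels


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt position: contributes nothing to the loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # stable log-sum-exp
        total += log_z - row[label]
        count += 1
    if count == 0:
        raise ValueError("no supervised positions")
    return total / count


if __name__ == "__main__":
    input_ids, labels = build_labels(prompt_ids=[5, 6, 7], action_ids=[1, 2])
    logits = [[0.0, 0.0, 0.0, 0.0] for _ in input_ids]  # toy uniform predictions
    print(labels, masked_cross_entropy(logits, labels))
```

In a real training run the same effect comes from passing such labels to the framework's standard language-modeling loss; nothing VLA-specific is required.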

What makes this simple

Compare to π0's training: it needs a separate flow-matching loss for the action head, careful balancing between language and action losses, and a warm-up stage for the randomly initialized head. Compare to OpenVLA-OFT: it needs a custom tokenizer training phase. VLA-0 needs none of this. It is standard supervised fine-tuning with cross-entropy loss. Any VLM training framework (HuggingFace, LLaMA-Factory, etc.) works out of the box.

No pre-training required: Most high-performing VLAs use large-scale action pre-training on datasets like Open X-Embodiment (OXE) before fine-tuning on specific tasks. VLA-0 achieves SOTA results with only task-specific fine-tuning — no pre-training on any external action dataset. This dramatically reduces the compute budget.

Inference

At inference time, VLA-0 runs standard autoregressive decoding. Given an image and the system prompt, it generates tokens one at a time until it has produced H×D integers. On an NVIDIA 5090 GPU, this runs at 4 Hz — adequate for many manipulation tasks, though slower than dedicated action heads that can run at 10+ Hz.

Why full fine-tuning beats LoRA

The paper uses full fine-tuning rather than parameter-efficient methods like LoRA. This is deliberate: the model needs to learn a fundamentally new task (mapping visual observations to motor commands), not just adapt its style. The visual encoder must learn to attend to task-relevant features (object positions, gripper state) rather than the high-level semantics it was pre-trained for. Full fine-tuning allows every layer to adjust.

Compute comparison: VLA-0 trains for 32 hours on 8×A100 (~256 GPU-hours). π0 uses large-scale pre-training on OXE (thousands of GPU-hours) before task-specific fine-tuning. OpenVLA-OFT similarly requires pre-training. VLA-0's total compute is 10-100x less than alternatives that use action pre-training, making it accessible to academic labs with modest GPU budgets.

Chapter 6: LIBERO Results

LIBERO is the primary benchmark — a simulation suite for robotic manipulation with four task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long. Each suite tests different capabilities. Average success rate across all four suites is the headline metric.

Without large-scale action pretraining

This is the fairest comparison — all methods trained only on LIBERO task data:

VLA-0 achieves 94.7% average success rate with rank 1.0 across all suites — beating every method including those with far more complex architectures. The closest competitor, π0.5-KI, reaches 93.3% (rank 2.3).

WITH large-scale action pretraining

Some methods use massive pre-training on Open X-Embodiment before fine-tuning on LIBERO. VLA-0 has no such pretraining, yet remains competitive:

The remarkable finding: VLA-0 without any action pretraining (94.7%) outperforms π0 with large-scale pretraining (94.2%). A zero-modification 3B VLM fine-tuned for 32 hours beats a custom architecture pre-trained on orders of magnitude more data.

Breaking down the suites

Across the four suites (Spatial, Object, Goal, and Long), VLA-0 achieves an average rank of 1.0, meaning it is the best method on every single suite. No other method achieves this clean sweep.

Chapter 7: Real-World Experiments

Simulation benchmarks are useful but not conclusive. Real robots introduce noise, latency, and physical dynamics that simulators cannot perfectly model. VLA-0 was tested on the SO-100 robot arm across four manipulation tasks.

VLA-0 vs SmolVLA

SmolVLA is the primary comparison because it was specifically pre-trained on SO-100 data — giving it a significant data advantage. Despite this:

VLA-0 outperforms SmolVLA by 12.5 percentage points on average (60% vs 47.5% success rate) across all four tasks. SmolVLA had SO-100-specific pretraining. VLA-0 had none.

This result matters because it shows the simplicity of VLA-0 does not come at the cost of real-world performance. The plain-text action representation holds up on physical hardware, not just in simulation.

The sim-to-real gap

Many VLA methods that perform well in simulation underperform on real hardware. The main culprits: visual distribution shift (real lighting, backgrounds, and textures differ from simulation), action noise (real actuators have backlash, friction, and latency), and observation noise (real cameras have blur, reflections, and varying exposure). VLA-0's strong real-world performance suggests that preserving the VLM's visual reasoning helps bridge this gap.

Why does VLA-0 win?

The paper hypothesizes two reasons:

  1. Preserved VLM reasoning: Because the architecture is unchanged, the VLM retains its full visual understanding capabilities. It can reason about object shapes, spatial relationships, and task semantics more effectively than models whose vision capabilities were degraded during action head training.
  2. Ensemble smoothing: The temporal ensemble produces smoother trajectories on real hardware, which is more important in the physical world where jerky movements cause failures (e.g., dropping objects, missing insertion targets).

Chapter 8: Ablations

What actually matters in VLA-0's recipe? The paper systematically removes or changes each component and measures the impact on LIBERO average success rate (baseline: 94.7%).

The ablation results

The hierarchy of importance: Ensemble prediction >> Action resolution >> Masked augmentation >> Image format. The ensemble is the single most critical technique, contributing roughly twice the improvement of masking. Resolution matters in both directions — too coarse (250) is worse than too fine (4000).

What this tells us

Ensembling is king. Averaging multiple temporal predictions smooths noise and corrects individual errors. This +2.7pp gain adds almost no cost at inference, provided the model is already queried at every timestep: the overlapping predictions from previous timesteps are simply reused.

Masking is cheap insurance. +1.2pp for randomly corrupting training targets is an excellent trade-off. It requires no additional compute, just a data augmentation step.

Resolution has a sweet spot. B=1000 is optimal. Going lower hurts precision; going higher hurts tokenization efficiency. This is a clean engineering insight: match your quantization resolution to your tokenizer's integer vocabulary.

Chapter 9: Connections

VLA-0 sits at a fascinating intersection: it proves that the simplest approach can beat complex engineered alternatives, but it also raises questions about when simplicity will hit its limits.

Cheat sheet

VLA-0 in 30 seconds:
  • What: VLA with zero VLM modifications — actions as space-separated integers
  • Base model: Qwen-VL-2.5 3B, full fine-tuning
  • Key techniques: Ensemble prediction (+2.7pp), masked action augmentation (+1.2pp)
  • Resolution: B=1000 (normalize to [0,1000] integers)
  • Loss: Standard cross-entropy on text tokens
  • LIBERO: 94.7% (rank 1 without pretraining)
  • Real-world: Beats SmolVLA by 12.5pp on SO-100
  • Training: 32hr on 8×A100, Adam, lr=5e-6, batch 192
  • Inference: 4 Hz on 5090

The bigger picture

VLA-0 fits into a broader trend in AI: foundation model reuse without modification. Just as RAG lets LLMs access knowledge without retraining, and in-context learning lets them perform tasks without fine-tuning, VLA-0 shows that VLMs can control robots without architectural surgery. The key insight is not that text-as-action is a clever hack — it is that VLMs are more capable than we assumed, and the engineering complexity of previous approaches was solving a problem that did not exist.

The meta-lesson: In ML engineering, the simplest approach that can possibly work often does work — and better than you expect. VLA-0 is a reminder that before designing complex architectures, you should try the obvious thing first. The VLM already speaks numbers. Let it.