A unified open-source framework for training LLMs, VLMs, and VLAs in a single codebase — from language pretraining to robot action learning.
You want to train a Vision-Language-Action model — a robot that sees images, reads instructions, and outputs motor commands. Where do you start?
Today, most open-source VLA pipelines look like this: grab a pretrained LLM from one repo. Fine-tune it into a VLM using a second codebase with its own data format. Then attach an action head using a third codebase with yet another training loop. Each stage has different configs, different data loaders, different distributed training setups. When your VLA underperforms, you have no way to know whether the problem is in the LLM pretraining data, the VLM alignment stage, or the action head. The stages are black boxes to each other.
This isn't just inconvenient — it's scientifically crippling. The paper's key empirical finding is that decisions made during LLM and VLM pretraining directly affect downstream robot performance. If you can't control and ablate the full pipeline, you can't discover this. You're doing science with half the variables hidden.
Consider the concrete costs of the current fragmented approach:
Say you're a researcher at a robotics lab. You want to test whether a VLM trained on more diverse image data makes a better robot. Here's your workflow today:
Now repeat for each data mix you want to test. Multiply by the number of ablations in a paper. You can see why most papers just download a pretrained VLM and skip the upstream ablation entirely.
VLA Foundry is an open-source framework by Toyota Research Institute that unifies all three training stages in a single codebase. Same config system, same data loaders, same training loop, same distributed training infrastructure. One YAML file controls everything from the number of LLM pretraining tokens to the action chunk size in the VLA stage.
This means you can finally answer questions like: "Does training my LLM on 1T tokens vs 500B tokens affect robot success rate?" by changing a single config value and running the full pipeline end to end. No format conversions. No codebase switching. One command.
VLA Foundry's core thesis is deceptively simple: the pipeline is the product.
Most robotics labs treat VLA training as "take a VLM, attach an action head." The upstream stages (LLM pretraining, VLM alignment) are someone else's problem — you just download a checkpoint and move on. VLA Foundry argues this is wrong. The quality of every upstream decision compounds into the final robot's performance.
The paper's most striking result demonstrates this directly. They train two VLAs:
The result? The Qwen3-backbone model outperforms the from-scratch model by roughly 23 percentage points on the simulated manipulation benchmark (about 68% vs. 45%). Same action head, same training data, same training loop. The only difference is how much the VLM backbone already understood about the visual world before robot training began.
This finding is only possible because VLA Foundry controls the full pipeline. With fragmented codebases, you can't hold everything constant except the VLM backbone and compare. You'd have different data preprocessing, different tokenizers, different optimizer states, different random seeds. The signal would drown in confounders.
VLA Foundry's design makes the comparison clean: swap a single config block (the model definition), keep everything else identical. The framework ensures that "everything else" really means everything — same data loading order, same normalization, same distributed strategy.
Think of VLA training as a compiler pipeline. The LLM stage compiles text understanding. The VLM stage compiles visual grounding on top of that. The VLA stage compiles action generation on top of both. If your "linker" (VLM alignment) is buggy, it doesn't matter how good your "parser" (LLM) was. VLA Foundry is the integrated compiler that controls every stage.
This analogy extends further. Just as modern compilers perform whole-program optimization (optimizations that span compilation stages), VLA Foundry enables whole-pipeline optimization. You can adjust VLM training to produce features that are specifically useful for the downstream action head. In a fragmented setup, each stage optimizes in isolation, potentially making locally optimal but globally suboptimal choices.
Unification isn't free. The full LLM → VLM → VLA pipeline takes significantly more compute than just training the VLA stage on a pretrained backbone. The from-scratch Foundry-VLA-1.7B required:
For most practitioners, starting from a pretrained backbone (like Qwen3-VL) and only training the VLA stage is far more practical. The from-scratch pipeline is for research — understanding which upstream decisions matter. The practical deployment path is: pick the best available VLM, attach VLA Foundry's action head.
Building a unified training framework sounds straightforward until you actually try it. You need to support wildly different data modalities (text, images, robot trajectories), different model architectures (decoder-only LLMs, encoder-decoder VLMs, diffusion action heads), and different training objectives (next-token prediction, contrastive learning, flow matching) — all without the codebase becoming an unmaintainable monolith.
VLA Foundry solves this with four design principles:
Every component — model, dataset, optimizer, scheduler — is defined as a frozen dataclass using the Draccus library. These dataclasses are composed via YAML config files. Want to swap ViT-B for ViT-L? Change one line in the YAML. Want to mix two datasets at a 70/30 ratio? Specify it in config. No code changes required.
The key design choice is "frozen" dataclasses — once created, configs can't be mutated at runtime. This eliminates an entire class of bugs where training behavior changes depending on execution order.
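To make the frozen-config idea concrete, here is a minimal sketch using only Python's standard `dataclasses` module. It does not use Draccus or VLA Foundry's real config classes; the class names, field names, and dataset names are hypothetical and chosen only for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class VisionConfig:
    # Hypothetical field names, for illustration only.
    encoder: str = "vit-b"
    image_size: int = 224


@dataclass(frozen=True)
class TrainConfig:
    vision: VisionConfig = field(default_factory=VisionConfig)
    learning_rate: float = 3e-4
    # (dataset name, sampling weight) pairs, e.g. a 70/30 mix.
    dataset_mix: tuple = (("dataset_a", 0.7), ("dataset_b", 0.3))


base = TrainConfig()

# Mutating a frozen config raises instead of silently changing behavior.
try:
    base.learning_rate = 1e-3
    mutated = True
except FrozenInstanceError:
    mutated = False

# An ablation is a new config object derived from the old one.
ablation = replace(base, vision=replace(base.vision, encoder="vit-l"))
```

The ablation differs from the baseline in exactly one field, and the baseline object is guaranteed to be unchanged, which is the property the paragraph above relies on.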
The training loop is deliberately thin: a simple `for batch in dataloader: loss = model(batch); loss.backward(); optimizer.step()` loop. No HuggingFace Trainer, no PyTorch Lightning, no heavy abstractions. If you need to add gradient accumulation or a custom logging hook, you edit ~10 lines of Python instead of navigating a 5-level callback hierarchy.
This is a conscious design philosophy. Heavy training abstractions (HF Trainer, Lightning) are great for standard recipes but become walls when you need non-standard behavior — like co-training on text and robot data with different loss weights, or switching from next-token prediction to flow matching mid-pipeline. VLA Foundry optimizes for the researcher who needs to break the rules, not the engineer following a recipe.
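As a rough illustration of how little code gradient accumulation needs in a thin loop, here is a framework-free sketch. It is not VLA Foundry's training loop: the "model" is a single scalar weight with a hand-written gradient, and every name in it is made up for the example.

```python
def train(batches, lr=0.1, accum_steps=2):
    """Fit y = w * x with plain SGD and gradient accumulation."""
    w = 0.0
    grad_sum = 0.0
    for step, batch in enumerate(batches, start=1):
        # Forward + backward: gradient of the mean squared error w.r.t. w.
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        grad_sum += grad / accum_steps      # accumulate instead of stepping
        if step % accum_steps == 0:         # optimizer step every N micro-batches
            w -= lr * grad_sum
            grad_sum = 0.0
    return w


# Toy data generated from y = 3x, split into micro-batches of two samples.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
batches = [data[:2], data[2:]] * 50
w = train(batches)
```

The accumulation logic is the three lines around `grad_sum`; in a thin loop that is the whole change, with no callback or plugin to register.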
The distributed layer is built on PyTorch's FSDP2 (Fully Sharded Data Parallel, second generation) and scales to 128 GPUs on P5 nodes (8x H100 each). Near-linear scaling is verified up to that count with DDP. At the 1.2B-parameter scale the authors use, plain DDP outperforms FSDP: sharding overhead isn't worth it for models that fit in GPU memory. But FSDP2 is there for when you scale up.
Deterministic seeding controls all random number generators (Python, NumPy, PyTorch, CUDA). Dataloader state checkpointing means you can resume training from any checkpoint and get bit-identical results. This isn't just good practice — it's essential for ablation studies where you need to isolate the effect of a single variable.
Dataloader state checkpointing is the unsung hero here. Most training frameworks checkpoint model weights and optimizer state, but forget the dataloader. When you resume, data is reshuffled differently, meaning the model sees a different sequence of examples in the second half of training. For ablations, this introduces a confound: was the performance difference due to your config change, or due to different data ordering? VLA Foundry eliminates this by checkpointing the exact data cursor position, shuffle state, and worker state.
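The idea of checkpointing cursor position and shuffle state can be sketched in a few lines. This is a single-process toy, not VLA Foundry's dataloader: the `ResumableLoader` class and its method names are assumptions, and worker state is left out entirely.

```python
import random


class ResumableLoader:
    """Shuffled index loader whose position and RNG state can be checkpointed."""

    def __init__(self, num_samples, seed):
        self.num_samples = num_samples
        self.rng = random.Random(seed)
        self.order = []
        self.cursor = 0

    def next_index(self):
        if self.cursor == len(self.order):  # start a new epoch
            self.order = list(range(self.num_samples))
            self.rng.shuffle(self.order)
            self.cursor = 0
        idx = self.order[self.cursor]
        self.cursor += 1
        return idx

    def state_dict(self):
        return {
            "rng": self.rng.getstate(),
            "order": list(self.order),
            "cursor": self.cursor,
        }

    def load_state_dict(self, state):
        self.rng.setstate(state["rng"])
        self.order = list(state["order"])
        self.cursor = state["cursor"]


# Uninterrupted run: 25 samples drawn across several epochs of 10.
reference = ResumableLoader(10, seed=0)
uninterrupted = [reference.next_index() for _ in range(25)]

# Interrupted run: checkpoint after 7 samples, resume in a fresh loader.
first = ResumableLoader(10, seed=0)
head = [first.next_index() for _ in range(7)]
checkpoint = first.state_dict()

resumed = ResumableLoader(10, seed=123)  # seed is overwritten by the checkpoint
resumed.load_state_dict(checkpoint)
tail = [resumed.next_index() for _ in range(18)]
```

Here `head + tail` reproduces `uninterrupted` exactly. Dropping the `"rng"` entry from the checkpoint would keep the current epoch intact but reshuffle every later epoch differently, which is the confound described above.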
The four layers of VLA Foundry. Config flows down; gradients flow up.
To prove that VLA Foundry can control the entire pipeline, the authors train a model from absolute scratch — starting with raw text data and ending with a robot that manipulates objects.
The LLM is a 1.2 billion parameter transformer decoder:
| Parameter | Value |
|---|---|
| Hidden dimension | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| Total params | 1.2B |
| Training tokens | 1 trillion (DCLM dataset) |
Nothing exotic here — it's a standard autoregressive language model. The point isn't architectural novelty. The point is that this LLM was trained inside VLA Foundry, so every training decision is traceable and ablatable.
To turn the LLM into a VLM, two components are added:
The VLM is trained on 200M image-text pairs from the DataComp-DR-1B dataset. Training objective: next-token prediction on text, conditioned on image tokens. Standard VLM recipe.
Alternatives for reducing visual token count include:
Pixel-shuffle is the pragmatic choice: zero information loss, no learnable parameters, deterministic. It trades spatial resolution for channel depth, and the LLM's attention layers can recover spatial relationships from the richer channel features.
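The rearrangement described above (often called space-to-depth) can be sketched on plain nested lists. The grid size, channel count, and merge factor `r` below are toy values picked for readability; they are assumptions, not the dimensions VLA Foundry uses.

```python
def pixel_shuffle_tokens(grid, r):
    """Merge each r x r block of tokens into one token by concatenating channels.

    grid: H x W list of lists, each entry a list of C floats (one visual token).
    Returns an (H // r) x (W // r) grid whose tokens have C * r * r channels.
    """
    height, width = len(grid), len(grid[0])
    assert height % r == 0 and width % r == 0, "grid must be divisible by r"
    out = []
    for i in range(0, height, r):
        row = []
        for j in range(0, width, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


# Toy example: a 4 x 4 grid of 2-channel tokens, merged with r = 2.
grid = [[[float(4 * i + j), 0.5] for j in range(4)] for i in range(4)]
merged = pixel_shuffle_tokens(grid, r=2)

tokens_before = len(grid) * len(grid[0])     # 16 tokens, 2 channels each
tokens_after = len(merged) * len(merged[0])  # 4 tokens, 8 channels each
values_before = sorted(v for row in grid for tok in row for v in tok)
values_after = sorted(v for row in merged for tok in row for v in tok)
```

The token count drops by a factor of r² while every input value survives in the output, which is what "zero information loss, no learnable parameters" means in practice.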
Three training stages with throughput numbers.
This is the architectural idea that makes VLA Foundry tick. How do you connect a language model to a robot's motor system? You can't just take the last hidden state — that's optimized for predicting the next text token. You can't average all hidden states — that smears spatial and temporal information into mush. You need a dedicated extraction point that forces the VLM to compress everything the robot needs into one place.
VLA Foundry adds a single new token to the vocabulary: `<obs>`. During VLA training, this token is appended to the end of the input sequence, after all image tokens and the task instruction. The VLM processes the full sequence (images, text, observation token), and at the observation token position, the hidden states encode a compressed representation of "what I see and what I need to do."
But here's the critical detail: VLA Foundry doesn't just take the hidden state from the last transformer layer. It takes the hidden states from the last 4 layers, concatenates them, and feeds this multi-scale representation to the action head.
Different transformer layers encode different abstractions:
Taking only the last layer throws away the spatial precision from earlier layers. Taking all layers is wasteful — the first 10 layers mostly encode low-level token features irrelevant to actions. The last 4 layers hit the sweet spot: enough spatial detail for precise control, enough semantic abstraction for task understanding.
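A toy version of the extraction step, assuming the hidden states are available as one tensor per layer. Nested lists stand in for tensors and the sequence length is arbitrary; the 24 layers and hidden size of 2048 come from the LLM table above, so four layers concatenate to 4 × 2048 = 8192 values. The function name is hypothetical.

```python
def extract_observation_feature(hidden_states, obs_position, num_layers=4):
    """Concatenate the observation-token hidden state from the last layers.

    hidden_states: list over layers; each layer is a list over sequence
    positions; each position is a list of floats of length hidden_dim.
    """
    feature = []
    for layer in hidden_states[-num_layers:]:
        feature.extend(layer[obs_position])
    return feature


# Toy stand-in for a VLM forward pass: each layer's states are filled with
# the layer index so the origin of every value is easy to check.
num_layers, seq_len, hidden_dim = 24, 6, 2048
hidden_states = [
    [[float(layer)] * hidden_dim for _ in range(seq_len)]
    for layer in range(num_layers)
]

obs_position = seq_len - 1  # the observation token is appended last
feature = extract_observation_feature(hidden_states, obs_position)
```

The output length depends only on `num_layers` and `hidden_dim`, not on `seq_len`, so the consumer of `feature` never needs to know how many image or text tokens came before the observation token.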
Let's trace the exact tensor shapes through the architecture for 8-image input:
Architecture diagram: token counts for 1-image and 8-image inputs.
pi-0 uses cross-attention: dedicated action tokens attend to VLM tokens via separate attention heads. VLA Foundry's observation token approach is simpler — it uses the VLM's own self-attention to compress information, then hands off a fixed-size vector. This means the action head doesn't need to know the VLM's sequence length. You can change the number of images, the text length, even the VLM architecture — and the action head's input is always the same 8192-dim vector.
This decoupling is what makes VLA Foundry's "swap the backbone" experiments possible. The action head is backbone-agnostic by construction.
The 8 images come from 4 cameras at 2 timesteps:
With pixel-shuffle at 64 tokens/image, 8 images cost 512 tokens. This is the dominant cost in the VLM's context window. Reducing to 4 images (single timestep) halves the visual tokens to 256 but loses temporal context. The 2-timestep design is a deliberate trade-off: minimal temporal information at acceptable token cost.
The observation token gives us a rich 8192-dimensional vector encoding what the robot sees and what it should do. Now we need to convert that into actual motor commands. This is the job of the flow transformer — a 325 million parameter model that generates continuous action trajectories via flow matching.
Robot actions are inherently multimodal. Given a cup on a table, there are multiple valid grasps: from the top, from the side, by the handle. A regression head that predicts a single action would average these modes, producing a grasp that aims at the center of the cup — between all valid grasps and belonging to none of them.
Flow matching solves this by learning a vector field that transports random noise to the distribution of valid actions. At inference, you sample noise and follow the field for a few steps. Different noise samples land at different modes of the action distribution. No averaging.
The flow transformer is a standard transformer that takes three types of input tokens:
All three types are concatenated into a single sequence and processed by the transformer. The output at the action token positions is the predicted velocity field — the direction to move from the noised action toward the clean action.
Each action is an SE(3) pose: 3D position + 3D rotation. But rotations are tricky. Euler angles have gimbal lock. Quaternions have antipodal equivalence (q and -q represent the same rotation — but their midpoint q=0 represents nothing). VLA Foundry uses the 6D continuous rotation representation from Zhou et al. (2019): take the first two columns of the 3x3 rotation matrix. The third column can be recovered via cross product. This representation is:
Final action vector per timestep: 3 (position) + 6 (rotation) + 1 (gripper) = 10 dimensions. With a chunk size of 16 timesteps, the flow transformer outputs 160 continuous values per inference step.
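The 6D representation is short enough to write out in full. The sketch below follows the construction described above (keep two columns, orthonormalize, recover the third by cross product) on plain Python lists; the helper names and the example position and gripper values are made up.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def matrix_to_6d(rot):
    """Keep the first two columns of a 3x3 rotation matrix (row-major lists)."""
    col0 = [rot[i][0] for i in range(3)]
    col1 = [rot[i][1] for i in range(3)]
    return col0 + col1


def sixd_to_matrix(d6):
    """Gram-Schmidt the two 3-vectors, then get the third column by cross product."""
    b1 = normalize(d6[:3])
    dot = sum(x * y for x, y in zip(b1, d6[3:]))
    b2 = normalize([y - dot * x for x, y in zip(b1, d6[3:])])
    b3 = cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


# Round trip for a 90-degree rotation about the z-axis.
angle = math.pi / 2
rot_z = [
    [math.cos(angle), -math.sin(angle), 0.0],
    [math.sin(angle), math.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
]
recovered = sixd_to_matrix(matrix_to_6d(rot_z))

# One action step: 3 (position) + 6 (rotation) + 1 (gripper) = 10 values.
action = [0.1, 0.0, 0.3] + matrix_to_6d(rot_z) + [1.0]
```

The Gram-Schmidt step matters at inference time: a network's six raw outputs are generally not orthonormal, and `sixd_to_matrix` maps any non-degenerate pair of vectors to a valid rotation.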
Given a ground-truth action chunk a and Gaussian noise z, the noised sample at time t is:
The flow transformer predicts the velocity field v_θ(a_t, t), and the loss is:
This is simply: "predict the direction from noise to clean action." Elegant and fast to compute.
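Here is one common way to write this objective, as a sketch. It assumes a straight-line path with noise at t = 0 and the clean action at t = 1, so a_t = (1 − t)·z + t·a and the target velocity is a − z. That time convention is an assumption (some implementations flip it), as are the function names; the loss is a plain mean squared error.

```python
import random


def interpolate(action, noise, t):
    """Noised sample on the straight line from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * z + t * a for a, z in zip(action, noise)]


def flow_matching_loss(predict_velocity, action, noise, t):
    """Mean squared error between predicted and target velocity (action - noise)."""
    a_t = interpolate(action, noise, t)
    target = [a - z for a, z in zip(action, noise)]
    pred = predict_velocity(a_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


rng = random.Random(0)
action = [rng.uniform(-1.0, 1.0) for _ in range(160)]  # 16 steps x 10 dims, flattened
noise = [rng.gauss(0.0, 1.0) for _ in range(160)]
t = 0.3


def oracle(a_t, t):
    # Cheats by knowing the clean action; stands in for a perfectly trained network.
    return [(a - x) / (1.0 - t) for a, x in zip(action, a_t)]


def zero_model(a_t, t):
    # An untrained baseline that always predicts zero velocity.
    return [0.0] * len(a_t)


loss_oracle = flow_matching_loss(oracle, action, noise, t)
loss_zero = flow_matching_loss(zero_model, action, noise, t)
```

The oracle's loss is zero up to floating-point error, while the zero-velocity baseline pays the full squared distance between noise and action. Training pushes the network from the second case toward the first.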
At inference time, the flow transformer runs 10 denoising steps using the Euler method:
Why Euler and not a higher-order solver like Heun or RK4? Each step requires a full forward pass through the 325M parameter transformer. Higher-order solvers need 2-4 forward passes per step. At 10 Euler steps, that's 10 forward passes total. Heun would need 20. Since the flow transformer is called inside the robot's control loop, every additional forward pass increases latency.
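The sampling loop itself is tiny. The sketch below integrates from noise at t = 0 to an action at t = 1 (an assumed time convention) and counts forward passes. The velocity function is a toy that points straight at one known target; in the real system the 325M flow transformer would take its place.

```python
def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    calls = 0
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)  # one forward pass of the action head per step
        calls += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, calls


# Toy velocity field aimed at a single known target; a trained flow
# transformer would replace this function.
target = [0.2, -0.4, 0.9]


def toy_velocity(x, t):
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]


sample, forward_passes = euler_sample(toy_velocity, noise=[1.5, -2.0, 0.3])
```

With this toy field the sample lands on the target after 10 steps and 10 forward passes. A Heun-style solver would call `velocity_fn` twice inside the loop body, doubling the count to 20 for the same number of steps.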
Noise transforming into a smooth action trajectory over 10 Euler steps.
A unified training framework is only as good as its data pipeline. VLA Foundry needs to handle three radically different data types — text, image-text pairs, and robot trajectories — through the same loading infrastructure. And it needs to do this at scale, across hundreds of GPUs, without becoming a bottleneck.
All data is stored as WebDataset shards — tar files containing samples as adjacent files (e.g., 000001.jpg, 000001.json, 000001.actions.npy). Why tar files? Because they're sequential-read friendly. On cloud storage (S3, GCS), random access is expensive (one HTTP request per file). Sequential reads from tar files are ~10x faster and trivially parallelizable across workers.
Converting raw robot datasets (often stored as HDF5, ROS bags, or RLDS) into WebDataset shards is a one-time cost. VLA Foundry uses Ray for parallel preprocessing, distributing the conversion across many CPU cores. This handles the heterogeneous formats from different data sources: OXE, DROID, in-house TRI data.
Robot actions from different datasets have wildly different scales. One dataset might record joint velocities in radians/second; another uses end-effector positions in meters. You can't mix them without normalization.
VLA Foundry uses percentile-based normalization via the t-digest algorithm:
Why percentiles instead of min/max or z-score? Because robot datasets have outliers — a single corrupted trajectory with joint velocities of 10,000 rad/s would destroy min/max normalization. Percentiles are robust to outliers by construction.
The t-digest data structure is mergeable: you can compute t-digests for each dataset independently, then merge them for the combined normalization. This enables dataset mixing without recomputing statistics from scratch.
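To see why percentiles behave better than min/max here, consider a small worked example. It uses exact percentiles from a sorted list in place of a t-digest (which approximates the same quantities in a streaming, mergeable way). The 1st/99th percentile bounds and the mapping to [-1, 1] with clipping are assumptions for illustration, not confirmed VLA Foundry settings.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def fit_bounds(values, low_q=1.0, high_q=99.0):
    ordered = sorted(values)
    return percentile(ordered, low_q), percentile(ordered, high_q)


def normalize_value(value, low, high):
    """Map [low, high] to [-1, 1] and clip anything outside."""
    scaled = 2.0 * (value - low) / (high - low) - 1.0
    return max(-1.0, min(1.0, scaled))


# 1000 well-behaved joint velocities in [-1, 1] rad/s plus one corrupted sample.
clean = [-1.0 + 2.0 * i / 999 for i in range(1000)]
values = clean + [10_000.0]

low, high = fit_bounds(values)

# Where does a velocity of 0 rad/s end up under each scheme?
minmax_mid = 2.0 * (0.0 - min(values)) / (max(values) - min(values)) - 1.0
percentile_mid = normalize_value(0.0, low, high)
```

Under min/max scaling the single outlier squashes every legitimate value to just above -1, so a velocity of 0 rad/s lands near -1 instead of near 0. With percentile bounds it stays close to 0, and the outlier is simply clipped to 1.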
Actions are represented as relative SE(3) transforms with action chunking:
During VLA training, VLA Foundry doesn't just train on robot data. It co-trains on a mix of robot trajectories and VLM data (image-text pairs). This prevents catastrophic forgetting of the VLM's visual understanding during action learning. The mixing ratio is a hyperparameter — too much VLM data slows action learning, too little causes the VLM features to degrade.
This is another benefit of the unified codebase: co-training across modalities is trivial because all data formats flow through the same loader. In a fragmented setup, you'd need to bridge two completely separate data pipelines.
When training on multiple robot datasets simultaneously, VLA Foundry supports configurable sampling weights. You might want 60% sim data and 40% real data, or 80% single-task and 20% cross-task. Each dataset's normalization statistics are computed independently and merged via t-digest, so adding a new dataset doesn't require recomputing global statistics.
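Configurable sampling weights amount to drawing a source dataset for each sample in proportion to its weight. The sketch below uses the standard library's `random.choices`; the function name and the dataset names are placeholders, and the 60/40 split is the example from the paragraph above.

```python
import random
from collections import Counter


def sample_sources(weights, num_draws, seed):
    """Draw a dataset name for each sample according to the configured weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_draws)


mix = {"sim": 0.6, "real": 0.4}  # hypothetical dataset names
draws = sample_sources(mix, num_draws=10_000, seed=0)
counts = Counter(draws)
sim_fraction = counts["sim"] / len(draws)
```

Because the generator is seeded, the same config and seed always produce the same sequence of sources, which keeps the mix itself from becoming a hidden variable in an ablation.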
The dataloader handles heterogeneous action dimensions gracefully: datasets with different numbers of joints are padded to a common dimensionality, with a mask indicating which dimensions are active. This lets you train a single model on data from a 6-DoF arm and a 7-DoF arm without custom engineering per dataset.
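Padding with a mask can be sketched as follows. The source only states that a mask marks the active dimensions; applying that mask inside the loss, as `masked_mse` does here, is an assumption about how it would typically be used, and both function names are hypothetical.

```python
def pad_action(action, target_dim):
    """Right-pad an action vector with zeros and return it with a validity mask."""
    if len(action) > target_dim:
        raise ValueError("action is longer than the common dimensionality")
    pad = target_dim - len(action)
    padded = list(action) + [0.0] * pad
    mask = [1.0] * len(action) + [0.0] * pad
    return padded, mask


def masked_mse(pred, target, mask):
    """Mean squared error over active dimensions only."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / sum(mask)


six_dof = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
seven_dof = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

padded6, mask6 = pad_action(six_dof, target_dim=7)
padded7, mask7 = pad_action(seven_dof, target_dim=7)

# A prediction that is wrong only in the padded slot incurs no loss.
pred = six_dof + [123.0]
loss = masked_mse(pred, padded6, mask6)
```

Both arms now produce 7-dimensional targets that can share a batch, and the model is never penalized for what it predicts in a joint the 6-DoF arm does not have.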
How do you rigorously evaluate a VLA? Real-robot evaluation is expensive, noisy, and hard to reproduce. VLA Foundry uses the LBM benchmark — a simulation-based evaluation suite that tests manipulation at scale.
LBM (Large Behavior Model) is a benchmark with 49 bimanual manipulation tasks running in the Drake physics simulator. Tasks range from simple pick-and-place to complex multi-step sequences like assembling objects or manipulating deformable items.
Why simulation? Three reasons:
Raw success rates are noisy, even at 200 episodes per task. A model with 55% vs 50% success on a single task is likely within noise. VLA Foundry uses STEP (Statistical Testing for Evaluation of Policies) for rigorous comparison:
The 49 tasks are grouped into categories:
An important limitation: LBM is simulation-only. Performance on simulated tasks doesn't guarantee real-world success. The paper acknowledges this gap directly — models trained only on real data score ~0% on the sim benchmark, and vice versa. The sim benchmark measures sim performance, period.
That said, the benchmark is still valuable for comparing models against each other. If Model A beats Model B on 49 simulated tasks with rigorous statistical testing, it's strong evidence that A is the better policy in simulation, whatever the source of the gain (backbone, data, or architecture). The absolute numbers may not transfer to real hardware, but the relative ranking likely does.
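The earlier claim that 55% vs 50% at 200 episodes is within noise can be checked with a back-of-the-envelope calculation. This is a textbook two-proportion z-score, not the STEP procedure, and the pooled case assumes (unrealistically) that the same two rates hold on all 49 tasks.

```python
import math


def two_proportion_z(successes_a, successes_b, episodes):
    """z-score for the difference of two success rates with equal episode counts."""
    p_a = successes_a / episodes
    p_b = successes_b / episodes
    pooled = (successes_a + successes_b) / (2 * episodes)
    std_err = math.sqrt(2.0 * pooled * (1.0 - pooled) / episodes)
    return (p_a - p_b) / std_err, std_err


# 55% vs 50% success on one task at 200 episodes each.
z_single, se_single = two_proportion_z(110, 100, 200)

# The same rates pooled over 49 tasks (49 * 200 episodes per model).
z_pooled, se_pooled = two_proportion_z(110 * 49, 100 * 49, 200 * 49)
```

On a single task the standard error of the difference is about 5 percentage points, so a 5-point gap is roughly one standard error and not distinguishable from chance. Pooled over all tasks the standard error shrinks by a factor of 7, which is why aggregate comparisons can be conclusive when per-task ones are not.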
Beyond task performance, VLA Foundry benchmarks its own training throughput on P5 nodes (8x H100 GPUs per node):
| Scale | Strategy | Throughput |
|---|---|---|
| 8 GPUs (1 node) | DDP | Baseline |
| 64 GPUs (8 nodes) | DDP | ~7.8x (near-linear) |
| 128 GPUs (16 nodes) | DDP | ~15x (near-linear) |
| 8 GPUs | FSDP2 | Slightly slower than DDP |
At 1.2B parameters, the model fits comfortably in a single GPU's memory, so FSDP's sharding adds communication overhead without benefit. FSDP2 is included for future scaling to larger models where sharding becomes necessary.
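"Near-linear" can be turned into a number by dividing the measured speedup by the ideal one. The helper below is a small arithmetic sketch using the approximate speedups from the table; the function name is made up.

```python
def scaling_efficiency(speedup, num_gpus, baseline_gpus=8):
    """Fraction of ideal linear speedup achieved relative to the baseline."""
    ideal = num_gpus / baseline_gpus
    return speedup / ideal


eff_64 = scaling_efficiency(7.8, 64)     # 7.8 / 8
eff_128 = scaling_efficiency(15.0, 128)  # 15 / 16
```

That works out to roughly 97.5% efficiency at 64 GPUs and about 94% at 128, taking the table's rounded figures at face value.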
The experiments answer three critical questions: How good are the released models? Does VLM backbone quality matter? When does multi-task training help?
| Model | LBM Score | Notes |
|---|---|---|
| Prior closed-source LBM | ~45% | Baseline (proprietary) |
| Foundry-VLA-1.7B-MT-sim | ~45% | From-scratch, on par with LBM |
| Foundry-Qwen3VLA-2.1B-MT | ~68% | +23 pp over LBM (!) |
The Qwen3-backbone model doesn't just match the prior state of the art — it crushes it by 23 percentage points. This is a massive gap, well beyond statistical noise.
The from-scratch 1.2B backbone (trained on 1T text tokens + 200M image-text pairs) matches but doesn't exceed the prior baseline. The Qwen3-VL 2B backbone (pretrained on vastly more data by the Qwen team) dramatically outperforms it. Same action head, same robot data, same training loop. The only variable is how good the VLM backbone is.
Multi-task (MT) training trains on all 49 tasks simultaneously. Single-task (ST) trains a separate model per task. The results are nuanced:
This is a critical practical insight: multi-task training only helps when the backbone is strong enough. For weaker models, specialize.
| Training Data | Sim Benchmark Result |
|---|---|
| Real-only | ~0% (complete failure) |
| Sim-only | Best sim performance |
| Sim + real mixed | Worse than sim-only on sim benchmark |
Real-only training transfers zero to simulation — the distribution shift between real and simulated visuals/dynamics is too large. And mixing real data into sim training actually hurts sim performance, because the model must accommodate two visual distributions instead of one.
Looking across all results, a clear hierarchy emerges:
This gradient is the paper's most actionable finding. If you're building a VLA and choosing where to invest compute: upgrade your VLM backbone first. Only then invest in more robot data or fancier action heads.
LBM benchmark scores. Higher is better.
| Aspect | VLA Foundry |
|---|---|
| Core idea | Unified LLM → VLM → VLA training in one codebase |
| Models released | Foundry-VLA-1.7B + Foundry-Qwen3VLA-2.1B |
| Action head | 325M flow transformer (10 Euler steps) |
| Key mechanism | Observation token + multi-layer hidden state extraction |
| Action repr. | SE(3), 6D rotation, relative, chunked |
| Normalization | t-digest percentile, mergeable across datasets |
| Scale | Up to 128 GPUs (P5 nodes, 8xH100), near-linear DDP |
| Best result | Qwen3VLA outperforms LBM by +23 pp |
| Main finding | Stronger VLM backbone → stronger VLA |
| Config | YAML + Draccus frozen dataclasses |
| Framework | LLM Stage | VLM Stage | VLA Stage | Unified? |
|---|---|---|---|---|
| OpenVLA | External | External | Yes | No (VLA only) |
| pi-0 | External | External | Yes | No (VLA only) |
| Octo | External | Partial | Yes | Partial |
| VLA Foundry | Yes | Yes | Yes | Full pipeline |