Mercat, Keh, Arora, Huang, Shah et al. — TRI, 2026

VLA Foundry

A unified open-source framework for training LLMs, VLMs, and VLAs in a single codebase — from language pretraining to robot action learning.

Prerequisites: LLM training + VLMs + flow matching basics
10 chapters · 4+ simulations

Chapter 0: The Problem

You want to train a Vision-Language-Action model — a robot that sees images, reads instructions, and outputs motor commands. Where do you start?

Today, most open-source VLA pipelines look like this: grab a pretrained LLM from one repo. Fine-tune it into a VLM using a second codebase with its own data format. Then attach an action head using a third codebase with yet another training loop. Each stage has different configs, different data loaders, different distributed training setups. When your VLA underperforms, you have no way to know whether the problem is in the LLM pretraining data, the VLM alignment stage, or the action head. The stages are black boxes to each other.

This isn't just inconvenient — it's scientifically crippling. The paper's key empirical finding is that decisions made during LLM and VLM pretraining directly affect downstream robot performance. If you can't control and ablate the full pipeline, you can't discover this. You're doing science with half the variables hidden.

The fragmentation tax

Consider the concrete costs of the current fragmented approach:

The root cause: VLA training is not just the action learning stage. It's the full LLM → VLM → VLA pipeline. But no single codebase controls all three stages, so the most important ablations — the ones that span stages — are never run.

A concrete example of the problem

Say you're a researcher at a robotics lab. You want to test whether a VLM trained on more diverse image data makes a better robot. Here's your workflow today:

  1. Find an LLM pretraining repo (say, LLaMA's). Pretrain a 1B model. Takes 2 days.
  2. Find a VLM training repo (say, LLaVA's). Convert your LLM checkpoint to their format. Discover their tokenizer is different. Spend a day debugging. Train the VLM on your image data. Takes 1 day.
  3. Find a VLA training repo (say, OpenVLA's). Convert your VLM checkpoint again. Their data loader expects RLDS format; yours is HDF5. Write a converter. Their distributed setup is DDP; you needed FSDP for the VLM. Adapt. Train the VLA. Takes 1 day.
  4. Total: 5 days + 3 format conversions + 2 debugging sessions, just to test ONE ablation.

Now repeat for each data mix you want to test. Multiply by the number of ablations in a paper. You can see why most papers just download a pretrained VLM and skip the upstream ablation entirely.

What VLA Foundry changes

VLA Foundry is an open-source framework by Toyota Research Institute that unifies all three training stages in a single codebase. Same config system, same data loaders, same training loop, same distributed training infrastructure. One YAML file controls everything from the number of LLM pretraining tokens to the action chunk size in the VLA stage.

This means you can finally answer questions like: "Does training my LLM on 1T tokens vs 500B tokens affect robot success rate?" by changing a single config value and running the full pipeline end to end. No format conversions. No codebase switching. One command.

Why do most open-source VLA efforts struggle to ablate across training stages?

Chapter 1: The Key Insight

VLA Foundry's core thesis is deceptively simple: the pipeline is the product.

Most robotics labs treat VLA training as "take a VLM, attach an action head." The upstream stages (LLM pretraining, VLM alignment) are someone else's problem — you just download a checkpoint and move on. VLA Foundry argues this is wrong. The quality of every upstream decision compounds into the final robot's performance.

Evidence: VLM backbone quality determines VLA quality

The paper's most striking result demonstrates this directly. They train two VLAs:

  1. Foundry-VLA-1.7B: trained from scratch through the full LLM → VLM → VLA pipeline using their own 1.2B parameter LLM backbone.
  2. Foundry-Qwen3VLA-2.1B: uses the same action head but swaps in a pretrained Qwen3-VL 2B as the VLM backbone — a model that saw vastly more vision-language data during its own pretraining.

The result? The Qwen3-backbone model outperforms the from-scratch model by a wide margin: roughly 68% versus 45% on the LBM benchmark (see Chapter 8). Same action head, same training data, same training loop. The only difference is how much the VLM backbone already understood about the visual world before robot training began.

The implication is profound: If you want a better robot, don't just collect more robot data. Train a better VLM. And if you want a better VLM, train a better LLM. The pipeline flows downhill — improvements at any stage lift everything below it.

Why unification enables this discovery

This finding is only possible because VLA Foundry controls the full pipeline. With fragmented codebases, you can't hold everything constant except the VLM backbone and compare. You'd have different data preprocessing, different tokenizers, different optimizer states, different random seeds. The signal would drown in confounders.

VLA Foundry's design makes the comparison clean: swap a single config block (the model definition), keep everything else identical. The framework ensures that "everything else" really means everything — same data loading order, same normalization, same distributed strategy.

The full-stack mental model

Think of VLA training as a compiler pipeline. The LLM stage compiles text understanding. The VLM stage compiles visual grounding on top of that. The VLA stage compiles action generation on top of both. If your "linker" (VLM alignment) is buggy, it doesn't matter how good your "parser" (LLM) was. VLA Foundry is the integrated compiler that controls every stage.

This analogy extends further. Just as modern compilers perform whole-program optimization (optimizations that span compilation stages), VLA Foundry enables whole-pipeline optimization. You can adjust VLM training to produce features that are specifically useful for the downstream action head. In a fragmented setup, each stage optimizes in isolation, potentially making locally optimal but globally suboptimal choices.

What the unified pipeline costs

Unification isn't free. The full LLM → VLM → VLA pipeline takes significantly more compute than just training the VLA stage on a pretrained backbone. The from-scratch Foundry-VLA-1.7B required:

For most practitioners, starting from a pretrained backbone (like Qwen3-VL) and only training the VLA stage is far more practical. The from-scratch pipeline is for research — understanding which upstream decisions matter. The practical deployment path is: pick the best available VLM, attach VLA Foundry's action head.

What is VLA Foundry's key empirical finding about VLA training?

Chapter 2: Framework Architecture

Building a unified training framework sounds straightforward until you actually try it. You need to support wildly different data modalities (text, images, robot trajectories), different model architectures (decoder-only LLMs, encoder-decoder VLMs, diffusion action heads), and different training objectives (next-token prediction, contrastive learning, flow matching) — all without the codebase becoming an unmaintainable monolith.

VLA Foundry solves this with four design principles:

Principle 1: Modularity and Composability

Every component — model, dataset, optimizer, scheduler — is defined as a frozen dataclass using the Draccus library. These dataclasses are composed via YAML config files. Want to swap ViT-B for ViT-L? Change one line in the YAML. Want to mix two datasets at a 70/30 ratio? Specify it in config. No code changes required.

The key design choice is "frozen" dataclasses — once created, configs can't be mutated at runtime. This eliminates an entire class of bugs where training behavior changes depending on execution order.
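
A minimal sketch of the frozen-config idea, using only Python's standard `dataclasses` module rather than Draccus itself. The class and field names (`ModelConfig`, `TrainConfig`, `vision_encoder`) are hypothetical and chosen for illustration; they are not VLA Foundry's actual config schema.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical fields, not the framework's real schema.
    vision_encoder: str = "vit-b"
    hidden_dim: int = 2048


@dataclass(frozen=True)
class TrainConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    seed: int = 0


def try_mutate(cfg: TrainConfig) -> bool:
    """Return True if in-place mutation was blocked by the frozen dataclass."""
    try:
        cfg.seed = 1  # frozen dataclasses raise on attribute assignment
    except FrozenInstanceError:
        return True
    return False


base = TrainConfig()
# An ablation is a new object built from the baseline; the baseline is untouched.
ablation = replace(base, model=replace(base.model, vision_encoder="vit-l"))
```

Because an ablation has to be built with `replace`, the baseline config cannot drift during a run, which is the property the paragraph above relies on.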

Principle 2: Hackability

The training loop is deliberately thin: a simple for batch in dataloader: loss = model(batch); loss.backward(); optimizer.step() loop. No HuggingFace Trainer, no PyTorch Lightning, no heavy abstractions. If you need to add gradient accumulation or a custom logging hook, you edit ~10 lines of Python, not navigate a 5-level callback hierarchy.

This is a conscious design philosophy. Heavy training abstractions (HF Trainer, Lightning) are great for standard recipes but become walls when you need non-standard behavior — like co-training on text and robot data with different loss weights, or switching from next-token prediction to flow matching mid-pipeline. VLA Foundry optimizes for the researcher who needs to break the rules, not the engineer following a recipe.

Principle 3: Performance

VLA Foundry is built on PyTorch's FSDP2 (Fully Sharded Data Parallel, second generation) and scales to 128 GPUs on P5 nodes (8x H100 each), with near-linear scaling verified up to that count using DDP. At the 1.2B parameter scale they use, plain DDP outperforms FSDP, because sharding overhead isn't worth it for models that fit in GPU memory. FSDP2 is there for when you scale up.

Principle 4: Reproducibility

Deterministic seeding controls all random number generators (Python, NumPy, PyTorch, CUDA). Dataloader state checkpointing means you can resume training from any checkpoint and get bit-identical results. This isn't just good practice — it's essential for ablation studies where you need to isolate the effect of a single variable.

Dataloader state checkpointing is the unsung hero here. Most training frameworks checkpoint model weights and optimizer state, but forget the dataloader. When you resume, data is reshuffled differently, meaning the model sees a different sequence of examples in the second half of training. For ablations, this introduces a confound: was the performance difference due to your config change, or due to different data ordering? VLA Foundry eliminates this by checkpointing the exact data cursor position, shuffle state, and worker state.
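
The idea can be illustrated with a toy loader. This is a sketch under simplifying assumptions (single process, no workers, in-memory indices); `ResumableLoader` and its methods are hypothetical names, not the framework's API.

```python
import random


class ResumableLoader:
    """Toy loader that reshuffles every epoch and can checkpoint its position."""

    def __init__(self, n_samples: int, seed: int):
        self.n = n_samples
        self.rng = random.Random(seed)
        self._start_epoch()

    def _start_epoch(self):
        self.order = list(range(self.n))
        self.rng.shuffle(self.order)
        self.cursor = 0

    def next(self) -> int:
        if self.cursor == self.n:
            self._start_epoch()
        sample = self.order[self.cursor]
        self.cursor += 1
        return sample

    def state_dict(self) -> dict:
        # Save the cursor, the current shuffle, and the RNG state for future epochs.
        return {"order": list(self.order), "cursor": self.cursor,
                "rng": self.rng.getstate()}

    def load_state_dict(self, state: dict):
        self.order = list(state["order"])
        self.cursor = state["cursor"]
        self.rng.setstate(state["rng"])


loader = ResumableLoader(10, seed=0)
for _ in range(7):
    loader.next()
ckpt = loader.state_dict()
expected = [loader.next() for _ in range(8)]  # crosses an epoch boundary

resumed = ResumableLoader(10, seed=123)  # seed is overwritten by the checkpoint
resumed.load_state_dict(ckpt)
replayed = [resumed.next() for _ in range(8)]
```

Saving only the model and optimizer would leave `resumed` with a fresh shuffle; saving the RNG state as well keeps the next epoch's order identical too.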

The 4 architectural layers: (1) YAML config system → (2) Registry for pluggable models and datasets → (3) Modality-specific dataloading → (4) Model-agnostic training loop. Each layer only talks to its neighbors. The training loop doesn't know whether it's training an LLM or a VLA.

[Figure: Framework Layers. The four layers of VLA Foundry. Config flows down; gradients flow up.]

Why does VLA Foundry use frozen dataclasses for configuration?

Chapter 3: LLM → VLM: Building the Backbone

To prove that VLA Foundry can control the entire pipeline, the authors train a model from absolute scratch — starting with raw text data and ending with a robot that manipulates objects.

Stage 1: LLM pretraining

The LLM is a 1.2 billion parameter transformer decoder:

Parameter | Value
Hidden dimension | 2048
Layers | 24
Attention heads | 16
Total params | 1.2B
Training tokens | 1 trillion (DCLM dataset)

Nothing exotic here — it's a standard autoregressive language model. The point isn't architectural novelty. The point is that this LLM was trained inside VLA Foundry, so every training decision is traceable and ablatable.

Stage 2: VLM alignment

To turn the LLM into a VLM, two components are added:

  1. Vision encoder: An 86M parameter ViT (CLIP-style), processing 224x224 images. This produces a grid of patch tokens — 14x14 = 196 tokens per image at the standard patch size.
  2. Pixel-shuffle pooling: 196 tokens per image is expensive in the LLM's context window, especially when you'll later need 8 images for VLA. Pixel-shuffle merges adjacent patch tokens spatially, reducing 196 tokens down to 64 tokens per image. This is roughly a 3x compression (196 / 64 ≈ 3.06) with minimal information loss, because neighboring patches are highly correlated.

The VLM is trained on 200M image-text pairs from the DataComp-DR-1B dataset. Training objective: next-token prediction on text, conditioned on image tokens. Standard VLM recipe.

Why pixel-shuffle and not something else?

Alternatives for reducing visual token count include:

Pixel-shuffle is the pragmatic choice: the rearrangement itself discards nothing, adds no learnable parameters, and is deterministic. It trades spatial resolution for channel depth, and the LLM's attention layers can recover spatial relationships from the richer channel features.
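
The "spatial resolution for channel depth" trade can be sketched in plain Python. The exact grid size and merge factor that yield 64 tokens are not spelled out in this lesson, so the example assumes a hypothetical 16x16 grid with a 2x2 merge (giving 8x8 = 64 tokens) and 4 channels per token. `pixel_shuffle_pool` is an illustrative name, not a framework function.

```python
def pixel_shuffle_pool(grid, r):
    """Merge each r x r block of patch tokens into one token with r*r*C channels.

    grid[y][x] is a list of C channel values for the patch at row y, column x.
    """
    h, w = len(grid), len(grid[0])
    if h % r or w % r:
        raise ValueError("grid height and width must be divisible by r")
    out = []
    for y in range(0, h, r):
        row = []
        for x in range(0, w, r):
            merged = []
            for dy in range(r):
                for dx in range(r):
                    merged.extend(grid[y + dy][x + dx])  # stack neighbors as channels
            row.append(merged)
        out.append(row)
    return out


# Hypothetical 16x16 grid, 4 channels per token, each token filled with its index.
grid = [[[float(y * 16 + x)] * 4 for x in range(16)] for y in range(16)]
pooled = pixel_shuffle_pool(grid, r=2)
n_tokens = len(pooled) * len(pooled[0])
```

Every input value appears exactly once in the output, so the token count drops by r*r while the per-token channel count grows by the same factor.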

The token budget: With pixel-shuffle, each image costs 64 tokens. In the VLA stage, 8 images cost 512 tokens. Add ~30 tokens for the task instruction, 1 for the observation token, and the total context is ~543 tokens — easily manageable for a 2048-context LLM.
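
The budget arithmetic above is easy to re-run for other camera setups. A small sketch, where `context_tokens` is a hypothetical helper and the defaults are the numbers quoted in this lesson:

```python
def context_tokens(n_images, tokens_per_image=64, text_tokens=30, obs_tokens=1):
    """Total sequence length seen by the VLM for one control step."""
    return n_images * tokens_per_image + text_tokens + obs_tokens


with_pooling = context_tokens(8)                           # 8 images at 64 tokens each
single_timestep = context_tokens(4)                        # 4 cameras, 1 timestep
without_pooling = context_tokens(8, tokens_per_image=196)  # raw ViT patch tokens
```

With 8 unpooled images the visual tokens alone would take 1568 positions, which is most of a 2048-token context.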

[Figure: LLM → VLM → VLA Pipeline. Three training stages with throughput numbers.]

Why does VLA Foundry use pixel-shuffle pooling for visual tokens?

Chapter 4: The Observation Token

This is the architectural idea that makes VLA Foundry tick. How do you connect a language model to a robot's motor system? You can't just take the last hidden state — that's optimized for predicting the next text token. You can't average all hidden states — that smears spatial and temporal information into mush. You need a dedicated extraction point that forces the VLM to compress everything the robot needs into one place.

The observation token

VLA Foundry adds a single new token to the vocabulary: <obs>. During VLA training, this token is appended to the end of the input sequence, after all image tokens and the task instruction. The VLM processes the full sequence — images, text, observation token — and at the observation token position, the hidden states encode a compressed representation of "what I see and what I need to do."

But here's the critical detail: VLA Foundry doesn't just take the hidden state from the last transformer layer. It takes the hidden states from the last 4 layers, concatenates them, and feeds this multi-scale representation to the action head.

Why last 4 layers, not just the last?

Different transformer layers encode different abstractions:

Taking only the last layer throws away the spatial precision from earlier layers. Taking all layers is wasteful — the first 10 layers mostly encode low-level token features irrelevant to actions. The last 4 layers hit the sweet spot: enough spatial detail for precise control, enough semantic abstraction for task understanding.

Complete data flow

Let's trace the exact tensor shapes through the architecture for 8-image input:

  1. 8 camera images (2 wrist + 2 external, each at 2 timesteps) → ViT encoder → 8 × 196 patch tokens
  2. Pixel-shuffle → 8 × 64 = 512 visual tokens, each of dim 2048
  3. Task text → ~30 language tokens
  4. Observation token → 1 token (new vocab item)
  5. VLM forward pass through all 24 layers → hidden states at obs token position from layers 21, 22, 23, 24: 4 × 2048 = 8192-dim vector
  6. This 8192-dim vector → input to the 325M flow transformer action head

The observation token is a bottleneck by design. It forces the VLM to compress all task-relevant information (from 512 image tokens + 30 text tokens) into a single position. This is like the [CLS] token in BERT, but for robot actions. The multi-layer extraction ensures the bottleneck doesn't lose spatial precision.
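
Step 5 of the data flow above, the multi-layer extraction, can be sketched with nested lists standing in for tensors. The sequence is shortened to 3 positions to keep the example light, and each layer's activations are filled with the layer number so the result is easy to inspect. `obs_features` is a hypothetical helper name.

```python
def obs_features(hidden_states, obs_index, n_last=4):
    """Concatenate the observation-token vector from the last n_last layers.

    hidden_states[l][p] is the hidden vector at layer l, sequence position p.
    """
    feats = []
    for layer in hidden_states[-n_last:]:
        feats.extend(layer[obs_index])
    return feats


n_layers, seq_len, dim = 24, 3, 2048  # seq_len shortened for illustration
hidden_states = [
    [[float(layer_id)] * dim for _ in range(seq_len)]
    for layer_id in range(1, n_layers + 1)
]
# The observation token is the last position in the sequence.
features = obs_features(hidden_states, obs_index=-1)
```

The output length depends only on the number of layers taken and the hidden size, not on how many image or text tokens precede the observation token.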

[Figure: Architecture Data Flow. Token counts in 1-image and 8-image modes.]

Why not cross-attention?

pi-0 uses cross-attention: dedicated action tokens attend to VLM tokens via separate attention heads. VLA Foundry's observation token approach is simpler — it uses the VLM's own self-attention to compress information, then hands off a fixed-size vector. This means the action head doesn't need to know the VLM's sequence length. You can change the number of images, the text length, even the VLM architecture — and the action head's input is always the same 8192-dim vector.

This decoupling is what makes VLA Foundry's "swap the backbone" experiments possible. The action head is backbone-agnostic by construction.

The camera setup: 8 images explained

The 8 images come from 4 cameras at 2 timesteps:

With pixel-shuffle at 64 tokens/image, 8 images cost 512 tokens. This is the dominant cost in the VLM's context window. Reducing to 4 images (single timestep) halves the visual tokens to 256 but loses temporal context. The 2-timestep design is a deliberate trade-off: minimal temporal information at acceptable token cost.

What does the observation token do in VLA Foundry's architecture?

Chapter 5: Flow Transformer Action Head

The observation token gives us a rich 8192-dimensional vector encoding what the robot sees and what it should do. Now we need to convert that into actual motor commands. This is the job of the flow transformer — a 325 million parameter model that generates continuous action trajectories via flow matching.

Why flow matching?

Robot actions are inherently multimodal. Given a cup on a table, there are multiple valid grasps: from the top, from the side, by the handle. A regression head that predicts a single action would average these modes, producing a grasp that aims at the center of the cup — between all valid grasps and belonging to none of them.

Flow matching solves this by learning a vector field that transports random noise to the distribution of valid actions. At inference, you sample noise and follow the field for a few steps. Different noise samples land at different modes of the action distribution. No averaging.

The flow transformer's inputs

The flow transformer is a standard transformer that takes three types of input tokens:

  1. VLM features: The 8192-dim concatenated observation token features, projected to the flow transformer's hidden dimension.
  2. Proprioception: Current joint angles + gripper state, projected via a linear layer to a single token.
  3. Noised action sequence: A chunk of future actions (e.g., 16 timesteps × 10 action dims), each timestep projected to a token via a linear layer. At training time, these are ground-truth actions blended with Gaussian noise. At inference, they start as pure noise.

All three types are concatenated into a single sequence and processed by the transformer. The output at the action token positions is the predicted velocity field — the direction to move from the noised action toward the clean action.

Action representation: SE(3) with 6D rotation

Each action is an SE(3) pose: 3D position + 3D rotation. But rotations are tricky. Euler angles have gimbal lock. Quaternions have antipodal equivalence (q and -q represent the same rotation — but their midpoint q=0 represents nothing). VLA Foundry uses the 6D continuous rotation representation from Zhou et al. (2019): take the first two columns of the 3x3 rotation matrix. The third column can be recovered via cross product. This representation is:

Final action vector per timestep: 3 (position) + 6 (rotation) + 1 (gripper) = 10 dimensions. With a chunk size of 16 timesteps, the flow transformer outputs 160 continuous values per inference step.

Training: flow matching objective

Given a ground-truth action chunk $a$ and Gaussian noise $z$, the noised sample at time $t$ is:

$$a_t = (1 - t)\, z + t\, a$$

The flow transformer predicts the velocity field $v_\theta(a_t, t)$, and the loss is:

$$\mathcal{L} = \lVert v_\theta(a_t, t) - (a - z) \rVert^2$$

This is simply: "predict the direction from noise to clean action." Elegant and fast to compute.
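
A minimal sketch of how one training example is built, using plain Python lists in place of tensors. The network itself is omitted; `flow_matching_loss` takes whatever velocity a model would predict. All function names are illustrative.

```python
import random


def interpolate(a, z, t):
    """Noised sample: (1 - t) * z + t * a, elementwise."""
    return [(1.0 - t) * zi + t * ai for ai, zi in zip(a, z)]


def target_velocity(a, z):
    """Regression target: the straight-line direction from noise to action."""
    return [ai - zi for ai, zi in zip(a, z)]


def flow_matching_loss(v_pred, a, z):
    """Squared L2 distance between predicted and target velocity."""
    target = target_velocity(a, z)
    return sum((p - q) ** 2 for p, q in zip(v_pred, target))


rng = random.Random(0)
a = [rng.uniform(-1.0, 1.0) for _ in range(160)]  # 16 timesteps x 10 dims, flattened
z = [rng.gauss(0.0, 1.0) for _ in range(160)]     # Gaussian noise, same shape
t = rng.random()                                   # time sampled in [0, 1)
a_t = interpolate(a, z, t)                         # what the network would see
```

The target does not depend on t: along the straight path from z to a the velocity is constant, which is why the objective is cheap to compute.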

Inference: 10 Euler steps

At inference time, the flow transformer runs 10 denoising steps using the Euler method:

  1. Sample noise z ~ N(0, I) with the shape of the action chunk (e.g., 16 timesteps x 10 dims = 160 values)
  2. Set $a_0 = z$
  3. For $t = 0, 0.1, 0.2, \ldots, 0.9$: compute $a_{t+0.1} = a_t + 0.1 \cdot v_\theta(a_t, t)$
  4. After 10 steps, $a_1$ is the denoised action chunk
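
The steps above can be sketched as a generic Euler sampler. Since no trained network is available here, `make_oracle` stands in for the flow transformer: it returns the exact velocity toward a single known target, which is an assumption made only so the loop can be run end to end.

```python
import random


def euler_sample(velocity_fn, z, n_steps=10):
    """Integrate from noise (t = 0) to an action chunk (t = 1) with fixed steps."""
    x = list(z)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h                      # t = 0, 0.1, ..., 0.9 for n_steps = 10
        v = velocity_fn(x, t)          # one forward pass of the model per step
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_oracle(a):
    """Stand-in for a trained model: exact velocity toward one known target a."""
    calls = {"n": 0}

    def velocity(x, t):
        calls["n"] += 1
        return [(ai - xi) / (1.0 - t) for ai, xi in zip(a, x)]

    return velocity, calls


rng = random.Random(0)
target = [rng.uniform(-1.0, 1.0) for _ in range(160)]
noise = [rng.gauss(0.0, 1.0) for _ in range(160)]
oracle, calls = make_oracle(target)
sample = euler_sample(oracle, noise, n_steps=10)
```

With a real model the velocity is only approximate, so the number of steps trades accuracy against the latency of extra forward passes.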

Why Euler and not a higher-order solver like Heun or RK4? Each step requires a full forward pass through the 325M parameter transformer. Higher-order solvers need 2-4 forward passes per step. At 10 Euler steps, that's 10 forward passes total. Heun would need 20. Since the flow transformer is called inside the robot's control loop, every additional forward pass increases latency.

The compute budget: The VLM forward pass (1.2B params) runs once to produce the observation token features. Then the flow transformer (325M params) runs 10 times for denoising. Total compute per control step: 1.2B + 10 × 325M = 4.45B parameter-forward-passes. This is manageable on a single H100 for real-time control.
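
The same accounting can be parameterized to compare solvers. This is only the rough "parameters touched" measure used above, not a FLOP count; `control_step_cost` is a hypothetical helper.

```python
def control_step_cost(vlm_params, head_params, n_steps, evals_per_step):
    """Parameters touched per control step: one VLM pass plus all head passes."""
    head_passes = n_steps * evals_per_step
    return vlm_params + head_passes * head_params, head_passes


VLM = 1_200_000_000
HEAD = 325_000_000

euler_cost, euler_passes = control_step_cost(VLM, HEAD, n_steps=10, evals_per_step=1)
heun_cost, heun_passes = control_step_cost(VLM, HEAD, n_steps=10, evals_per_step=2)
```

Under this measure, switching from Euler to Heun at the same step count raises the per-step cost from 4.45B to 7.7B.
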

[Figure: Flow Matching Denoising. Noise transforms into a smooth action trajectory over 10 Euler steps.]

What is the flow transformer's training objective?

Chapter 6: Data Pipeline

A unified training framework is only as good as its data pipeline. VLA Foundry needs to handle three radically different data types — text, image-text pairs, and robot trajectories — through the same loading infrastructure. And it needs to do this at scale, across hundreds of GPUs, without becoming a bottleneck.

WebDataset for everything

All data is stored as WebDataset shards — tar files containing samples as adjacent files (e.g., 000001.jpg, 000001.json, 000001.actions.npy). Why tar files? Because they're sequential-read friendly. On cloud storage (S3, GCS), random access is expensive (one HTTP request per file). Sequential reads from tar files are ~10x faster and trivially parallelizable across workers.
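
The shard layout can be illustrated with Python's standard `tarfile` module. This sketch does not use the WebDataset library; it only mimics the convention that files sharing a key prefix and sitting next to each other in the tar form one sample. `write_shard` and `read_shard` are hypothetical helpers.

```python
import io
import tarfile


def write_shard(samples):
    """samples: {key: {extension: bytes}} -> bytes of an in-memory tar shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key in sorted(samples):
            for ext, payload in samples[key].items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_shard(shard_bytes):
    """Stream members in order, grouping adjacent files that share a key."""
    current_key, current = None, {}
    # "r|" opens the archive as a forward-only stream: no seeking, no index.
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r|") as tar:
        for member in tar:
            key, ext = member.name.split(".", 1)
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current


samples = {
    "000001": {"jpg": b"image-1", "json": b"{}", "actions.npy": b"actions-1"},
    "000002": {"jpg": b"image-2", "json": b"{}", "actions.npy": b"actions-2"},
}
decoded = list(read_shard(write_shard(samples)))
```

The reader never seeks backwards, which is what makes this layout a good match for object storage where each random access is a separate request.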

Ray-parallel preprocessing

Converting raw robot datasets (often stored as HDF5, ROS bags, or RLDS) into WebDataset shards is a one-time cost. VLA Foundry uses Ray for parallel preprocessing, distributing the conversion across many CPU cores. This handles the heterogeneous formats from different data sources: OXE, DROID, in-house TRI data.

Normalization: the t-digest approach

Robot actions from different datasets have wildly different scales. One dataset might record joint velocities in radians/second; another uses end-effector positions in meters. You can't mix them without normalization.

VLA Foundry uses percentile-based normalization via the t-digest algorithm:

  1. Compute the 1st and 99th percentile of each action dimension across the dataset.
  2. Linearly map the 1st percentile to -1 and the 99th to +1.
  3. Values outside [1st, 99th] are clipped.

Why percentiles instead of min/max or z-score? Because robot datasets have outliers — a single corrupted trajectory with joint velocities of 10,000 rad/s would destroy min/max normalization. Percentiles are robust to outliers by construction.
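
The three normalization steps can be sketched with exact percentiles computed by sorting. This stands in for t-digest, which approximates the same quantiles in a streaming, mergeable way; the function names are illustrative.

```python
import math


def percentile(values, q):
    """Exact q-th percentile with linear interpolation between neighbors."""
    s = sorted(values)
    pos = (q / 100.0) * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return s[lo] * (1.0 - frac) + s[hi] * frac


def fit_normalizer(values, lo_q=1.0, hi_q=99.0):
    """Return the (1st, 99th) percentile bounds for one action dimension."""
    return percentile(values, lo_q), percentile(values, hi_q)


def normalize(x, p_lo, p_hi):
    """Map p_lo to -1 and p_hi to +1 linearly, then clip to [-1, 1]."""
    scaled = 2.0 * (x - p_lo) / (p_hi - p_lo) - 1.0
    return max(-1.0, min(1.0, scaled))


# 100 well-behaved values plus one corrupted reading.
values = [float(i) for i in range(100)] + [10000.0]
p_lo, p_hi = fit_normalizer(values)
mid = normalize(50.0, p_lo, p_hi)
outlier = normalize(10000.0, p_lo, p_hi)
```

With min/max bounds of 0 and 10000, the value 50 would land near -0.99 and almost all of the data would be squeezed into a sliver of the range; with percentile bounds it maps to the middle.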

The t-digest data structure is mergeable: you can compute t-digests for each dataset independently, then merge them for the combined normalization. This enables dataset mixing without recomputing statistics from scratch.

Action representation details

Actions are represented as relative SE(3) transforms with action chunking:

Data mixing for multi-task training: VLA Foundry supports mixing multiple datasets with configurable sampling weights. Each dataset can have its own normalization (computed independently, then merged via t-digest). The dataloader handles heterogeneous action dimensions by padding shorter action vectors.

Multi-modal co-training

During VLA training, VLA Foundry doesn't just train on robot data. It co-trains on a mix of robot trajectories and VLM data (image-text pairs). This prevents catastrophic forgetting of the VLM's visual understanding during action learning. The mixing ratio is a hyperparameter — too much VLM data slows action learning, too little causes the VLM features to degrade.

This is another benefit of the unified codebase: co-training across modalities is trivial because all data formats flow through the same loader. In a fragmented setup, you'd need to bridge two completely separate data pipelines.

Dataset mixing and sampling

When training on multiple robot datasets simultaneously, VLA Foundry supports configurable sampling weights. You might want 60% sim data and 40% real data, or 80% single-task and 20% cross-task. Each dataset's normalization statistics are computed independently and merged via t-digest, so adding a new dataset doesn't require recomputing global statistics.
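
Weighted dataset sampling can be sketched with the standard library. The dataset names and the 60/40 split are the example from the paragraph above; `sample_dataset_names` is a hypothetical helper and says nothing about how the framework implements mixing.

```python
import random


def sample_dataset_names(weights, n, seed=0):
    """Draw n dataset names with probability proportional to their weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


draws = sample_dataset_names({"sim": 0.6, "real": 0.4}, n=10000, seed=0)
sim_fraction = draws.count("sim") / len(draws)
```

Seeding the sampler keeps the mixing order reproducible across runs.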

The dataloader handles heterogeneous action dimensions gracefully: datasets with different numbers of joints are padded to a common dimensionality, with a mask indicating which dimensions are active. This lets you train a single model on data from a 6-DoF arm and a 7-DoF arm without custom engineering per dataset.
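
A sketch of padding with a mask, assuming zero-padding at the end of the vector and a loss that ignores padded dimensions. Both helpers are hypothetical; the lesson does not specify the padding value or how the mask enters the loss.

```python
def pad_action(action, target_dim):
    """Pad an action vector with zeros and return (padded, mask)."""
    if len(action) > target_dim:
        raise ValueError("action has more dimensions than target_dim")
    n_pad = target_dim - len(action)
    padded = list(action) + [0.0] * n_pad
    mask = [1] * len(action) + [0] * n_pad  # 1 = real dimension, 0 = padding
    return padded, mask


def masked_squared_error(pred, target, mask):
    """Squared error summed over active dimensions only."""
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))


six_dof, mask6 = pad_action([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], target_dim=7)
seven_dof, mask7 = pad_action([0.1] * 7, target_dim=7)
# A prediction that is wrong only in the padded slot contributes no loss.
loss = masked_squared_error(six_dof[:6] + [9.9], six_dof, mask6)
```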

Why does VLA Foundry use t-digest percentile normalization instead of min/max?

Chapter 7: Evaluation

How do you rigorously evaluate a VLA? Real-robot evaluation is expensive, noisy, and hard to reproduce. VLA Foundry uses the LBM benchmark — a simulation-based evaluation suite that tests manipulation at scale.

LBM benchmark

LBM (Large Behavior Model) is a benchmark with 49 bimanual manipulation tasks running in the Drake physics simulator. Tasks range from simple pick-and-place to complex multi-step sequences like assembling objects or manipulating deformable items.

Why simulation? Three reasons:

Statistical analysis: STEP

Raw success rates are noisy, even at 200 episodes per task. A model with 55% vs 50% success on a single task is likely within noise. VLA Foundry uses STEP (Statistical Testing for Evaluation of Policies) for rigorous comparison:

Why 200 episodes? With binary outcomes (success/fail), the standard error of a proportion p is sqrt(p(1-p)/n). At n=200 and p=0.5, the SE is ~3.5%, so a single success rate carries about ±7 percentage points (2 SE) of uncertainty. Differences smaller than that are within noise. The paper only claims significant differences when they exceed this threshold.
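
The standard-error arithmetic can be checked directly. `difference_se` is an addition that does not appear in the lesson: it is the textbook standard error for the difference of two independent proportions, included to show how the 55% vs 50% example works out.

```python
import math


def proportion_se(p, n):
    """Standard error of a success rate p estimated from n episodes."""
    return math.sqrt(p * (1.0 - p) / n)


def difference_se(p1, p2, n):
    """Standard error of p1 - p2 for two independent runs of n episodes each."""
    return math.sqrt(proportion_se(p1, n) ** 2 + proportion_se(p2, n) ** 2)


se_single = proportion_se(0.5, 200)
se_diff = difference_se(0.55, 0.50, 200)
z_score = (0.55 - 0.50) / se_diff
```

A 5-point gap on one task is about one standard error of the difference, which is why it is treated as noise.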

What the benchmark tests

The 49 tasks are grouped into categories:

Real vs simulation evaluation

An important limitation: LBM is simulation-only. Performance on simulated tasks doesn't guarantee real-world success. The paper acknowledges this gap directly — models trained only on real data score ~0% on the sim benchmark, and vice versa. The sim benchmark measures sim performance, period.

That said, the benchmark is still valuable for comparing models against each other. If Model A beats Model B on 49 simulated tasks with rigorous statistical testing, it's strong evidence that A is the stronger model in simulation. The absolute numbers may not transfer to real hardware, but the relative ranking plausibly does.

Throughput benchmarks

Beyond task performance, VLA Foundry benchmarks its own training throughput on P5 nodes (8x H100 GPUs per node):

Scale | Strategy | Throughput
8 GPUs (1 node) | DDP | Baseline
64 GPUs (8 nodes) | DDP | ~7.8x (near-linear)
128 GPUs (16 nodes) | DDP | ~15x (near-linear)
8 GPUs | FSDP2 | Slightly slower than DDP

At 1.2B parameters, the model fits comfortably in a single GPU's memory, so FSDP's sharding adds communication overhead without benefit. FSDP2 is included for future scaling to larger models where sharding becomes necessary.

What is the purpose of CLD (Compact Letter Display) in VLA Foundry's evaluation?

Chapter 8: Results

The experiments answer three critical questions: How good are the released models? Does VLM backbone quality matter? When does multi-task training help?

Question 1: Absolute performance

Model | LBM Score | Notes
Prior closed-source LBM | ~45% | Baseline (proprietary)
Foundry-VLA-1.7B-MT-sim | ~45% | From-scratch, on par with LBM
Foundry-Qwen3VLA-2.1B-MT | ~68% | +23 pp over LBM (!)

The Qwen3-backbone model doesn't just match the prior state of the art — it crushes it by 23 percentage points. This is a massive gap, well beyond statistical noise.

Question 2: Does VLM backbone quality matter?

The from-scratch 1.2B model (trained on 1T text tokens + 200M image-text pairs) matches but doesn't exceed the prior baseline. The Qwen3-VL 2B model (pretrained on vastly more data by the Qwen team) dramatically outperforms it. Same action head, same robot data, same training loop. The only variable is how good the VLM backbone is.

This is the paper's main finding: VLM backbone quality is the dominant factor in VLA performance. Investing in better vision-language pretraining pays off more than collecting more robot data or improving the action head architecture.

Question 3: Multi-task vs single-task

Multi-task (MT) training trains on all 49 tasks simultaneously. Single-task (ST) trains a separate model per task. The results are nuanced:

This is a critical practical insight: multi-task training only helps when the backbone is strong enough. For weaker models, specialize.

Data domain ablations

Training Data | Sim Benchmark Result
Real-only | ~0% (complete failure)
Sim-only | Best sim performance
Sim + real mixed | Worse than sim-only on sim benchmark

Real-only training transfers zero to simulation — the distribution shift between real and simulated visuals/dynamics is too large. And mixing real data into sim training actually hurts sim performance, because the model must accommodate two visual distributions instead of one.

The backbone quality gradient

Looking across all results, a clear hierarchy emerges:

  1. Weakest: From-scratch 1.2B backbone + multi-task training. The backbone can't absorb task diversity.
  2. Middle: From-scratch 1.2B backbone + single-task training. Specialization compensates for backbone weakness.
  3. Strong: Qwen3-VL 2B backbone + single-task training. The strong backbone helps even without multi-task exposure.
  4. Strongest: Qwen3-VL 2B backbone + multi-task + fine-tuning. The strong backbone absorbs diversity AND benefits from shared representations.

This gradient is the paper's most actionable finding. If you're building a VLA and choosing where to invest compute: upgrade your VLM backbone first. Only then invest in more robot data or fancier action heads.

[Figure: Results Comparison. LBM benchmark scores. Higher is better.]

When does multi-task training outperform single-task training in VLA Foundry?

Chapter 9: Connections

Cheat sheet

Aspect | VLA Foundry
Core idea | Unified LLM → VLM → VLA training in one codebase
Models released | Foundry-VLA-1.7B + Foundry-Qwen3VLA-2.1B
Action head | 325M flow transformer (10 Euler steps)
Key mechanism | Observation token + multi-layer hidden state extraction
Action repr. | SE(3), 6D rotation, relative, chunked
Normalization | t-digest percentile, mergeable across datasets
Scale | Up to 128 GPUs (P5 nodes, 8xH100), near-linear DDP
Best result | Qwen3VLA outperforms LBM by +23 pp
Main finding | Stronger VLM backbone → stronger VLA
Config | YAML + Draccus frozen dataclasses

Related VLAs

Related lessons: pi-0 — flow matching for robot actions • pi-0.5 — scaling VLAs • Diffusion Policy — diffusion for robot control • UMI — universal manipulation interface • DiT — diffusion transformer architecture

What VLA Foundry gets right

How it compares

Framework | LLM Stage | VLM Stage | VLA Stage | Unified?
OpenVLA | External | External | Yes | No (VLA only)
pi-0 | External | External | Yes | No (VLA only)
Octo | External | Partial | Yes | Partial
VLA Foundry | Yes | Yes | Yes | Full pipeline

Open questions

What is VLA Foundry's observation token most similar to, conceptually?