VLA Foundry — Veanors

Chapter 0: The Problem

You want to train a Vision-Language-Action model — a robot that sees images, reads instructions, and outputs motor commands. Where do you start?

Today, most open-source VLA pipelines look like this: grab a pretrained LLM from one repo. Fine-tune it into a VLM using a second codebase with its own data format. Then attach an action head using a third codebase with yet another training loop. Each stage has different configs, different data loaders, different distributed training setups. When your VLA underperforms, you have no way to know whether the problem is in the LLM pretraining data, the VLM alignment stage, or the action head. The stages are black boxes to each other.

This isn't just inconvenient — it's scientifically crippling. The paper's key empirical finding is that decisions made during LLM and VLM pretraining directly affect downstream robot performance. If you can't control and ablate the full pipeline, you can't discover this. You're doing science with half the variables hidden.

The fragmentation tax

Consider the concrete costs of the current fragmented approach:

Reproducibility: When a VLA paper reports results, which pretraining checkpoint did they use? Which data mix? Which tokenizer version? These details are spread across 3+ repos and rarely documented together.
Ablation: Want to test whether more language pretraining data helps robot manipulation? You'd need to retrain the LLM (repo 1), then the VLM (repo 2), then the VLA (repo 3). Each with its own setup. Most researchers give up and just use whatever pretrained checkpoint is available.
Scale: Each codebase has different distributed training support. The LLM code might use DeepSpeed, the VLM code might use FSDP1, the VLA code might be single-GPU only. Scaling the full pipeline is a systems engineering nightmare.

The root cause: VLA training is not just the action learning stage. It's the full LLM → VLM → VLA pipeline. But no single codebase controls all three stages, so the most important ablations — the ones that span stages — are never run.

A concrete example of the problem

Say you're a researcher at a robotics lab. You want to test whether a VLM trained on more diverse image data makes a better robot. Here's your workflow today:

Find an LLM pretraining repo (say, LLaMA's). Pretrain a 1B model. Takes 2 days.
Find a VLM training repo (say, LLaVA's). Convert your LLM checkpoint to their format. Discover their tokenizer is different. Spend a day debugging. Train the VLM on your image data. Takes 1 day.
Find a VLA training repo (say, OpenVLA's). Convert your VLM checkpoint again. Their data loader expects RLDS format; yours is HDF5. Write a converter. Their distributed setup is DDP; you needed FSDP for the VLM. Adapt. Train the VLA. Takes 1 day.
Total: 5 days + 3 format conversions + 2 debugging sessions, just to test ONE ablation.

Now repeat for each data mix you want to test. Multiply by the number of ablations in a paper. You can see why most papers just download a pretrained VLM and skip the upstream ablation entirely.

What VLA Foundry changes

VLA Foundry is an open-source framework by Toyota Research Institute that unifies all three training stages in a single codebase. Same config system, same data loaders, same training loop, same distributed training infrastructure. One YAML file controls everything from the number of LLM pretraining tokens to the action chunk size in the VLA stage.

This means you can finally answer questions like: "Does training my LLM on 1T tokens vs 500B tokens affect robot success rate?" by changing a single config value and running the full pipeline end to end. No format conversions. No codebase switching. One command.

Why do most open-source VLA efforts struggle to ablate across training stages?

The hardware requirements are too large
Each stage (LLM, VLM, VLA) uses a different codebase with incompatible configs, data formats, and training loops
The math behind VLAs is too complex

Chapter 1: The Key Insight

VLA Foundry's core thesis is deceptively simple: the pipeline is the product.

Most robotics labs treat VLA training as "take a VLM, attach an action head." The upstream stages (LLM pretraining, VLM alignment) are someone else's problem — you just download a checkpoint and move on. VLA Foundry argues this is wrong. The quality of every upstream decision compounds into the final robot's performance.

Evidence: VLM backbone quality determines VLA quality

The paper's most striking result demonstrates this directly. They train two VLAs:

Foundry-VLA-1.7B: trained from scratch through the full LLM → VLM → VLA pipeline using their own 1.2B parameter LLM backbone.
Foundry-Qwen3VLA-2.1B: uses the same action head but swaps in a pretrained Qwen3-VL 2B as the VLM backbone — a model that saw vastly more vision-language data during its own pretraining.

The result? The Qwen3 backbone model outperforms the from-scratch model by a massive margin (roughly 68% vs 45% on the LBM benchmark, as reported in Chapter 8). Same action head, same training data, same training loop. The only difference is how much the VLM backbone already understood about the visual world before robot training began.

The implication is profound: If you want a better robot, don't just collect more robot data. Train a better VLM. And if you want a better VLM, train a better LLM. The pipeline flows downhill — improvements at any stage lift everything below it.

Why unification enables this discovery

This finding is only possible because VLA Foundry controls the full pipeline. With fragmented codebases, you can't hold everything constant except the VLM backbone and compare. You'd have different data preprocessing, different tokenizers, different optimizer states, different random seeds. The signal would drown in confounders.

VLA Foundry's design makes the comparison clean: swap a single config block (the model definition), keep everything else identical. The framework ensures that "everything else" really means everything — same data loading order, same normalization, same distributed strategy.

The full-stack mental model

Think of VLA training as a compiler pipeline. The LLM stage compiles text understanding. The VLM stage compiles visual grounding on top of that. The VLA stage compiles action generation on top of both. If your "linker" (VLM alignment) is buggy, it doesn't matter how good your "parser" (LLM) was. VLA Foundry is the integrated compiler that controls every stage.

This analogy extends further. Just as modern compilers perform whole-program optimization (optimizations that span compilation stages), VLA Foundry enables whole-pipeline optimization. You can adjust VLM training to produce features that are specifically useful for the downstream action head. In a fragmented setup, each stage optimizes in isolation, potentially making locally optimal but globally suboptimal choices.

What the unified pipeline costs

Unification isn't free. The full LLM → VLM → VLA pipeline takes significantly more compute than just training the VLA stage on a pretrained backbone. The from-scratch Foundry-VLA-1.7B required:

LLM pretraining: 1T tokens on multi-GPU cluster (the largest compute cost)
VLM alignment: 200M image-text pairs
VLA training: Robot demonstration data

For most practitioners, starting from a pretrained backbone (like Qwen3-VL) and only training the VLA stage is far more practical. The from-scratch pipeline is for research — understanding which upstream decisions matter. The practical deployment path is: pick the best available VLM, attach VLA Foundry's action head.

What is VLA Foundry's key empirical finding about VLA training?

More robot demonstration data always improves VLA performance
A stronger VLM backbone leads to a stronger VLA, even with the same action head and robot data
Discrete action tokens outperform continuous ones at scale

Chapter 2: Framework Architecture

Building a unified training framework sounds straightforward until you actually try it. You need to support wildly different data modalities (text, images, robot trajectories), different model architectures (decoder-only LLMs, encoder-decoder VLMs, diffusion action heads), and different training objectives (next-token prediction, contrastive learning, flow matching) — all without the codebase becoming an unmaintainable monolith.

VLA Foundry solves this with four design principles:

Principle 1: Modularity and Composability

Every component — model, dataset, optimizer, scheduler — is defined as a frozen dataclass using the Draccus library. These dataclasses are composed via YAML config files. Want to swap ViT-B for ViT-L? Change one line in the YAML. Want to mix two datasets at a 70/30 ratio? Specify it in config. No code changes required.

The key design choice is "frozen" dataclasses — once created, configs can't be mutated at runtime. This eliminates an entire class of bugs where training behavior changes depending on execution order.
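As a minimal sketch of what "frozen" buys you, the example below uses only Python's standard dataclasses module. The Draccus YAML parsing is not shown, and the class and field names (ModelConfig, TrainConfig, vision_encoder, and so on) are hypothetical, not VLA Foundry's actual schema.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace

# Hypothetical config fields, for illustration only; not VLA Foundry's schema.
@dataclass(frozen=True)
class ModelConfig:
    vision_encoder: str = "vit-b"
    hidden_dim: int = 2048

@dataclass(frozen=True)
class TrainConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    seed: int = 0

cfg = TrainConfig()

# Mutating a frozen config raises instead of silently changing behavior.
try:
    cfg.seed = 1
    mutated = True
except FrozenInstanceError:
    mutated = False

# A variant is a new object; the original config stays untouched.
ablation = replace(cfg, model=replace(cfg.model, vision_encoder="vit-l"))
```

An ablation is then an explicit new config object rather than an in-place edit, which keeps the baseline and the variant easy to diff.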

Principle 2: Hackability

The training loop is deliberately thin: a simple for batch in dataloader: loss = model(batch); loss.backward(); optimizer.step() loop. No HuggingFace Trainer, no PyTorch Lightning, no heavy abstractions. If you need to add gradient accumulation or a custom logging hook, you edit ~10 lines of Python instead of navigating a 5-level callback hierarchy.

This is a conscious design philosophy. Heavy training abstractions (HF Trainer, Lightning) are great for standard recipes but become walls when you need non-standard behavior — like co-training on text and robot data with different loss weights, or switching from next-token prediction to flow matching mid-pipeline. VLA Foundry optimizes for the researcher who needs to break the rules, not the engineer following a recipe.
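To make "thin loop" concrete, here is a framework-free sketch of the loop shape with gradient accumulation added. It is not VLA Foundry's training loop: a one-parameter model with a hand-written gradient stands in for the network and autograd, and every name (loss_and_grad, train, accum_steps) is an assumption made for illustration.

```python
# Toy stand-ins: a one-parameter model y = w * x with a hand-written gradient.
# This only shows the shape of a thin loop with gradient accumulation.

def loss_and_grad(w, batch):
    """Mean squared error of y = w * x over a batch, and its gradient w.r.t. w."""
    n = len(batch)
    loss = sum((w * x - y) ** 2 for x, y in batch) / n
    grad = sum(2 * (w * x - y) * x for x, y in batch) / n
    return loss, grad

def train(batches, w=0.0, lr=0.05, accum_steps=2, epochs=200):
    for _ in range(epochs):
        grad_sum = 0.0
        for step, batch in enumerate(batches, start=1):
            _, grad = loss_and_grad(w, batch)
            grad_sum += grad / accum_steps      # accumulate scaled gradients
            if step % accum_steps == 0:         # update every accum_steps batches
                w -= lr * grad_sum
                grad_sum = 0.0                  # equivalent of zeroing gradients
    return w

# Data generated from y = 3x, split into four micro-batches.
batches = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)],
           [(1.0, 3.0), (3.0, 9.0)], [(2.0, 6.0), (0.5, 1.5)]]
w_final = train(batches)
```

The accumulation logic is three lines inside the inner loop, which is the kind of edit a thin loop is meant to make cheap.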

Principle 3: Performance

The framework is built on PyTorch's FSDP2 (Fully Sharded Data Parallel, second generation) and scales to 128 GPUs on P5 nodes (8x H100 each), with near-linear scaling verified up to that count using DDP. At the 1.2B parameter scale the authors use, plain DDP outperforms FSDP: sharding overhead isn't worth it for models that fit in GPU memory. But FSDP2 is there for when you scale up.

Principle 4: Reproducibility

Deterministic seeding controls all random number generators (Python, NumPy, PyTorch, CUDA). Dataloader state checkpointing means you can resume training from any checkpoint and get bit-identical results. This isn't just good practice — it's essential for ablation studies where you need to isolate the effect of a single variable.

Dataloader state checkpointing is the unsung hero here. Most training frameworks checkpoint model weights and optimizer state, but forget the dataloader. When you resume, data is reshuffled differently, meaning the model sees a different sequence of examples in the second half of training. For ablations, this introduces a confound: was the performance difference due to your config change, or due to different data ordering? VLA Foundry eliminates this by checkpointing the exact data cursor position, shuffle state, and worker state.
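The idea can be sketched with a small resumable sampler. This is an illustration under simplifying assumptions (single process, no worker state), and the class and method names are hypothetical rather than VLA Foundry's API. The point is that the RNG state, the current shuffle order, and the cursor are all saved together.

```python
import random

class ResumableSampler:
    """Shuffles indices each epoch from a seeded RNG and exposes its full state."""

    def __init__(self, num_samples, seed):
        self.num_samples = num_samples
        self.rng = random.Random(seed)
        self.order = []
        self.cursor = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.cursor >= len(self.order):      # start a new epoch
            self.order = list(range(self.num_samples))
            self.rng.shuffle(self.order)
            self.cursor = 0
        idx = self.order[self.cursor]
        self.cursor += 1
        return idx

    def state_dict(self):
        return {"rng": self.rng.getstate(), "order": list(self.order), "cursor": self.cursor}

    def load_state_dict(self, state):
        self.rng.setstate(state["rng"])
        self.order = list(state["order"])
        self.cursor = state["cursor"]

# Uninterrupted run: 25 draws from a 10-sample dataset (spans several epochs).
full = ResumableSampler(10, seed=0)
uninterrupted = [next(full) for _ in range(25)]

# Interrupted run: checkpoint after 7 draws, then resume in a fresh object.
first = ResumableSampler(10, seed=0)
head = [next(first) for _ in range(7)]
ckpt = first.state_dict()

resumed = ResumableSampler(10, seed=123)        # seed is overwritten by the checkpoint
resumed.load_state_dict(ckpt)
tail = [next(resumed) for _ in range(18)]
```

If only the seed were saved, the resumed run would restart the shuffle from the beginning of an epoch and the two sequences would diverge.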

The 4 architectural layers: (1) YAML config system → (2) Registry for pluggable models and datasets → (3) Modality-specific dataloading → (4) Model-agnostic training loop. Each layer only talks to its neighbors. The training loop doesn't know whether it's training an LLM or a VLA.

Framework Layers

The four layers of VLA Foundry. Config flows down; gradients flow up.

Why does VLA Foundry use frozen dataclasses for configuration?

To improve GPU memory efficiency
To prevent configs from being mutated at runtime, eliminating execution-order bugs and ensuring reproducibility
To make the codebase compatible with JAX

Chapter 3: LLM → VLM: Building the Backbone

To prove that VLA Foundry can control the entire pipeline, the authors train a model from absolute scratch — starting with raw text data and ending with a robot that manipulates objects.

Stage 1: LLM pretraining

The LLM is a 1.2 billion parameter transformer decoder:

Parameter	Value
Hidden dimension	2048
Layers	24
Attention heads	16
Total params	1.2B
Training tokens	1 trillion (DCLM dataset)

Nothing exotic here — it's a standard autoregressive language model. The point isn't architectural novelty. The point is that this LLM was trained inside VLA Foundry, so every training decision is traceable and ablatable.
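As a sanity check on the table, a common back-of-the-envelope estimate reproduces the 1.2B figure. The sketch assumes each transformer block holds about 12·d² weights (4·d² for attention, 8·d² for an MLP with 4x expansion) and ignores embeddings, biases, and norms. The source does not state these architectural details, so treat this as an approximation.

```python
# Rough parameter count for a decoder-only transformer.
# Assumed: 4*d^2 attention weights + 8*d^2 MLP weights per block (4x expansion),
# ignoring embeddings, biases, and normalization layers.
hidden_dim = 2048
num_layers = 24
num_heads = 16

head_dim = hidden_dim // num_heads              # 128 dimensions per attention head
params_per_block = 12 * hidden_dim ** 2
total_block_params = num_layers * params_per_block

print(f"{total_block_params / 1e9:.2f}B")       # prints 1.21B
```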

Stage 2: VLM alignment

To turn the LLM into a VLM, two components are added:

Vision encoder: An 86M parameter ViT (CLIP-style), processing 224x224 images. This produces a grid of patch tokens — 14x14 = 196 tokens per image at the standard patch size.
Pixel-shuffle pooling: 196 tokens per image is expensive in the LLM's context window, especially when you'll later need 8 images for VLA. Pixel-shuffle merges adjacent patch tokens spatially, reducing 196 tokens down to 64 tokens per image. This is roughly a 3x compression (196/64 ≈ 3.1) with minimal information loss, because neighboring patches are highly correlated.

The VLM is trained on 200M image-text pairs from the DataComp-DR-1B dataset. Training objective: next-token prediction on text, conditioned on image tokens. Standard VLM recipe.

Why pixel-shuffle and not something else?

Alternatives for reducing visual token count include:

Average pooling: Destroys spatial information. A pooled 2x2 region can't distinguish "cat on left" from "cat on right."
Perceiver resampler: Learnable, but adds parameters and training complexity. Also non-deterministic at varying resolutions.
Pixel-shuffle: Rearranges a C×H×W feature map into a (C×r²)×(H/r)×(W/r) map. Lossless in information, just changes the spatial-channel trade-off. Then a linear projection reduces the channel dimension back.

Pixel-shuffle is the pragmatic choice: the rearrangement itself loses no information, adds no learnable parameters, and is deterministic. Only the linear projection that follows it is learned. It trades spatial resolution for channel depth, and the LLM's attention layers can recover spatial relationships from the richer channel features.
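The rearrangement step can be sketched in plain Python on a grid of token vectors. This is an illustration, not the framework's implementation, and the learned linear projection is omitted. The grid size (16 x 16 with r = 2, giving 64 tokens) is chosen only so the output matches the 64-token figure; the source does not specify how it gets from 196 to 64, and a 14 x 14 grid with r = 2 would give 49.

```python
def space_to_depth(grid, r):
    """Merge each r x r block of neighboring tokens into one token by concatenation.

    grid: H x W nested list of token vectors (each a list of length C).
    Returns an (H/r) x (W/r) grid of vectors of length C * r * r.
    """
    height, width = len(grid), len(grid[0])
    assert height % r == 0 and width % r == 0, "grid must be divisible by r"
    merged = []
    for i in range(0, height, r):
        row = []
        for j in range(0, width, r):
            token = []
            for di in range(r):
                for dj in range(r):
                    token.extend(grid[i + di][j + dj])
            row.append(token)
        merged.append(row)
    return merged

# Illustrative sizes: a 16 x 16 grid of 4-dim tokens, merged with r = 2.
H = W = 16
C = 4
grid = [[[float(i * W + j)] * C for j in range(W)] for i in range(H)]
out = space_to_depth(grid, r=2)

num_tokens_in = H * W
num_tokens_out = len(out) * len(out[0])
```

Every input value appears exactly once in the output, which is the sense in which the rearrangement is lossless: fewer tokens, each with r² times as many channels.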

The token budget: With pixel-shuffle, each image costs 64 tokens. In the VLA stage, 8 images cost 512 tokens. Add ~30 tokens for the task instruction, 1 for the observation token, and the total context is ~543 tokens — easily manageable for a 2048-context LLM.

LLM → VLM → VLA Pipeline

Three training stages with throughput numbers.

Why does VLA Foundry use pixel-shuffle pooling for visual tokens?

It losslessly compresses 196 visual tokens to 64 per image, keeping the LLM context length manageable for multi-image VLA input
It improves image resolution beyond 224x224
It replaces the need for a ViT encoder entirely

Chapter 4: The Observation Token

This is the architectural idea that makes VLA Foundry tick. How do you connect a language model to a robot's motor system? You can't just take the last hidden state — that's optimized for predicting the next text token. You can't average all hidden states — that smears spatial and temporal information into mush. You need a dedicated extraction point that forces the VLM to compress everything the robot needs into one place.

The observation token

VLA Foundry adds a single new token to the vocabulary: <obs>. During VLA training, this token is appended to the end of the input sequence, after all image tokens and the task instruction. The VLM processes the full sequence — images, text, observation token — and at the observation token position, the hidden states encode a compressed representation of "what I see and what I need to do."

But here's the critical detail: VLA Foundry doesn't just take the hidden state from the last transformer layer. It takes the hidden states from the last 4 layers, concatenates them, and feeds this multi-scale representation to the action head.

Why last 4 layers, not just the last?

Different transformer layers encode different abstractions:

Earlier layers (layers 21-22 in a 24-layer model): encode spatial features — where objects are, their shapes, relative positions. Critical for precise manipulation.
Later layers (layers 23-24): encode semantic features — what the task is, which object to interact with. Critical for task understanding.

Taking only the last layer throws away the spatial precision from earlier layers. Taking all layers is wasteful — the first 10 layers mostly encode low-level token features irrelevant to actions. The last 4 layers hit the sweet spot: enough spatial detail for precise control, enough semantic abstraction for task understanding.

Complete data flow

Let's trace the exact tensor shapes through the architecture for 8-image input:

8 camera images (2 wrist + 2 external, each at 2 timesteps) → ViT encoder → 8 × 196 patch tokens
Pixel-shuffle → 8 × 64 = 512 visual tokens, each of dim 2048
Task text → ~30 language tokens
Observation token → 1 token (new vocab item)
VLM forward pass through all 24 layers → hidden states at obs token position from layers 21, 22, 23, 24: 4 × 2048 = 8192-dim vector
This 8192-dim vector → input to the 325M flow transformer action head
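The shape bookkeeping in the steps above can be checked with a few lines of Python. The hidden states here are fake placeholders, and a small hidden size stands in for 2048 to keep the example light. Variable names are assumptions made for illustration.

```python
# Token budget from the numbers in the list above.
num_images = 8
tokens_per_image = 64
text_tokens = 30            # approximate, per the text
obs_tokens = 1
seq_len = num_images * tokens_per_image + text_tokens + obs_tokens

# Fake per-layer hidden states: hidden[layer][position] is a vector of size d.
# A small d keeps the example fast; the model in the text uses d = 2048.
num_layers, d = 24, 8
hidden = [[[float(layer)] * d for _ in range(seq_len)]
          for layer in range(1, num_layers + 1)]

obs_position = seq_len - 1          # <obs> is appended at the end of the sequence
last_four = hidden[-4:]             # layers 21, 22, 23, 24
obs_features = [x for layer_states in last_four for x in layer_states[obs_position]]

feature_dim_at_2048 = 4 * 2048      # size of the concatenated vector in the real model
```

Note that the feature size depends only on the number of layers taken and the hidden size, not on seq_len.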

The observation token is a bottleneck by design. It forces the VLM to compress all task-relevant information (from 512 image tokens + 30 text tokens) into a single position. This is like the [CLS] token in BERT, but for robot actions. The multi-layer extraction ensures the bottleneck doesn't lose spatial precision.

Architecture Data Flow

Architecture diagram comparing token counts for 1-image and 8-image input.

Why not cross-attention?

pi-0 uses cross-attention: dedicated action tokens attend to VLM tokens via separate attention heads. VLA Foundry's observation token approach is simpler — it uses the VLM's own self-attention to compress information, then hands off a fixed-size vector. This means the action head doesn't need to know the VLM's sequence length. You can change the number of images, the text length, even the VLM architecture — and the action head's input is always the same 8192-dim vector.

This decoupling is what makes VLA Foundry's "swap the backbone" experiments possible. The action head is backbone-agnostic by construction.

The camera setup: 8 images explained

The 8 images come from 4 cameras at 2 timesteps:

2 wrist cameras: Mounted on each robot arm's end-effector. These provide close-up views for precise grasping — seeing the exact finger-object contact.
2 external cameras: Fixed cameras providing workspace overview. These give spatial context — where objects are relative to each other and the robot.
× 2 timesteps: Current frame and one recent historical frame. This gives the model temporal information — the direction of motion, whether an object is being moved.

With pixel-shuffle at 64 tokens/image, 8 images cost 512 tokens. This is the dominant cost in the VLM's context window. Reducing to 4 images (single timestep) halves the visual tokens to 256 but loses temporal context. The 2-timestep design is a deliberate trade-off: minimal temporal information at acceptable token cost.

What does the observation token do in VLA Foundry's architecture?

It encodes the robot's proprioceptive state as a special token
It provides a fixed extraction point where the VLM compresses all visual and textual information into a single position, whose multi-layer hidden states feed the action head
It replaces the CLS token in the ViT encoder

Chapter 5: Flow Transformer Action Head

The observation token gives us a rich 8192-dimensional vector encoding what the robot sees and what it should do. Now we need to convert that into actual motor commands. This is the job of the flow transformer — a 325 million parameter model that generates continuous action trajectories via flow matching.

Why flow matching?

Robot actions are inherently multimodal. Given a cup on a table, there are multiple valid grasps: from the top, from the side, by the handle. A regression head that predicts a single action would average these modes, producing a grasp that aims at the center of the cup — between all valid grasps and belonging to none of them.

Flow matching solves this by learning a vector field that transports random noise to the distribution of valid actions. At inference, you sample noise and follow the field for a few steps. Different noise samples land at different modes of the action distribution. No averaging.

The flow transformer's inputs

The flow transformer is a standard transformer that takes three types of input tokens:

VLM features: The 8192-dim concatenated observation token features, projected to the flow transformer's hidden dimension.
Proprioception: Current joint angles + gripper state, projected via a linear layer to a single token.
Noised action sequence: A chunk of future actions (e.g., 16 timesteps × 7 DoF), each timestep projected to a token via a linear layer. At training time, these are ground-truth actions with Gaussian noise added. At inference, they start as pure noise.

All three types are concatenated into a single sequence and processed by the transformer. The output at the action token positions is the predicted velocity field — the direction to move from the noised action toward the clean action.

Action representation: SE(3) with 6D rotation

Each action is an SE(3) pose: 3D position + 3D rotation. But rotations are tricky. Euler angles have gimbal lock. Quaternions have antipodal equivalence (q and -q represent the same rotation — but their midpoint q=0 represents nothing). VLA Foundry uses the 6D continuous rotation representation from Zhou et al. (2019): take the first two columns of the 3x3 rotation matrix. The third column can be recovered via cross product. This representation is:

Continuous: Small changes in the 6 numbers produce small changes in the rotation. No discontinuities at gimbal lock angles.
Unique: Every rotation maps to exactly one 6D vector (unlike quaternions where q and -q are equivalent).
Network-friendly: The regression target is a smooth function of the rotation, so L2 loss works correctly.

Final action vector per timestep: 3 (position) + 6 (rotation) + 1 (gripper) = 10 dimensions. With a chunk size of 16 timesteps, the flow transformer outputs 160 continuous values per inference step.
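A small sketch of the 6D rotation round trip, in plain Python. It follows the construction from Zhou et al. (2019) as described above: keep two columns, then rebuild an orthonormal frame with Gram-Schmidt and a cross product. Function names and the example action values are assumptions for illustration.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def rotmat_to_6d(R):
    """Take the first two columns of a 3x3 rotation matrix (row-major nested list)."""
    col0 = [R[i][0] for i in range(3)]
    col1 = [R[i][1] for i in range(3)]
    return col0 + col1

def sixd_to_rotmat(v):
    """Gram-Schmidt the two 3-vectors, then recover the third column by cross product."""
    a1, a2 = v[:3], v[3:]
    b1 = normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

# Rotation of 30 degrees about the z axis.
theta = math.radians(30)
R = [[math.cos(theta), -math.sin(theta), 0.0],
     [math.sin(theta), math.cos(theta), 0.0],
     [0.0, 0.0, 1.0]]

six = rotmat_to_6d(R)
R_back = sixd_to_rotmat(six)

# Per-timestep action: 3 position + 6 rotation + 1 gripper (dummy values).
action = [0.1, 0.0, 0.3] + six + [1.0]
```

The Gram-Schmidt step matters at inference time: a network's raw 6 outputs are not exactly orthonormal, and this projection turns them into a valid rotation.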

Training: flow matching objective

Given a ground-truth action chunk a and Gaussian noise z, the noised sample at time t is:

a_t = (1 - t) · z + t · a

The flow transformer predicts the velocity field v_θ(a_t, t), and the loss is:

L = ||v_θ(a_t, t) - (a - z)||²

This is simply: "predict the direction from noise to clean action." Elegant and fast to compute.
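Here is a minimal sketch of how one training example is built from the two formulas above, in plain Python. The network is left out, so the loss is evaluated on placeholder predictions. Sampling t uniformly is an assumption, since the text does not say how t is drawn.

```python
import random

def flow_matching_example(action, rng):
    """Build one training example: noised input a_t, time t, and target velocity a - z."""
    z = [rng.gauss(0.0, 1.0) for _ in action]   # Gaussian noise, same shape as the chunk
    t = rng.random()                            # assumed: t drawn uniformly from [0, 1)
    a_t = [(1 - t) * zi + t * ai for zi, ai in zip(z, action)]
    target = [ai - zi for zi, ai in zip(z, action)]
    return a_t, t, z, target

def flow_matching_loss(pred_velocity, target):
    """Squared L2 distance between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred_velocity, target))

rng = random.Random(0)
action_chunk = [0.5] * 160                      # flattened 16 x 10 chunk, dummy values
a_t, t, z, target = flow_matching_example(action_chunk, rng)

perfect_loss = flow_matching_loss(target, target)           # a perfect prediction gives 0
zero_pred_loss = flow_matching_loss([0.0] * 160, target)    # predicting zero velocity
```

The target a - z is constant along the path from z to a, so moving from a_t with that velocity for the remaining time (1 - t) lands exactly on the clean action.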

Inference: 10 Euler steps

At inference time, the flow transformer runs 10 denoising steps using the Euler method:

Sample noise z ~ N(0, I) with the shape of the action chunk (e.g., 16 timesteps x 10 dims = 160 values)
Set a₀ = z
For t = 0, 0.1, 0.2, ..., 0.9: compute a_(t+0.1) = a_t + 0.1 · v_θ(a_t, t)
After 10 steps, a_1.0 is the denoised action chunk

Why Euler and not a higher-order solver like Heun or RK4? Each step requires a full forward pass through the 325M parameter transformer. Higher-order solvers need 2-4 forward passes per step. At 10 Euler steps, that's 10 forward passes total. Heun would need 20. Since the flow transformer is called inside the robot's control loop, every additional forward pass increases latency.
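The sampling procedure can be sketched as a fixed-step Euler integrator. The velocity function below is a toy stand-in for the learned network: for a single target action, the straight-line path has the closed-form velocity (a - a_t) / (1 - t), which lets the example be checked exactly. All names are assumptions for illustration.

```python
import random

def euler_sample(velocity_fn, z, num_steps=10):
    """Integrate da/dt = v(a, t) from t = 0 to t = 1 with fixed-step Euler."""
    a = list(z)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(a, t)                   # one forward pass per step
        a = [ai + dt * vi for ai, vi in zip(a, v)]
    return a

# Toy stand-in for the learned network: with a single target action, the path
# a_t = (1 - t) z + t a has velocity (a - a_t) / (1 - t).
target_action = [0.2, -0.4, 0.9]
calls = {"n": 0}

def toy_velocity(a_t, t):
    calls["n"] += 1
    return [(ai - xi) / (1.0 - t) for ai, xi in zip(target_action, a_t)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target_action]
sample = euler_sample(toy_velocity, noise)
```

The call counter makes the latency argument visible: 10 Euler steps means exactly 10 evaluations of the velocity function, where a two-stage solver would need 20.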

The compute budget: The VLM forward pass (1.2B params) runs once to produce the observation token features. Then the flow transformer (325M params) runs 10 times for denoising. Total compute per control step: 1.2B + 10 × 325M = 4.45B parameter-forward-passes. This is manageable on a single H100 for real-time control.

Flow Matching Denoising

Noise is transformed into a smooth action trajectory over 10 Euler steps.

What is the flow transformer's training objective?

Maximize the likelihood of discrete action tokens
Predict the velocity field that transports noise toward clean actions (flow matching loss)
Minimize the L2 distance between predicted and ground-truth actions directly

Chapter 6: Data Pipeline

A unified training framework is only as good as its data pipeline. VLA Foundry needs to handle three radically different data types — text, image-text pairs, and robot trajectories — through the same loading infrastructure. And it needs to do this at scale, across hundreds of GPUs, without becoming a bottleneck.

WebDataset for everything

All data is stored as WebDataset shards — tar files containing samples as adjacent files (e.g., 000001.jpg, 000001.json, 000001.actions.npy). Why tar files? Because they're sequential-read friendly. On cloud storage (S3, GCS), random access is expensive (one HTTP request per file). Sequential reads from tar files are ~10x faster and trivially parallelizable across workers.
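To show the shard layout, the sketch below writes and reads a tiny tar using only Python's standard tarfile module. It does not use the webdataset library itself, and the payloads are fake bytes. The convention illustrated is that adjacent files sharing the part of the name before the first dot form one sample.

```python
import io
import json
import tarfile

def write_shard(samples):
    """Write samples to an in-memory tar; files of one sample share a key prefix."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, files in samples:
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf

def read_shard(buf):
    """Stream the tar sequentially, grouping adjacent files by key into samples."""
    current_key, current = None, {}
    with tarfile.open(fileobj=buf, mode="r|") as tar:   # "r|" is sequential stream mode
        for member in tar:
            key, ext = member.name.split(".", 1)        # split at the first dot
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current

samples = [
    ("000001", {"jpg": b"fake-image-1", "json": json.dumps({"task": "pick"}).encode(),
                "actions.npy": b"fake-actions-1"}),
    ("000002", {"jpg": b"fake-image-2", "json": json.dumps({"task": "place"}).encode(),
                "actions.npy": b"fake-actions-2"}),
]
loaded = list(read_shard(write_shard(samples)))
```

The reader never seeks backward, which is why this layout suits object storage: a worker can consume a shard as a single streamed download.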

Ray-parallel preprocessing

Converting raw robot datasets (often stored as HDF5, ROS bags, or RLDS) into WebDataset shards is a one-time cost. VLA Foundry uses Ray for parallel preprocessing, distributing the conversion across many CPU cores. This handles the heterogeneous formats from different data sources: OXE, DROID, in-house TRI data.

Normalization: the t-digest approach

Robot actions from different datasets have wildly different scales. One dataset might record joint velocities in radians/second; another uses end-effector positions in meters. You can't mix them without normalization.

VLA Foundry uses percentile-based normalization via the t-digest algorithm:

Compute the 1st and 99th percentile of each action dimension across the dataset.
Linearly map the 1st percentile to -1 and the 99th to +1.
Values outside [1st, 99th] are clipped.

Why percentiles instead of min/max or z-score? Because robot datasets have outliers — a single corrupted trajectory with joint velocities of 10,000 rad/s would destroy min/max normalization. Percentiles are robust to outliers by construction.
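The three normalization steps can be sketched with exact percentiles from Python's statistics module. This swaps in exact quantiles for the t-digest (which approximates the same percentiles in a streaming, mergeable way), so it illustrates the mapping rather than the framework's implementation. Function names are assumptions.

```python
import statistics

def fit_percentile_normalizer(values):
    """Return (p1, p99) using exact quantiles; a t-digest would approximate these."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[0], cuts[98]

def normalize(x, p1, p99):
    """Map p1 to -1 and p99 to +1 linearly, clipping anything outside."""
    scaled = 2.0 * (x - p1) / (p99 - p1) - 1.0
    return max(-1.0, min(1.0, scaled))

# 1000 well-behaved values in [0, 1) plus one corrupted outlier.
values = [i / 1000 for i in range(1000)] + [10000.0]
p1, p99 = fit_percentile_normalizer(values)

normalized_mid = normalize(0.5, p1, p99)
normalized_outlier = normalize(10000.0, p1, p99)

# For contrast: min/max scaling squeezes all the normal data next to -1.
minmax_mid = 2.0 * (0.5 - min(values)) / (max(values) - min(values)) - 1.0
```

With percentiles, the midpoint of the normal data lands near 0 and the outlier is simply clipped to +1; with min/max, the same midpoint lands almost at -1 because the outlier stretches the range.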

The t-digest data structure is mergeable: you can compute t-digests for each dataset independently, then merge them for the combined normalization. This enables dataset mixing without recomputing statistics from scratch.

Action representation details

Actions are represented as relative SE(3) transforms with action chunking:

Relative actions: Each action is a delta from the current pose, not an absolute pose. This makes the policy equivariant to the robot's starting position.
Action chunking: Instead of predicting one timestep, predict a window of future actions. Configurable past context (how many previous actions to condition on) and future window (how many to predict).
6D rotation: As described in Chapter 5, the first two columns of the rotation matrix. Continuous, no gimbal lock, no quaternion wrapping issues.

Data mixing for multi-task training: VLA Foundry supports mixing multiple datasets with configurable sampling weights. Each dataset can have its own normalization (computed independently, then merged via t-digest). The dataloader handles heterogeneous action dimensions by padding shorter action vectors.

Multi-modal co-training

During VLA training, VLA Foundry doesn't just train on robot data. It co-trains on a mix of robot trajectories and VLM data (image-text pairs). This prevents catastrophic forgetting of the VLM's visual understanding during action learning. The mixing ratio is a hyperparameter — too much VLM data slows action learning, too little causes the VLM features to degrade.

This is another benefit of the unified codebase: co-training across modalities is trivial because all data formats flow through the same loader. In a fragmented setup, you'd need to bridge two completely separate data pipelines.

Dataset mixing and sampling

When training on multiple robot datasets simultaneously, VLA Foundry supports configurable sampling weights. You might want 60% sim data and 40% real data, or 80% single-task and 20% cross-task. Each dataset's normalization statistics are computed independently and merged via t-digest, so adding a new dataset doesn't require recomputing global statistics.

The dataloader handles heterogeneous action dimensions gracefully: datasets with different numbers of joints are padded to a common dimensionality, with a mask indicating which dimensions are active. This lets you train a single model on data from a 6-DoF arm and a 7-DoF arm without custom engineering per dataset.
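A minimal sketch of weighted dataset sampling plus padding with a mask, in plain Python. The dataset names, the weights, the use of zeros for padding, and the masked loss are all assumptions made for illustration; the source only states that shorter vectors are padded and that a mask marks the active dimensions.

```python
import random

def pad_action(action, target_dim):
    """Zero-pad an action vector to target_dim and return (padded, mask)."""
    pad = target_dim - len(action)
    assert pad >= 0, "action is longer than the common dimensionality"
    return action + [0.0] * pad, [1] * len(action) + [0] * pad

def masked_squared_error(pred, target, mask):
    """Squared error over active dimensions only."""
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))

# Hypothetical datasets with different action sizes and sampling weights.
datasets = {
    "sim_6dof": {"weight": 0.6, "action": [0.1] * 6},
    "real_7dof": {"weight": 0.4, "action": [0.2] * 7},
}
common_dim = max(len(d["action"]) for d in datasets.values())

rng = random.Random(0)
names = list(datasets)
weights = [datasets[n]["weight"] for n in names]
drawn = rng.choices(names, weights=weights, k=1000)     # weighted dataset sampling

padded, mask = pad_action(datasets["sim_6dof"]["action"], common_dim)
# The prediction is wrong on the padded dimension, but the mask ignores it.
error = masked_squared_error([0.1] * 6 + [5.0], padded, mask)
```

Without the mask, the model would be penalized for whatever it outputs on dimensions that the 6-DoF dataset never defined.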

Why does VLA Foundry use t-digest percentile normalization instead of min/max?

t-digest is faster to compute
Percentiles are robust to outliers, and t-digests can be merged across datasets without recomputation
t-digest produces smaller normalization files

Chapter 7: Evaluation

How do you rigorously evaluate a VLA? Real-robot evaluation is expensive, noisy, and hard to reproduce. VLA Foundry uses the LBM benchmark — a simulation-based evaluation suite that tests manipulation at scale.

LBM benchmark

LBM (Large Behavior Model) is a benchmark with 49 bimanual manipulation tasks running in the Drake physics simulator. Tasks range from simple pick-and-place to complex multi-step sequences like assembling objects or manipulating deformable items.

Why simulation? Three reasons:

Scale: 200 rollout episodes per model per task. That's 9,800 episodes per model. Running this on real hardware would take weeks.
Reproducibility: Same initial conditions, same physics. No variance from hardware wear or environmental changes.
Speed: Drake runs faster than real-time, enabling rapid iteration.

Statistical analysis: STEP

Raw success rates are noisy, even at 200 episodes per task. A gap of 55% vs 50% success on a single task is likely within noise. VLA Foundry uses STEP (Statistical Testing for Evaluation of Policies) for rigorous comparison:

Bayesian estimates: Instead of point estimates (52.3% success), STEP produces posterior distributions. This lets you compute credible intervals and the probability that one model genuinely outperforms another.
CLD (Compact Letter Display): A multiple-comparison method that groups models into statistically distinguishable tiers. Models sharing a letter are not significantly different. This prevents cherry-picking: you can't claim your model is "better" if CLD puts them in the same group.

Why 200 episodes? With binary outcomes (success/fail), the standard error of a proportion p is sqrt(p(1-p)/n). At n=200 and p=0.5, the SE is ~3.5%. This means differences smaller than ~7 percentage points (2 SE) are within noise. The paper only claims significant differences when they exceed this threshold.
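The arithmetic above is easy to reproduce. The sketch below computes the standard error quoted in the text and, as a hedged aside, the standard error of a difference between two independently evaluated models, which combines the noise of both estimates. It is a frequentist illustration only and does not implement STEP.

```python
import math

def proportion_se(p, n):
    """Standard error of a success rate p estimated from n Bernoulli rollouts."""
    return math.sqrt(p * (1 - p) / n)

se = proportion_se(0.5, 200)
two_se = 2 * se

# When comparing two models, each estimate carries its own sampling noise,
# so the standard error of the difference combines both.
se_diff = math.sqrt(proportion_se(0.5, 200) ** 2 + proportion_se(0.55, 200) ** 2)

print(f"SE = {se:.4f}, 2 SE = {two_se:.4f}, SE of difference = {se_diff:.4f}")
```

Under these assumptions, a 5-point gap between two models on a single task is about one standard error of the difference, which is why per-task comparisons at this sample size need a formal test rather than eyeballing.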

What the benchmark tests

The 49 tasks are grouped into categories:

Single-arm pick-and-place: Basic grasping and placement accuracy
Bimanual coordination: Two arms working together (holding + manipulating)
Tool use: Manipulating objects with tools (spatulas, tongs)
Deformable objects: Cloth, rope, soft materials
Multi-step sequences: Tasks requiring 3+ sequential subtasks

Real vs simulation evaluation

An important limitation: LBM is simulation-only. Performance on simulated tasks doesn't guarantee real-world success. The paper acknowledges this gap directly — models trained only on real data score ~0% on the sim benchmark, and vice versa. The sim benchmark measures sim performance, period.

That said, the benchmark is still valuable for comparing models against each other. If Model A beats Model B on 49 simulated tasks with rigorous statistical testing, it's strong evidence that A is architecturally superior. The absolute numbers may not transfer to real hardware, but the relative ranking likely does.

Throughput benchmarks

Beyond task performance, VLA Foundry benchmarks its own training throughput on P5 nodes (8x H100 GPUs per node):

Scale	Strategy	Throughput
8 GPUs (1 node)	DDP	Baseline
64 GPUs (8 nodes)	DDP	~7.8x (near-linear)
128 GPUs (16 nodes)	DDP	~15x (near-linear)
8 GPUs	FSDP2	Slightly slower than DDP

At 1.2B parameters, the model fits comfortably in a single GPU's memory, so FSDP's sharding adds communication overhead without benefit. FSDP2 is included for future scaling to larger models where sharding becomes necessary.

What is the purpose of CLD (Compact Letter Display) in VLA Foundry's evaluation?

It groups models into statistically distinguishable tiers, preventing claims of superiority when differences are within noise
It compresses the model's action output for faster evaluation
It selects which tasks to include in the benchmark

Chapter 8: Results

The experiments answer three critical questions: How good are the released models? Does VLM backbone quality matter? When does multi-task training help?

Question 1: Absolute performance

Model	LBM Score	Notes
Prior closed-source LBM	~45%	Baseline (proprietary)
Foundry-VLA-1.7B-MT-sim	~45%	From-scratch, on par with LBM
Foundry-Qwen3VLA-2.1B-MT	~68%	+23 pp over LBM (!)

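The Qwen3-backbone model doesn't just match the prior state of the art; it beats it by 23 percentage points (~68% vs ~45%). That is more than three times the ~7 percentage point noise band implied by 200-episode evaluations, so the gap is well beyond statistical noise.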
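The size of that gap can be sanity-checked with a textbook two-proportion z-test. This is only a sketch: it assumes 200 independent episodes per model, which is a conservative stand-in since the source does not state how the overall score is aggregated across tasks, and it is not the STEP + CLD procedure the paper actually uses.

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """z statistic for the difference of two independent success rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Scores from the table above; n=200 per model is an assumption.
z = two_proportion_z(0.68, 0.45, 200, 200)
p_value = two_sided_p_value(z)
print(f"z = {z:.2f}, two-sided p = {p_value:.1e}")
```

Even with only 200 episodes per model, the z statistic lands above 4, far past the usual 1.96 cutoff. With more episodes behind each score the evidence would only get stronger.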

Question 2: Does VLM backbone quality matter?

The from-scratch 1.2B backbone (trained on 1T text tokens + 200M image-text pairs) matches but doesn't exceed the prior baseline. The Qwen3-VL 2B backbone (pretrained on vastly more data by the Qwen team) dramatically outperforms it. Same action head, same robot data, same training loop. What changes is the VLM backbone, which differs in both pretraining data and size (1.2B vs 2B parameters).

This is the paper's main finding: VLM backbone quality is the dominant factor in VLA performance. The implication is that investing in better vision-language pretraining should pay off more than collecting more robot data or improving the action head architecture, although the comparison described here varies only the backbone.

Question 3: Multi-task vs single-task

Multi-task (MT) training fits one model on all 49 tasks simultaneously. Single-task (ST) training fits a separate model per task. The results are nuanced:

Qwen3 backbone: MT + fine-tuning (FT) outperforms ST. The strong backbone can absorb the diversity of 49 tasks and find shared representations.
From-scratch backbone: MT is worse than ST. The weaker backbone can't absorb multi-task diversity — it gets confused trying to do everything at once.

This is a critical practical insight: multi-task training only helps when the backbone is strong enough. For weaker models, specialize.

Data domain ablations

Training Data	Sim Benchmark Result
Real-only	~0% (complete failure)
Sim-only	Best sim performance
Sim + real mixed	Worse than sim-only on sim benchmark

Real-only training transfers zero to simulation — the distribution shift between real and simulated visuals/dynamics is too large. And mixing real data into sim training actually hurts sim performance, because the model must accommodate two visual distributions instead of one.

The backbone quality gradient

Looking across all results, a clear hierarchy emerges:

Weakest: From-scratch 1.2B backbone + multi-task training. The backbone can't absorb task diversity.
Middle: From-scratch 1.2B backbone + single-task training. Specialization compensates for backbone weakness.
Strong: Qwen3-VL 2B backbone + single-task training. The strong backbone helps even without multi-task exposure.
Strongest: Qwen3-VL 2B backbone + multi-task + fine-tuning. The strong backbone absorbs diversity AND benefits from shared representations.

This gradient is the paper's most actionable finding. If you're building a VLA and choosing where to invest compute: upgrade your VLM backbone first. Only then invest in more robot data or fancier action heads.

Figure: Results comparison. LBM benchmark scores; higher is better.

When does multi-task training outperform single-task training in VLA Foundry?

Always, regardless of backbone quality
Only when the VLM backbone is strong enough to absorb multi-task diversity (e.g., Qwen3), not for weaker from-scratch backbones
Never; single-task always wins

Chapter 9: Connections

Cheat sheet

Aspect	VLA Foundry
Core idea	Unified LLM → VLM → VLA training in one codebase
Models released	Foundry-VLA-1.7B + Foundry-Qwen3VLA-2.1B
Action head	325M flow transformer (10 Euler steps)
Key mechanism	Observation token + multi-layer hidden state extraction
Action repr.	SE(3), 6D rotation, relative, chunked
Normalization	t-digest percentile, mergeable across datasets
Scale	Up to 128 GPUs (P5 nodes, 8xH100), near-linear DDP
Best result	Qwen3VLA outperforms LBM by +23 pp
Main finding	Stronger VLM backbone → stronger VLA
Config	YAML + Draccus frozen dataclasses
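The "10 Euler steps" entry refers to how a flow model turns noise into an action at inference time: it integrates a learned velocity field in a fixed number of steps. Here is a minimal sketch of that loop. The velocity function is a toy stand-in for the 325M flow transformer, and the time direction (noise at t=0, action at t=1), the 3-dimensional action, and all names are assumptions for illustration.

```python
import random


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the learned network: the velocity field that carries any
# point toward a fixed target along a straight line, arriving at t=1.
target_action = [0.3, -0.1, 0.8]


def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target_action, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target_action]
action = euler_sample(toy_velocity, noise, num_steps=10)
print([round(a, 6) for a in action])
```

Because the toy field follows straight lines, Euler integration reaches the target almost exactly. A learned field is only approximately straight, so the step count trades inference latency against accuracy; each step costs one forward pass of the action head.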

Related VLAs

Related lessons:
pi-0: flow matching for robot actions
pi-0.5: scaling VLAs
Diffusion Policy: diffusion for robot control
UMI: universal manipulation interface
DiT: diffusion transformer architecture

What VLA Foundry gets right

Open-source full-stack control. The first framework where you can ablate LLM pretraining decisions and measure the effect on robot success rate. This alone makes it a valuable research tool, independent of the specific models released.
Observation token simplicity. No complex cross-attention between VLM and action head. A single bottleneck point with multi-layer extraction. This makes the action head backbone-agnostic, enabling clean comparison experiments.
Honest evaluation. STEP + CLD means statistical claims are rigorous. No cherry-picked task subsets. The paper shows failures (real-only → sim = 0%) alongside successes.
The pipeline thesis. By demonstrating that VLM backbone quality dominates VLA performance, the paper reframes the VLA research agenda. The bottleneck isn't action learning — it's vision-language understanding.

How it compares

Framework	LLM Stage	VLM Stage	VLA Stage	Unified?
OpenVLA	External	External	Yes	No (VLA only)
pi-0	External	External	Yes	No (VLA only)
Octo	External	Partial	Yes	Partial
VLA Foundry	Yes	Yes	Yes	Full pipeline

Open questions

Real-world transfer: Sim-only and real-only training don't transfer to each other. How to bridge this gap? Domain randomization and sim-to-real techniques are the obvious next step, but VLA Foundry doesn't address them yet.
Scale beyond 2B: The backbone quality finding suggests bigger VLMs = better VLAs. Where does this scaling hit diminishing returns? The jump from 1.2B (from-scratch) to ~2B (Qwen3) was dramatic, but will 7B or 13B backbones yield similar gains?
Planning and memory: Like pi-0, VLA Foundry has no explicit planning or memory mechanism. Multi-step tasks requiring temporal reasoning remain challenging.
Observation token vs cross-attention: pi-0's cross-attention approach gives the action head access to all VLM tokens. VLA Foundry's observation token compresses everything through a bottleneck. Which approach wins at scale? The answer likely depends on task complexity — precise assembly might need fine-grained spatial access that a bottleneck loses.
Action head scaling: The 325M flow transformer is relatively small. Would a larger action head benefit from the richer features a strong VLM provides? Or is the bottleneck elsewhere?

What is VLA Foundry's observation token most similar to, conceptually?

A positional encoding that marks the action head's position
A CLS token in BERT: a single position that aggregates sequence-wide information for downstream use
A special end-of-sequence token that triggers action generation