How foundation models learned to move robots — translating pixels and instructions into physical actions in the real world.
Vision-Language Models (VLMs) can look at an image and answer questions. But what if instead of answering with words, the model answered with motor commands? That's the leap from VLM to VLA (Vision-Language-Action): a model that sees the world, understands a language instruction, and outputs physical actions.
The insight is deceptively simple: if a VLM can generate the token sequence "pick up the red cup," why can't it instead generate the action sequence [move_to(0.3, 0.5, 0.2), close_gripper()]? A VLA treats actions as just another modality — another kind of output token.
Let's trace the actual data flow. A camera captures an image: [3, 224, 224] — an RGB frame. A ViT vision encoder (like SigLIP) splits this into 14×14 = 196 patches and encodes each to a 768-dimensional vector, producing 196 visual tokens. Meanwhile, the language instruction "pick up the red cup" passes through a tokenizer into ~20-50 text tokens. These are concatenated and fed to the transformer backbone. The output: an action vector, typically 7 numbers for a robotic arm — 6 for the end-effector pose (x, y, z position + roll, pitch, yaw rotation) plus 1 for the gripper (open/close).
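The token counts in this data flow can be checked with a few lines of arithmetic. This is a minimal sketch: the 16-pixel patch size is an assumption (it is what yields a 14×14 grid on a 224-pixel image), and the text length of 30 is just one value inside the ~20-50 range quoted above.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT cuts from a square image."""
    per_side = image_size // patch_size
    return per_side * per_side

# Assumed patch size of 16 pixels: 224 / 16 = 14 patches per side.
visual_tokens = num_patches(224, 16)
text_tokens = 30  # assumed instruction length, within the ~20-50 range
sequence_length = visual_tokens + text_tokens

print(visual_tokens, sequence_length)  # 196 226
```

The backbone therefore sees a few hundred tokens per control step, most of them visual.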
Watch how the same backbone produces different outputs: text for VLMs, actions for VLAs.
The architectural pattern is identical: multimodal encoder → shared backbone → autoregressive decoder. The only difference is the output vocabulary. A VLM decodes from a 32K-token text vocabulary. A VLA decodes from a 256-bin action vocabulary (per dimension). Same attention, same context window, same next-token prediction loss. This is why VLAs can bootstrap from pre-trained VLMs — the backbone already knows how to fuse vision and language.
Where else have you seen "same architecture, different output head"? Think about how masked language models become classifiers with a single linear layer swap.
The simplest way to teach a robot: show it what to do. A human demonstrates the task (teleoperation), recording observations and actions. Then we train a neural network to predict the expert's action given the current observation. This is behavioral cloning (BC) — supervised learning for robotics.
In practice, BC training data looks like this: each episode is a sequence of (observation, action) pairs. The observation is a camera image [3, 224, 224]. The action is a 7-DOF vector [Δx, Δy, Δz, Δrx, Δry, Δrz, gripper]. A 10-second demonstration at 10Hz gives 100 training pairs. Typical VLA datasets have 100K-1M such episodes.
BC is simple but fragile. If the robot drifts slightly off the demonstrated path, it encounters states it has never seen before — the distribution shift problem. Small errors compound, and the robot spirals into failure.
The teal path is the expert demo. The red path is the BC agent. Watch errors compound as it drifts away.
How bad is compounding error in practice? If each step has a 1% chance of a small drift, after 100 steps the probability of still being on track is $0.99^{100} \approx 36.6\%$. After 500 steps: about 0.66%. This is why long-horizon tasks (cook a meal = thousands of steps) are so hard for BC alone. The fix: either make the policy so good that per-step error is near zero (scale + data + architecture), or add error recovery mechanisms (re-planning from new observations, action chunking).
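The survival arithmetic is easy to reproduce. The sketch below assumes, as the paragraph does, that each step drifts independently with the same probability; the function name is illustrative.

```python
def on_track_probability(per_step_drift: float, steps: int) -> float:
    """Probability that no drift has occurred after `steps` independent steps."""
    return (1.0 - per_step_drift) ** steps

for steps in (100, 500, 1000):
    print(steps, f"{on_track_probability(0.01, steps):.4%}")
```

Halving the per-step drift to 0.5% roughly squares-roots the failure odds, which is why small per-step improvements matter so much on long tasks.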
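```python
# Behavioral cloning training loop (simplified)
# Assumes policy, optimizer, and demonstration_dataset are already defined.
import torch.nn.functional as F

for obs, action in demonstration_dataset:
    # obs: [3, 224, 224] camera image
    # action: [7] expert's 6-DoF + gripper command
    predicted = policy(obs)                # [7] predicted action
    loss = F.mse_loss(predicted, action)
    optimizer.zero_grad()                  # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```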
How do you represent "move the arm"? This choice is critical — it determines what the model predicts, how precise it can be, and whether it transfers across robots. There are two fundamental frames: joint space (the angles of each motor) and task space (the position and orientation of the end-effector in the world).
| Representation | Format | Dims | Pros | Cons |
|---|---|---|---|---|
| Joint angles (absolute) | [θ1,...,θ7] | 7 | Direct motor control, full expressivity | Robot-specific, no transfer |
| End-effector pose (absolute) | [x, y, z, rx, ry, rz, grip] | 7 | Intuitive, more transferable | Needs IK solver, singularities |
| Delta end-effector | [Δx, Δy, Δz, Δrx, Δry, Δrz, grip] | 7 | Small values, relative motion | Errors accumulate over time |
| Discrete bins (RT-2 style) | Token index per dim | 7 tokens | Works with language models natively | Loses precision between bins |
A 7-DoF arm (like the Franka Panda) has 7 joint angles, each with a different range. Delta actions are the most common choice for VLAs: the model predicts how much to change each dimension, not the absolute target. A typical delta is tiny — (Δx=0.01m, Δy=−0.005m, Δz=0.02m) per timestep. This keeps predictions small and centered around zero, which is easier for neural networks to learn.
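To see how delta actions turn into motion, and why the table lists accumulated error as their weakness, here is a small sketch that integrates per-step deltas into absolute positions. The starting position and the 10-step horizon are illustrative assumptions.

```python
def integrate_deltas(start, deltas):
    """Accumulate per-step position deltas into a list of absolute positions."""
    pose = list(start)
    trajectory = [tuple(pose)]
    for delta in deltas:
        pose = [p + d for p, d in zip(pose, delta)]
        trajectory.append(tuple(pose))
    return trajectory

# Ten identical steps using the typical delta quoted above (metres).
steps = [(0.01, -0.005, 0.02)] * 10
path = integrate_deltas((0.30, 0.50, 0.20), steps)
final = path[-1]
print(final)
```

Because every absolute position is a running sum, a constant bias in the predicted delta shifts every later position by the same growing amount.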
Drag the slider to change the number of bins. More bins = more precision but larger vocabulary. The green dot shows the discretized position.
```python
# Action discretization (RT-2 style)
action_range = [-1.0, 1.0]  # normalized action space
n_bins = 256

def discretize(a):
    # Continuous [-1, 1] → nearest bin index [0, 255]
    return int(round((a + 1) / 2 * (n_bins - 1)))

def undiscretize(bin_idx):
    # Bin index [0, 255] → continuous [-1, 1]
    return bin_idx / (n_bins - 1) * 2 - 1

# Example: dx=0.3 → bin 166 → token "166"
```
A language model outputs a categorical distribution over a vocabulary of V tokens. To output a continuous 7-DOF action, you have two choices: (A) add a separate regression head (MLP → 7 floats), or (B) discretize each dimension into B bins and output 7 tokens from a vocabulary of B.
Your task: Show that approach (B) with B=256 bins loses at most 3.9mm of positioning precision for a 1m-reach robot, and explain why this is preferable to approach (A) even though it's "less precise."
Precision analysis: With B=256 bins over the normalized range [-1, 1], bin width = 2/256 ≈ 0.0078. If [-1, 1] maps to ±1m of reach (a 2m span), each bin covers about 7.8mm. Rounding to the nearest bin is off by at most half a bin, so the maximum error is about 3.9mm. This is well within manipulation tolerance (most pick-and-place tasks have ~1cm tolerance).
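The same bound can be computed directly. This sketch assumes the normalized range [-1, 1] corresponds to a 2m physical span (±1m of reach) and that values are rounded to the nearest bin.

```python
def max_quantization_error(span_m: float, n_bins: int) -> float:
    """Worst-case rounding error when a span is split into n_bins equal bins."""
    bin_width = span_m / n_bins
    return bin_width / 2

err = max_quantization_error(span_m=2.0, n_bins=256)
print(f"{err * 1000:.2f} mm")  # 3.91 mm
```

Doubling the bin count halves the error, at the cost of a larger action vocabulary.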
Why discretization wins over regression:
1. Multimodality: Cross-entropy over 256 bins can assign probability to multiple modes. MSE collapses them to the mean.
2. Architectural unity: Same transformer, same loss, same tokenizer pipeline. Zero engineering overhead to add a new modality.
3. Autoregressive conditioning: When outputting 7 action tokens sequentially, token 2 can attend to token 1. This captures correlations between action dimensions (e.g., if dx is large, gripper should stay open). Regression outputs all 7 dims independently.
4. Pre-training transfer: The model already knows how to predict discrete tokens from multimodal context. Adding action tokens to the vocabulary is a minimal perturbation.
The key insight: 3.9mm of precision is the price. Multimodality, architectural simplicity, and pre-training transfer are the payoff. For manipulation (not surgery), this trade-off is overwhelmingly positive.
```python
def tokenize_action(action, action_low, action_high, n_bins=256):
    tokens = []
    for i in range(len(action)):
        # Normalize to [0, 1]
        norm = (action[i] - action_low[i]) / (action_high[i] - action_low[i])
        # Clamp to [0, 1] (safety for out-of-range actions)
        norm = max(0.0, min(1.0, norm))
        # Map to bin index [0, n_bins-1]
        token = int(round(norm * (n_bins - 1)))
        tokens.append(token)
    return tokens

def detokenize_action(tokens, action_low, action_high, n_bins=256):
    actions = []
    for i in range(len(tokens)):
        # Bin index to [0, 1]
        norm = tokens[i] / (n_bins - 1)
        # Denormalize to robot-specific range
        val = norm * (action_high[i] - action_low[i]) + action_low[i]
        actions.append(val)
    return actions
```
Language models are trained with cross-entropy loss over a discrete vocabulary: given context, predict the probability distribution over the next token. This machinery — the softmax output layer, the cross-entropy gradient, the autoregressive generation loop — requires discrete outputs. By discretizing actions into 256 bins, each action dimension becomes "just another token" that the LM can predict using its existing training objective. No new loss function is needed. No architectural surgery. The key property is that cross-entropy over discrete bins can represent arbitrary distributions (including multimodal ones), while MSE regression over continuous outputs cannot. This is not just an engineering convenience — it's a fundamental representational advantage.
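A toy example makes the multimodality argument concrete. The data here is invented for illustration: demonstrations that pass an obstacle either on the left (around -0.5) or on the right (around +0.5). The best constant prediction under MSE is the mean, which lands near 0 where no expert ever went, while a histogram over 256 bins keeps both modes.

```python
import random
from collections import Counter

random.seed(0)
# Assumed toy demonstrations: half go left (-0.5), half go right (+0.5).
demos = [random.choice((-0.5, 0.5)) + random.gauss(0, 0.02) for _ in range(1000)]

# The constant that minimizes mean squared error is the sample mean.
mse_prediction = sum(demos) / len(demos)

def to_bin(a, n_bins=256):
    return int(round((a + 1) / 2 * (n_bins - 1)))

def from_bin(b, n_bins=256):
    return b / (n_bins - 1) * 2 - 1

# The empirical bin histogram is what a cross-entropy model would be fit to.
counts = Counter(to_bin(a) for a in demos)
most_likely = from_bin(counts.most_common(1)[0][0])
mass_near_mean = sum(1 for a in demos if abs(a) < 0.25) / len(demos)

print(f"MSE prediction:      {mse_prediction:+.3f}")
print(f"Most likely bin:     {most_likely:+.3f}")
print(f"Demo mass near zero: {mass_near_mean:.1%}")
```

Sampling from the histogram picks one side or the other; regressing to the mean steers straight into the obstacle.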
RT-2 (Robotics Transformer 2) by Google DeepMind is the landmark paper that proved VLMs can directly control robots. The key insight: fine-tune a VLM so that instead of generating text, it generates action tokens.
RT-2 takes a PaLM-E or PaLI-X vision-language model, discretizes robot actions into 256 bins per dimension, maps them to string tokens ("128", "64", "255"), and co-fine-tunes on robot demonstrations alongside web-scale vision-language data. The model's output for one timestep is literally a text string like "1 128 91 241 5 127 100" — seven numbers that get de-tokenized into 7-DOF continuous actions. No special action head. No architectural change. Just clever tokenization.
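De-tokenizing such a string is a few lines of parsing. The sketch below is an illustration, not RT-2's actual code: the function name is hypothetical, and the per-dimension ranges of [-1, 1] are assumed placeholders for robot-specific limits.

```python
def decode_action_string(text, low, high, n_bins=256):
    """Turn a space-separated string of bin indices into continuous actions."""
    bins = [int(tok) for tok in text.split()]
    if len(bins) != len(low):
        raise ValueError(f"expected {len(low)} action tokens, got {len(bins)}")
    return [lo + b / (n_bins - 1) * (hi - lo) for b, lo, hi in zip(bins, low, high)]

# Assumed normalized range for every dimension.
low, high = [-1.0] * 7, [1.0] * 7
action = decode_action_string("1 128 91 241 5 127 100", low, high)
print([round(a, 3) for a in action])
```

A malformed generation (too few or too many numbers) raises an error here; a real controller would need a policy for that case, such as repeating the previous action.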
See how a continuous 7-DOF action is tokenized into text. Each dimension maps to a bin number.
The training data mix is crucial. RT-2 co-trains on web-scale vision-language data (image captioning, VQA) alongside ~130K robot episodes. The VLM data keeps the model's language understanding intact while the robot data teaches it to output actions. Without the VLM co-training, the model forgets how to understand language. Without the robot data, it can't produce valid actions. The balance matters: typically ~50% VLM data, ~50% robot data per batch.
RT-2 co-trains on VLM data (image captioning, VQA) and robot data (observation → action tokens). The total loss is a weighted sum: $L = \lambda_{\text{VLM}} \cdot L_{\text{VLM}} + \lambda_{\text{action}} \cdot L_{\text{action}}$. Both are cross-entropy over the same vocabulary.
Your task: Explain why naive equal weighting ($\lambda_{\text{VLM}} = \lambda_{\text{action}} = 1$) destroys performance, and derive what ratio the field actually uses.
The problem with equal weighting:
Per-token cross-entropy loss for a batch: $L = \frac{1}{N} \sum_i -\log p(t_i \mid \text{context})$. If a VLM sample has 100 output tokens and a robot sample has 7, every token counts equally in this average, so each VLM sample contributes roughly 14x the gradient of a robot sample (100/7 ≈ 14.3).
The RT-2 solution:
1. Mix batches 50/50 (half robot episodes, half VLM examples)
2. For robot samples: compute loss ONLY on the 7 action tokens (mask out the instruction tokens from loss computation)
3. For VLM samples: compute loss on text continuation tokens as normal
4. Normalize each loss by its token count before summing
This gives: $L = \frac{1}{7} \sum_{\text{action}} \text{CE} + \frac{1}{N_{\text{text}}} \sum_{\text{text}} \text{CE}$
Effectively: per-token loss is equal across modalities, but the 50/50 data ratio ensures roughly equal gradient magnitude.
The key insight: You're not weighting losses — you're weighting data. The 50/50 mix ratio IS the loss balance. This is why OpenVLA and subsequent work report data ratios, not loss weights.
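The masking and per-sample normalization steps can be sketched without any ML framework. The token probabilities below are made-up numbers chosen so both samples have the same per-token loss; the function name is illustrative.

```python
import math

def masked_mean_nll(token_probs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    terms = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(terms) / len(terms)

# Robot sample: 20 instruction tokens (masked out) followed by 7 action tokens.
robot_probs = [0.9] * 20 + [0.5] * 7
robot_mask = [0] * 20 + [1] * 7
# VLM sample: loss on all 100 text continuation tokens.
vlm_probs = [0.5] * 100
vlm_mask = [1] * 100

robot_loss = masked_mean_nll(robot_probs, robot_mask)
vlm_loss = masked_mean_nll(vlm_probs, vlm_mask)

# Without normalization, the summed losses differ by the token-count ratio.
summed_ratio = (vlm_loss * 100) / (robot_loss * 7)

print(f"robot {robot_loss:.4f}  vlm {vlm_loss:.4f}  unnormalized ratio {summed_ratio:.1f}x")
```

After normalization both samples contribute the same loss, so the balance between modalities is set entirely by how many samples of each kind go into the batch.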
What if instead of predicting a single action, the robot could sample from a distribution over actions? Diffusion Policy applies the same denoising process used in image generation to robot action prediction.
Starting from random noise, the model iteratively refines an action trajectory. This naturally handles multimodal action distributions — when there are multiple valid ways to do something (reach from the left or right), the diffusion model can represent all of them.
Concretely, the input is noise of shape [k, action_dim], say [50, 7] for 50 future timesteps of 7-DOF actions. The model runs 10-20 denoising steps. Each step predicts a velocity field and applies an Euler update: $a_{t+1} = a_t + v(a_t, t) \cdot dt$. At ~2ms per step × 10 steps, the total denoising takes ~20ms, fast enough for real-time control.
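```python
# Flow matching action denoising (simplified)
# Assumes model, obs, instruction, k, and action_dim are already defined.
import torch

n_steps = 10
dt = 1.0 / n_steps
a = torch.randn(k, action_dim)             # [50, 7] random noise
for i in range(n_steps):
    t = torch.tensor(i * dt)               # t = 0.0, 0.1, ..., 0.9
    v = model(a, t, obs, instruction)      # predict velocity
    a = a + v * dt                         # Euler step, ~2ms
# a is now [50, 7] clean action trajectory
```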
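To watch the Euler update converge without a trained network, the learned model can be swapped for a hand-written velocity field. This is a toy stand-in, not how a real policy works: the field below already knows the target action and points straight at it along a linear path, so ten Euler steps carry the noise sample onto the target.

```python
import random

def velocity(a, t, target):
    """Toy velocity field for a straight-line path from the sample to the target."""
    return [(g - x) / (1.0 - t) for x, g in zip(a, target)]

def denoise(target, n_steps=10, seed=0):
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in target]      # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt                                 # t = 0.0, 0.1, ..., 0.9
        v = velocity(a, t, target)
        a = [x + vi * dt for x, vi in zip(a, v)]   # Euler update
    return a

# Assumed 7-dim target action: small deltas plus a closed gripper.
target = [0.01, -0.005, 0.02, 0.0, 0.0, 0.0, 1.0]
result = denoise(target)
print([round(x, 4) for x in result])
```

In a real policy the velocity comes from the network conditioned on the observation and instruction, and the endpoint is a sample from the learned action distribution rather than a fixed target.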
Watch random noise get denoised into a clean action trajectory. Each step removes noise. The green path is the final clean trajectory.
Tokenization and diffusion solve the same problem (multimodal action distributions) with different tools. Tokenization reuses the LM's existing machinery; diffusion adds a new generation mechanism but keeps full continuous precision. A third road, RL fine-tuning (covered in the RL lesson), goes further: instead of imitating demonstrations, the robot optimizes a reward signal online. Current VLAs use offline imitation learning (IL), meaning behavioral cloning plus architecture tricks; the field is moving toward hybrid IL+RL, where a VLA pre-trained on demos is then refined with online reward.
The same "discrete vs continuous" tradeoff appears in image generation (VQ-VAE tokens vs diffusion pixels). Why did images converge on diffusion while VLAs still use both approaches?
Predicting one action at a time is reactive and jerky. Action Chunking (from ACT — Action Chunking with Transformers) predicts an entire sequence of future actions at once. Instead of "what should I do now?", the model answers "what should I do for the next k steps?"
The model output shape is [k, action_dim] — for example, [50, 7] means 50 future actions at 50Hz, covering exactly 1 second of motion. The robot executes these actions open-loop (without re-reading the camera), then re-plans from the new observation.
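Why chunk instead of predicting one step at a time? Two big reasons: smoother motion, because one prediction covers a coherent stretch of trajectory instead of a series of independent reactive guesses, and less compounding error, because the policy re-observes a possibly drifted state only once per chunk rather than at every step.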
Compare single-step (reactive) control with chunked (planned) control. Notice how chunking is smoother.
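```python
# Action chunking with temporal ensemble (simplified)
# Assumes model, robot, observation, prev_chunk, offset, replan_interval,
# and blend weights with w_new + w_old == 1 are already defined.
chunk = model(observation)                 # [k, 7] = 50 future actions
for i in range(replan_interval):           # execute 5-10 steps
    # Blend the new plan with the previous chunk's prediction for the same timestep
    a_blend = w_new * chunk[i] + w_old * prev_chunk[i + offset]
    robot.execute(a_blend)                 # 50Hz motor commands
# Then replan from new observation
```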
The chunk size k is a critical hyperparameter. Too small (k=1) and you're back to single-step prediction with all its jitter. Too large (k=200) and the robot is flying blind for 4 seconds before it re-observes the world — if something changes (a human moves an object), the robot won't notice. In practice, k=10-50 at 50Hz (0.2-1.0 seconds) works well for manipulation. Fast tasks (catching) need smaller chunks; slow tasks (assembly) tolerate larger ones.
Single-step BC makes T predictions for a T-step task, each conditioned on the (potentially drifted) current state. Action chunking with chunk size k makes T/k predictions, each producing k actions executed open-loop.
Your task: Model the compounding error for both approaches. Assume each prediction has independent error ε per step. Show that chunking reduces the number of error-injection points from T to T/k, and explain the tradeoff.
Single-step compounding model:
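Let $s_{t+1} = f(s_t, \pi(s_t))$, where $\pi$ has per-step error $\varepsilon$. At each step, the policy sees a state $s_t$ that may be $\delta_t$ away from the expert's state. If the policy's error grows with state deviation, $\text{error}(s_t) \approx \varepsilon + \alpha\delta_t$, then $\delta_{t+1} = \delta_t + \varepsilon + \alpha\delta_t = (1+\alpha)\delta_t + \varepsilon$.
Solving with $\delta_0 = 0$: $\delta_T = \varepsilon \cdot \frac{(1+\alpha)^T - 1}{\alpha}$. For $\alpha > 0$, this is exponential in $T$.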
Chunked compounding model:
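With chunk size $k$, the model makes predictions at steps $0, k, 2k, \ldots, T-k$. Only at these $T/k$ points does the feedback loop re-engage. Within each chunk, errors are additive (open-loop): intra-chunk drift $= k\varepsilon$. Between chunks, the compounding formula applies but with only $T/k$ steps:
$\delta_T = k\varepsilon \cdot \frac{(1+\alpha)^{T/k} - 1}{\alpha}$
For $T=100$, $k=10$, $\alpha=0.05$: single-step gives $\delta_T = \varepsilon \cdot (1.05^{100} - 1)/0.05 \approx 2610\varepsilon$. Chunked gives $\delta_T = 10\varepsilon \cdot (1.05^{10} - 1)/0.05 \approx 10\varepsilon \cdot 12.6 \approx 126\varepsilon$. Under this model, that is roughly a 20x reduction in compounding.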
The key insight: Chunking doesn't eliminate per-step error — it eliminates the feedback loop that amplifies it. Fewer observations = fewer chances for the policy to see OOD states and produce amplified errors. The cost: reduced reactivity during each chunk window.
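Both recurrences can be iterated directly and compared against their closed forms. The sketch below uses the same simplified error model as the derivation, with ε set to 1 so the results read as multiples of ε; the function names are illustrative.

```python
def single_step_drift(T, alpha, eps=1.0):
    """Iterate delta <- (1 + alpha) * delta + eps once per step."""
    delta = 0.0
    for _ in range(T):
        delta = (1 + alpha) * delta + eps
    return delta

def chunked_drift(T, k, alpha, eps=1.0):
    """Iterate delta <- (1 + alpha) * delta + k * eps once per chunk."""
    delta = 0.0
    for _ in range(T // k):
        delta = (1 + alpha) * delta + k * eps
    return delta

def closed_form(n, alpha, inject):
    """inject * ((1 + alpha)^n - 1) / alpha, the geometric-series solution."""
    return inject * ((1 + alpha) ** n - 1) / alpha

single = single_step_drift(T=100, alpha=0.05)
chunked = chunked_drift(T=100, k=10, alpha=0.05)
print(f"single-step: {single:.1f} eps, chunked: {chunked:.1f} eps, ratio: {single / chunked:.1f}x")
```

The ratio depends strongly on α: as α approaches 0 both models reduce to purely additive drift of T·ε and chunking gives no advantage under this model.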
A single fixed chunk size cannot serve both fast, reactive tasks and slow, precise ones. At 50Hz, catching requires k=1-3 (react within 20-60ms) while threading can use k=20-50 (smooth, deliberate motion over 0.4-1s). Solutions:
1. Adaptive chunking: predict a confidence/uncertainty score alongside the action chunk; when confidence is low (novel situation, fast dynamics), shorten the chunk and replan sooner.
2. Hierarchical control: a high-level planner decides chunk size per phase, while the action model generates chunks of that size.
3. Variable execution horizon: always predict a long chunk (k=50), but only execute the first few steps before replanning. The ratio of executed/predicted steps can vary by task phase. This is what ACT's temporal ensemble does implicitly: it replans every 5-10 steps regardless of chunk size, using overlapping predictions to maintain smoothness at boundaries.
The key insight: chunk size should be a function of environmental dynamics, not a fixed hyperparameter.
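One way to picture an adaptive execution horizon is a simple mapping from an uncertainty score to the number of predicted actions to execute before replanning. Everything in this sketch is hypothetical: the function name, the linear mapping, and the assumption that the policy emits an uncertainty score in [0, 1].

```python
def execution_horizon(uncertainty: float, k_min: int = 1, k_max: int = 50) -> int:
    """Map an uncertainty score in [0, 1] to a number of actions to execute.

    Low uncertainty: run most of the chunk open-loop.
    High uncertainty: execute only a step or two, then replan.
    """
    u = min(1.0, max(0.0, uncertainty))   # clamp out-of-range scores
    return max(k_min, round(k_max - u * (k_max - k_min)))

for u in (0.0, 0.2, 0.8, 1.0):
    print(u, execution_horizon(u))
```

A linear mapping is only the simplest choice; a thresholded or learned schedule would fit the same interface.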
OpenVLA is the open-source counterpart to RT-2: a 7B-parameter VLA built on Llama 2 + SigLIP, trained on the Open X-Embodiment dataset. It demonstrated that smaller, open models can rival proprietary giant VLAs.
The data flow in concrete shapes: a 224×224 image enters SigLIP and becomes 256 visual tokens of dimension 4096 (after MLP projection). The text instruction is tokenized into ~20-30 tokens. Llama 2 processes all ~280 tokens autoregressively and outputs 7 special action tokens. Each action token is one of 256 possible values (bins), which gets linearly de-mapped to a continuous float in [-1, 1], then scaled to the robot's action range.
```python
# OpenVLA inference (simplified pseudocode)
image = camera.capture()                          # [3, 224, 224]
visual_tokens = siglip(image)                     # [256, 1024]
visual_tokens = mlp_proj(visual_tokens)           # [256, 4096] (match LLM dim)
text_tokens = tokenizer("pick up the cup")        # [22, 4096] after embedding
input_seq = concat(visual_tokens, text_tokens)    # [278, 4096]
action_tokens = llama2.generate(input_seq, n=7)   # 7 ints in [0, 255]
action = undiscretize(action_tokens)              # [7] floats in [-1, 1]
robot.execute(action * action_scale)              # send to robot
```
Trace the path from image + instruction to robot action. Each color represents a different processing stage.
| Strategy | How | Pro | Con |
|---|---|---|---|
| Full fine-tune | Update all VLM weights on robot data | Maximum expressivity | Expensive, catastrophic forgetting of language/vision |
| Frozen VLM + action head | Freeze VLM, add small MLP for actions | Cheap, preserves VLM | Limited — VLM features may not capture motor-relevant info |
| Mixture-of-Transformers | Separate expert params for vision/language vs action tokens | Best of both — preserves VLM, expressive for actions | Complex architecture (π0's approach) |
OpenVLA takes the full fine-tune path for its main training run and supports LoRA for parameter-efficient adaptation to new tasks. π0 (Physical Intelligence) takes the third path: its transformer has separate feed-forward experts for different modalities, so action tokens get their own parameters while vision and language tokens keep their pre-trained weights intact.
```python
# Mixture-of-Transformers (pi0-style, simplified)
import torch
import torch.nn as nn

VISION, TEXT, ACTION = 0, 1, 2  # integer modality ids (tensors cannot hold strings)

class MoTLayer(nn.Module):
    # shared_attn, vision_ffn, text_ffn, action_ffn are assumed to be built in __init__
    def forward(self, x, modality_ids):
        # x: [seq, dim]; modality_ids: [seq] tensor of VISION / TEXT / ACTION
        # Shared attention across all tokens
        x = self.shared_attn(x)
        # Separate FFN per modality
        out = torch.zeros_like(x)
        experts = ((VISION, self.vision_ffn), (TEXT, self.text_ffn), (ACTION, self.action_ffn))
        for mod_id, ffn in experts:
            mask = modality_ids == mod_id
            out[mask] = ffn(x[mask])
        return out
```
A human can watch someone else cook and learn the recipe, even though their arms are different lengths. Can robots do the same? Cross-embodiment learning trains on data from many different robot types, hoping that high-level task knowledge transfers even when the hardware differs.
The Open X-Embodiment (OXE) dataset combined demonstrations from 22+ robot embodiments: single-arm manipulators, dual-arms, quadrupeds, dexterous hands. The finding: a single policy trained on all this data outperforms robot-specific policies on most tasks.
The challenge: different robots have different action spaces. A Franka Panda has 7 joints. A WidowX has 6. A quadruped has 12 (3 per leg). How does a single model handle this? The common approach is to normalize to end-effector deltas: regardless of the underlying robot, the model predicts Δxyz + Δrotation + gripper. The robot's own IK solver converts this to joint commands. This way, the VLA never needs to know joint configurations — it thinks in task space.
Different robots contribute demonstrations to a shared policy. Toggle embodiments on/off.
```python
# Cross-embodiment action normalization
def normalize_action(action, embodiment_config):
    # Different robots have different action ranges
    # Franka: dx in [-0.05, 0.05]m, WidowX: dx in [-0.03, 0.03]m
    low, high = embodiment_config['action_range']
    return (action - low) / (high - low) * 2 - 1  # map to [-1, 1]

# During training: normalize all actions to [-1, 1]
# During inference: denormalize back to robot-specific range
```
Real-world solution (based on ALOHA 2 + π0 architecture):
1. Action space: Predict 16-dim action chunks using flow matching (continuous, not tokenized). All 16 dims are predicted jointly so left/right arm coordination is implicit in the denoising process. Shape: [chunk_k, 16] — typically k=50 giving 50/30Hz = 1.67 seconds of coordinated bimanual motion per inference call.
2. Latency: Action chunking solves this directly. One forward pass takes ~50ms but produces k=50 actions, which is 1.67 seconds of motion at 30Hz, so inference uses only about 3% of the time the chunk takes to execute. The robot executes the pre-planned chunk while the model starts computing the next one (pipelining).
3. Coordination: Flow matching over the joint [left_arm, right_arm] space naturally captures correlations. During denoising, the velocity field v(at, t) for the left arm's x-position depends on where the right arm is going. Joint training on bimanual demos teaches these coordination patterns implicitly.
4. Visual compression: Use a perceiver/resampler (like Q-Former) to compress 784 visual tokens down to 64-128 learned queries. Alternatively, process each camera independently and concatenate reduced representations. π0 uses the Mixture-of-Transformers approach: visual tokens pass through shared attention but separate FFNs, keeping the attention cost manageable.
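The latency budget in item 2 comes down to one division. This sketch uses the numbers quoted in the list (k=50, 30Hz control, ~50ms per forward pass); the function name is illustrative.

```python
def inference_duty_cycle(chunk_size: int, control_hz: float, inference_ms: float) -> float:
    """Fraction of a chunk's execution time spent computing the next chunk."""
    chunk_duration_ms = chunk_size / control_hz * 1000.0
    return inference_ms / chunk_duration_ms

duty = inference_duty_cycle(chunk_size=50, control_hz=30, inference_ms=50)
print(f"{duty:.1%} of each chunk window is spent on inference")
```

A duty cycle well below 100% is what makes pipelining possible; if the robot replans after only a few steps instead of the full chunk, the relevant chunk duration shrinks and the budget tightens accordingly.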
Real robot data is expensive and slow to collect. A single human demonstrator might produce 10 episodes per hour — at that rate, 100K episodes takes 10,000 hours of human labor. Simulation is cheap and infinitely scalable: a GPU cluster can generate 100K episodes in hours. But policies trained in sim often fail in the real world — the reality gap. Visual differences (lighting, textures), physics mismatches (friction, contact), and sensor noise all contribute.
See how domain randomization helps bridge sim and real. Each refresh randomizes the simulated environment.
| Technique | How | What it randomizes |
|---|---|---|
| Visual DR | Change appearance every episode | Textures, colors, lighting direction, camera position (±5cm), shadows |
| Physics DR | Change dynamics every episode | Friction (0.3-1.0), mass (±20%), motor delay (0-50ms), gravity (±2%) |
| System ID | Calibrate sim to match real | Measured real-world physics parameters |
| Sim + Real co-train | Mix sim and real data in training | Nothing — real data fills remaining gap |
| Foundation Model encoder | Use VLM vision backbone | Nothing — pre-trained features bridge visual gap |
This is a fundamental shift from pre-VLA sim-to-real. Traditional approaches had to randomize textures and lighting until the sim "covered" the real distribution. With a foundation-model vision encoder, the representations are already invariant to these variations. The remaining gap is in physics: contact dynamics, friction, deformable objects. Domain randomization still helps here, but the visual gap — historically the hardest part — is largely solved by pre-training.
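Physics randomization is typically a per-episode parameter draw. The sketch below samples the ranges listed in the table; the parameter names, nominal mass, and nominal gravity are assumptions, and wiring the result into a particular simulator is left out.

```python
import random

def sample_physics(rng, nominal_mass_kg=1.0, nominal_gravity=9.81):
    """Draw one set of randomized physics parameters for a simulated episode."""
    return {
        "friction": rng.uniform(0.3, 1.0),                       # 0.3-1.0
        "mass_kg": nominal_mass_kg * rng.uniform(0.8, 1.2),      # ±20%
        "motor_delay_ms": rng.uniform(0.0, 50.0),                # 0-50ms
        "gravity": nominal_gravity * rng.uniform(0.98, 1.02),    # ±2%
    }

rng = random.Random(0)
episodes = [sample_physics(rng) for _ in range(5)]
for params in episodes:
    print({name: round(value, 3) for name, value in params.items()})
```

Drawing a fresh set per episode means the policy never sees the same dynamics twice, which pushes it toward behaviors that tolerate the real robot's unknown parameters.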
The practical workflow for a VLA team today: (1) collect ~50-100 real demos for a new task (~5-10 hours of teleop at ~10 episodes per hour), (2) optionally augment with 10K sim demos using domain randomization, (3) fine-tune the pre-trained VLA on this combined dataset for a few hours on 8 GPUs, (4) deploy and iterate. The pre-trained VLA already knows general manipulation from OXE data; the fine-tuning teaches it the specific task and environment. This recipe can get a new pick-and-place task working in under 24 hours.
We're at the beginning of the VLA era. Current models can follow simple instructions in controlled settings, but the gap to human-level dexterity and generalization remains vast. Let's ground the current state in real numbers.
Key open challenges:
| Challenge | Current State | What's Needed |
|---|---|---|
| Dexterity | Basic grasping | In-hand manipulation, tool use |
| Long-horizon | 1-2 step tasks | Multi-step planning, error recovery |
| Safety | Lab environments | Real-world safety guarantees |
| Data scale | ~1M episodes | Internet-scale robot data |
| Speed | ~12Hz with chunking | Reactive control for dynamic tasks (100Hz+) |
Key milestones in the journey from pure VLMs to embodied agents.
You now understand the path from seeing to acting — the data flow from camera pixels to motor commands, the engineering choices of action representations and chunking, and the architectural decisions that bridge billion-parameter language models to 7-DOF robot arms. VLAs are teaching machines to reach out and touch the world.