microVLA — From Vision-Language to Robotic Action

Chapter 0: From Seeing to Acting

Vision-Language Models (VLMs) can look at an image and answer questions. But what if instead of answering with words, the model answered with motor commands? That's the leap from VLM to VLA (Vision-Language-Action): a model that sees the world, understands a language instruction, and outputs physical actions.

The insight is deceptively simple: if a VLM can generate the token sequence "pick up the red cup," why can't it instead generate the action sequence [move_to(0.3, 0.5, 0.2), close_gripper()]? A VLA treats actions as just another modality — another kind of output token.

The big idea: A VLA is a VLM whose output vocabulary has been extended to include robot actions. See + Understand + Act, all in one forward pass.

Let's trace the actual data flow. A camera captures an image: [3, 224, 224] — an RGB frame. A ViT vision encoder (like SigLIP) splits this into 14×14 = 196 patches and encodes each to a 768-dimensional vector, producing 196 visual tokens. Meanwhile, the language instruction "pick up the red cup" passes through a tokenizer into ~20-50 text tokens. These are concatenated and fed to the transformer backbone. The output: an action vector, typically 7 numbers for a robotic arm — 6 for the end-effector pose (x, y, z position + roll, pitch, yaw rotation) plus 1 for the gripper (open/close).

Camera: [3, 224, 224] RGB image
↓ ViT (SigLIP)
Visual tokens: 196 tokens × 768 dims
+ concatenate
Text tokens: ~20-50 tokens × 768 dims
↓ Transformer backbone
Action output: [7] = 6-DoF pose + gripper
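
The token counts in this flow follow from simple patch arithmetic. Here is a minimal sketch, assuming square images and square non-overlapping patches with no class token. The function names `vit_token_count` and `sequence_length` are hypothetical helpers for illustration, not part of any library.

```python
def vit_token_count(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches a ViT sees (no class token)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def sequence_length(n_visual: int, n_text: int) -> int:
    """Length of the concatenated [visual tokens, text tokens] sequence."""
    return n_visual + n_text


n_visual = vit_token_count(224, 16)   # 14 x 14 grid of patches
print(n_visual, sequence_length(n_visual, 20), sequence_length(n_visual, 50))
```

With a patch size of 14 the same 224×224 image gives a 16×16 grid, that is 256 visual tokens, which is the count quoted for OpenVLA in Chapter 6.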

VLM vs VLA Pipeline

Watch how the same backbone produces different outputs: text for VLMs, actions for VLAs.

The numbers: A typical manipulation action space is just 7 floats: [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]. That's 28 bytes of output per timestep. Compare that to a VLM generating text: each token is one of 32K+ possible values, and a response might be 100+ tokens. The action space is tiny and continuous — which is why discretizing it into 256 bins per dimension is practical. 256⁷ ≈ 7×10¹⁶ possible actions seems large, but the model only ever outputs 7 tokens, not a combinatorial search.
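
These figures are easy to check. The sketch below assumes the 7 floats are stored as 32-bit values (which is what the 28-byte figure implies). The example action values are illustrative only.

```python
import struct

action = [0.01, -0.005, 0.02, 0.0, 0.0, 0.0, 1.0]   # illustrative delta action
payload = struct.pack("<7f", *action)                # seven little-endian float32 values
n_bytes = len(payload)

n_bins, n_dims = 256, 7
n_joint_actions = n_bins ** n_dims                   # size of the joint discrete action space
tokens_emitted = n_dims                              # the model emits one token per dimension

print(n_bytes, n_joint_actions, tokens_emitted)
```

The joint space is huge, but the model never enumerates it: it makes 7 separate 256-way choices, one per dimension.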

Check: What is the key difference between a VLM and a VLA?

(a) VLAs use a different vision encoder
(b) VLAs output physical actions instead of (or in addition to) text
(c) VLAs don't use language at all

🔗 Pattern Recognition

VLA = VLM + Action Head

This Lesson (VLA)

Input: [image, text] → Transformer → Output: action tokens [7 floats]

VLM Lesson

Input: [image, text] → Transformer → Output: text tokens [variable length]

The architectural pattern is identical: multimodal encoder → shared backbone → autoregressive decoder. The only difference is the output vocabulary. A VLM decodes from a 32K-token text vocabulary. A VLA decodes from a 256-bin action vocabulary (per dimension). Same attention, same context window, same next-token prediction loss. This is why VLAs can bootstrap from pre-trained VLMs — the backbone already knows how to fuse vision and language.

Where else have you seen "same architecture, different output head"? Think about how masked language models become classifiers with a single linear layer swap.

Chapter 1: Behavioral Cloning

The simplest way to teach a robot: show it what to do. A human demonstrates the task (teleoperation), recording observations and actions. Then we train a neural network to predict the expert's action given the current observation. This is behavioral cloning (BC) — supervised learning for robotics.

θ* = argmin_θ Σ_t ||f_θ(o_t) − a_t||²,  with the learned policy π(o) = f_θ*(o)

In practice, BC training data looks like this: each episode is a sequence of (observation, action) pairs. The observation is a camera image [3, 224, 224]. The action is a 7-DOF vector [Δx, Δy, Δz, Δrx, Δry, Δrz, gripper]. A 10-second demonstration at 10Hz gives 100 training pairs. Typical VLA datasets have 100K-1M such episodes.

BC is simple but fragile. If the robot drifts slightly off the demonstrated path, it encounters states it has never seen before — the distribution shift problem. Small errors compound, and the robot spirals into failure.

Behavioral Cloning: Compounding Errors

The teal path is the expert demo. The red path is the BC agent. Watch errors compound as it drifts away.

Why it still matters: Despite its flaws, BC is the data-collection backbone of all VLAs. RT-1, RT-2, and OpenVLA all learn from human demonstrations. The trick is combining BC with architectures powerful enough to generalize.

How bad is compounding error in practice? If each step independently has a 1% chance of drifting off-track, the probability of still being on-track after 100 steps is 0.99¹⁰⁰ ≈ 37%. After 500 steps it is about 0.66%. This is why long-horizon tasks (cook a meal = thousands of steps) are so hard for BC alone. The fix: either make the policy so good that per-step error is near zero (scale + data + architecture), or add error recovery mechanisms (re-planning from new observations, action chunking).
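
The survival probabilities above can be reproduced in a few lines. This is a sketch under the same simplifying assumption as the text: each step fails independently with a fixed probability. The helper name `on_track_probability` is hypothetical.

```python
def on_track_probability(per_step_success: float, n_steps: int) -> float:
    """Probability of never drifting, assuming independent per-step failures."""
    return per_step_success ** n_steps


for steps in (100, 500, 1000):
    p = on_track_probability(0.99, steps)
    print(f"{steps:5d} steps: {p:.4%} chance of staying on-track")
```

Real rollouts are not independent from step to step (a drifted state makes the next error more likely), so this model is optimistic rather than pessimistic.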

python
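# Behavioral cloning training loop (simplified)
# Assumes policy, optimizer, and demonstration_dataset are defined elsewhere
import torch.nn.functional as F

for obs, action in demonstration_dataset:
    # obs: [3, 224, 224] camera image
    # action: [7] expert's 6-DoF + gripper command
    optimizer.zero_grad()             # clear gradients left over from the previous step
    predicted = policy(obs)           # [7] predicted action
    loss = F.mse_loss(predicted, action)
    loss.backward()
    optimizer.step()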

Check: What is the main failure mode of behavioral cloning?

(a) It requires too much compute
(b) Small errors compound because the agent visits unseen states
(c) It can only handle one task

⚔ Adversarial: Your VLA trained on teleoperation data always moves slower than the human demonstrator. The actions are correctly tokenized and de-tokenized. The bin resolution is sufficient. What's causing this?

You collected 500 teleoperation demos at 10Hz. Each demo has a human smoothly reaching for objects at natural speed. Your trained VLA reaches for the same objects but takes 2-3x longer, even though the predicted action magnitudes look reasonable individually.

(a) The vision encoder is too slow, creating inference latency
(b) MSE loss causes mode-averaging: when demos vary in trajectory, the policy predicts smaller, cautious actions that average between the modes
(c) The discretization bins are too coarse

Chapter 2: Action Representations

How do you represent "move the arm"? This choice is critical — it determines what the model predicts, how precise it can be, and whether it transfers across robots. There are two fundamental frames: joint space (the angles of each motor) and task space (the position and orientation of the end-effector in the world).

Representation	Format	Dims	Pros	Cons
Joint angles (absolute)	[θ₁,...,θ₇]	7	Direct motor control, full expressivity	Robot-specific, no transfer
End-effector pose (absolute)	[x, y, z, rx, ry, rz, grip]	7	Intuitive, more transferable	Needs IK solver, singularities
Delta end-effector	[Δx, Δy, Δz, Δrx, Δry, Δrz, grip]	7	Small values, relative motion	Errors accumulate over time
Discrete bins (RT-2 style)	Token index per dim	7 tokens	Works with language models natively	Loses precision between bins

A 7-DoF arm (like the Franka Panda) has 7 joint angles, each with a different range. Delta actions are the most common choice for VLAs: the model predicts how much to change each dimension, not the absolute target. A typical delta is tiny — (Δx=0.01m, Δy=−0.005m, Δz=0.02m) per timestep. This keeps predictions small and centered around zero, which is easier for neural networks to learn.
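
To make the delta representation concrete, here is a small sketch that integrates per-step deltas into absolute positions. Only the translational part is shown, and `apply_deltas` is a hypothetical helper. A real controller would also compose rotations and enforce workspace limits.

```python
def apply_deltas(start_pose, deltas):
    """Integrate per-step delta actions into absolute poses.

    start_pose: [x, y, z] in meters; deltas: list of [dx, dy, dz] per timestep.
    Returns the list of poses visited, including the start.
    """
    poses = [list(start_pose)]
    for delta in deltas:
        prev = poses[-1]
        poses.append([p + d for p, d in zip(prev, delta)])
    return poses


deltas = [[0.01, -0.005, 0.02]] * 10        # the same small delta for 10 timesteps
trajectory = apply_deltas([0.30, 0.50, 0.20], deltas)
print([round(v, 3) for v in trajectory[-1]])
```

Because every pose is the sum of all previous deltas, a small bias in the predicted deltas shifts every later pose. This is the "errors accumulate over time" drawback listed in the table.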

Action Discretization

Drag the slider to change the number of bins. More bins = more precision but larger vocabulary. The green dot shows the discretized position.

The RT-2 trick: Discretize each action dimension into 256 bins, then map each bin to an existing text token (like the numbers 0-255). This lets a language model output actions without any architectural changes!

python
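# Action discretization (RT-2 style)
action_low, action_high = -1.0, 1.0  # normalized action space
n_bins = 256

def discretize(a):
    # Continuous [-1, 1] → nearest bin index [0, 255]
    a = max(action_low, min(action_high, a))   # clamp out-of-range values
    return int(round((a - action_low) / (action_high - action_low) * (n_bins - 1)))

def undiscretize(bin_idx):
    # Bin index [0, 255] → continuous [-1, 1]
    return bin_idx / (n_bins - 1) * (action_high - action_low) + action_low

# Example: dx=0.3 → 0.65 * 255 = 165.75 → bin 166 → token "166"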

Precision check: With 256 bins over range [-1, 1], the resolution per bin is 2/256 ≈ 0.0078. If one normalized unit corresponds to 1 meter (a robot arm with 1-meter reach in each direction), that's ~7.8mm per bin and at most ~3.9mm of rounding error. Good enough for picking up cups. Not good enough for threading a needle. Higher precision requires more bins (1024 bins → ~2mm), but each extra bit of precision doubles the vocabulary per dimension. This is why some VLAs (like π₀) use continuous action heads with flow matching instead of discretization: no quantization floor and no binning artifacts.
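
The resolution figures can be tabulated for any bin count. This sketch assumes the normalized range [-1, 1] corresponds to a 2 m physical span, which is the reading that reproduces the ~7.8mm figure. `bin_width_mm` is a hypothetical helper.

```python
def bin_width_mm(n_bins: int, span_m: float = 2.0) -> float:
    """Width of one bin in millimeters when span_m meters are split into n_bins."""
    return span_m / n_bins * 1000.0


for n_bins in (16, 256, 1024):
    width = bin_width_mm(n_bins)
    print(f"{n_bins:5d} bins: {width:7.2f} mm per bin, worst-case error {width / 2:6.2f} mm")
```

The worst-case error is half a bin width, because a value is never further than that from the nearest bin center.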

Check: Why do VLA models often discretize continuous actions?

(a) So they can be treated as tokens and predicted by a language model
(b) Because robots only accept discrete commands
(c) To reduce the training data needed

🔨 Derivation: Why Discretize? The Information-Theoretic Argument

A language model outputs a categorical distribution over a vocabulary of V tokens. To output a continuous 7-DOF action, you have two choices: (A) add a separate regression head (MLP → 7 floats), or (B) discretize each dimension into B bins and output 7 tokens from a vocabulary of B.

Your task: Show that approach (B) with B=256 bins loses at most 3.9mm of positioning precision for a 1m-reach robot, and explain why this is preferable to approach (A) even though it's "less precise."

If the normalized action range [-1, 1] maps to a physical range of [-1m, 1m] (1m reach in each direction, 2m total span), then each of 256 bins covers 2.0m / 256 ≈ 7.8mm. The maximum discretization error is half a bin width: ≈ 3.9mm.

A regression head outputs 7 independent floats using MSE loss. The problem: MSE loss cannot represent multimodal distributions. When there are two valid actions (reach left or right), regression averages them. A categorical distribution over 256 bins CAN represent multimodality — it can put probability mass on bin 50 AND bin 200 simultaneously.

Tokenized actions use cross-entropy loss — the same loss the model was pre-trained with. No architecture change, no new loss function, no new optimizer hyperparameters. The model treats action tokens exactly like word tokens: attend to context, predict next token. Pre-training knowledge transfers directly.

Precision analysis: With B=256 bins over range [-1, 1], bin width = 2/256 = 0.0078. For a 1m-reach arm, physical resolution = 0.0078 × 1.0m = 7.8mm per bin. Maximum error = half-bin = 3.9mm. This is well within manipulation tolerance (most pick-and-place tasks have ~1cm tolerance).

Why discretization wins over regression:

1. Multimodality: Cross-entropy over 256 bins can assign probability to multiple modes. MSE collapses them to the mean.

2. Architectural unity: Same transformer, same loss, same tokenizer pipeline. Zero engineering overhead to add a new modality.

3. Autoregressive conditioning: When outputting 7 action tokens sequentially, token 2 can attend to token 1. This captures correlations between action dimensions (e.g., if dx is large, gripper should stay open). Regression outputs all 7 dims independently.

4. Pre-training transfer: The model already knows how to predict discrete tokens from multimodal context. Adding action tokens to the vocabulary is a minimal perturbation.
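
Point 1 can be seen on a toy dataset. The sketch below uses made-up demonstrations, half in bin 50 and half in bin 200, and compares the best constant prediction under MSE with the best categorical distribution under cross-entropy. All values are illustrative.

```python
from collections import Counter

# Toy data: half the demos choose bin 50 ("reach left"), half choose bin 200 ("reach right").
demo_bins = [50] * 100 + [200] * 100

# The MSE-optimal constant prediction is the mean of the targets.
mse_prediction = sum(demo_bins) / len(demo_bins)

# The cross-entropy-optimal categorical prediction is the empirical frequency of each bin.
counts = Counter(demo_bins)
categorical = {b: c / len(demo_bins) for b, c in counts.items()}


def mse(pred, targets):
    """Mean squared error of a constant prediction against all targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


print("MSE prediction:", mse_prediction)   # a value no demonstrator ever chose
print("Categorical:", categorical)         # probability mass on both modes
```

The mean minimizes squared error even though it matches neither mode, while the categorical output keeps both modes and lets sampling commit to one of them.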

The key insight: 3.9mm of precision is the price. Multimodality, architectural simplicity, and pre-training transfer are the payoff. For manipulation (not surgery), this trade-off is overwhelmingly positive.

💻 Build It: Implement Action Tokenization & Detokenization

Implement the full action tokenization pipeline used in RT-2 and OpenVLA. Given a continuous action vector (7 floats in arbitrary robot-specific ranges), normalize to [-1, 1], discretize into B bins, and provide the inverse (detokenization) that recovers the continuous value from a bin index.

signature
def tokenize_action(action: list[float], action_low: list[float], action_high: list[float], n_bins: int = 256) -> list[int]:
    """Convert continuous action to bin indices.
    Args: action (7 floats), per-dim min/max, number of bins.
    Returns: list of 7 integers in [0, n_bins-1]."""

def detokenize_action(tokens: list[int], action_low: list[float], action_high: list[float], n_bins: int = 256) -> list[float]:
    """Convert bin indices back to continuous action.
    Returns: list of 7 floats in original robot-specific ranges."""

Test case

action = [0.1, -0.05, 0.3, 0.0, 0.0, 0.0, 1.0]
low = [-0.5, -0.5, -0.5, -3.14, -3.14, -3.14, 0.0]
high = [0.5, 0.5, 0.5, 3.14, 3.14, 3.14, 1.0]
tokens = tokenize_action(action, low, high)
# Expected: [153, 115, 204, 128, 128, 128, 255]
recovered = detokenize_action(tokens, low, high)
# Expected: close to original (within half a bin width per dimension)
for a, r, lo, hi in zip(action, recovered, low, high):
    half_bin = (hi - lo) / 255 / 2   # rotation dims span 6.28, so half a bin is ~0.0123
    assert abs(a - r) <= half_bin + 1e-9, f"Error too large: {abs(a - r)}"

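First normalize each dimension: norm = (action[i] - low[i]) / (high[i] - low[i]). This gives a value in [0, 1]. Clamp it to [0, 1] for safety, then scale to the nearest bin index: bin = int(round(norm * (n_bins - 1))). Rounding rather than truncating is what produces the expected tokens in the test case (for example 0.45 × 255 = 114.75 → 115). Detokenization inverts the mapping up to quantization error: norm = token / (n_bins - 1), then action = norm * (high - low) + low.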

python
def tokenize_action(action, action_low, action_high, n_bins=256):
    tokens = []
    for i in range(len(action)):
        # Normalize to [0, 1]
        norm = (action[i] - action_low[i]) / (action_high[i] - action_low[i])
        # Clamp to [0, 1] (safety for out-of-range actions)
        norm = max(0.0, min(1.0, norm))
        # Map to bin index [0, n_bins-1]
        token = int(round(norm * (n_bins - 1)))
        tokens.append(token)
    return tokens

def detokenize_action(tokens, action_low, action_high, n_bins=256):
    actions = []
    for i in range(len(tokens)):
        # Bin index to [0, 1]
        norm = tokens[i] / (n_bins - 1)
        # Denormalize to robot-specific range
        val = norm * (action_high[i] - action_low[i]) + action_low[i]
        actions.append(val)
    return actions

Bonus challenge: Extend this to handle action chunks: tokenize a [k, 7] trajectory into k*7 tokens. How would you handle the ordering? (Hint: a natural choice is time-major: all 7 dims of timestep 0, then all 7 of timestep 1, etc.)
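
One possible answer to the bonus challenge, as a sketch. It assumes time-major ordering and the same per-dimension normalization as the solution above. The helper names `tokenize_dim`, `tokenize_chunk`, and `detokenize_chunk` are hypothetical.

```python
def tokenize_dim(value, low, high, n_bins=256):
    """Map one continuous value in [low, high] to a bin index in [0, n_bins-1]."""
    norm = (value - low) / (high - low)
    norm = max(0.0, min(1.0, norm))          # clamp out-of-range values
    return int(round(norm * (n_bins - 1)))


def tokenize_chunk(chunk, action_low, action_high, n_bins=256):
    """Flatten a [k, D] chunk into k*D tokens in time-major order."""
    tokens = []
    for step in chunk:                        # timestep 0 first, then 1, ...
        for i, value in enumerate(step):      # all D dims of this timestep
            tokens.append(tokenize_dim(value, action_low[i], action_high[i], n_bins))
    return tokens


def detokenize_chunk(tokens, action_low, action_high, n_bins=256):
    """Inverse of tokenize_chunk: k*D tokens back to a [k, D] list of floats."""
    dims = len(action_low)
    if len(tokens) % dims != 0:
        raise ValueError("token count must be a multiple of the action dimension")
    chunk = []
    for start in range(0, len(tokens), dims):
        step = [
            tok / (n_bins - 1) * (action_high[i] - action_low[i]) + action_low[i]
            for i, tok in enumerate(tokens[start:start + dims])
        ]
        chunk.append(step)
    return chunk


low = [-0.5, -0.5, -0.5, -3.14, -3.14, -3.14, 0.0]
high = [0.5, 0.5, 0.5, 3.14, 3.14, 3.14, 1.0]
chunk = [[0.1, -0.05, 0.3, 0.0, 0.0, 0.0, 1.0],
         [0.2, 0.0, 0.25, 0.1, 0.0, 0.0, 0.0]]
tokens = tokenize_chunk(chunk, low, high)
recovered = detokenize_chunk(tokens, low, high)
print(len(tokens), tokens[:7])
```

Time-major ordering keeps all dimensions of one timestep adjacent, so an autoregressive decoder finishes a complete action before starting the next one.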

Checkpoint — Before you move on

Explain in your own words: why does discretizing continuous actions into 256 bins enable the use of pre-trained language models for robot control? What specific property of LM training makes this work?

Model Answer

Language models are trained with cross-entropy loss over a discrete vocabulary: given context, predict the probability distribution over the next token. This machinery — the softmax output layer, the cross-entropy gradient, the autoregressive generation loop — requires discrete outputs. By discretizing actions into 256 bins, each action dimension becomes "just another token" that the LM can predict using its existing training objective. No new loss function is needed. No architectural surgery. The key property is that cross-entropy over discrete bins can represent arbitrary distributions (including multimodal ones), while MSE regression over continuous outputs cannot. This is not just an engineering convenience — it's a fundamental representational advantage.

Chapter 3: RT-2 — Language as Action

RT-2 (Robotics Transformer 2) by Google DeepMind is the landmark paper that proved VLMs can directly control robots. The key insight: fine-tune a VLM so that instead of generating text, it generates action tokens.

RT-2 takes a PaLM-E or PaLI-X vision-language model, discretizes robot actions into 256 bins per dimension, maps them to string tokens ("128", "64", "255"), and co-fine-tunes on robot demonstrations alongside web-scale vision-language data. The model's output for one timestep is literally a text string like "1 128 91 241 5 127 100" — seven numbers that get de-tokenized into 7-DOF continuous actions. No special action head. No architectural change. Just clever tokenization.
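
Decoding such a string back into numbers takes only a few lines. The sketch below assumes each number is a bin index in [0, 255] that maps linearly onto [-1, 1]. The actual per-dimension ranges and the meaning of each position depend on the robot setup and are not reproduced here. `decode_action_string` is a hypothetical helper.

```python
def decode_action_string(text: str, n_bins: int = 256) -> list[float]:
    """Parse a space-separated string of bin indices into floats in [-1, 1]."""
    bins = [int(tok) for tok in text.split()]
    for b in bins:
        if not 0 <= b < n_bins:
            raise ValueError(f"bin index {b} outside [0, {n_bins - 1}]")
    # Linear map: bin 0 -> -1.0, bin n_bins-1 -> +1.0
    return [b / (n_bins - 1) * 2.0 - 1.0 for b in bins]


action = decode_action_string("1 128 91 241 5 127 100")
print([round(a, 3) for a in action])
```

The range check matters in practice: a language model can emit any token, so a decoder should reject strings that do not parse into valid bins rather than send garbage to the robot.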

Input: [Camera image] + "Pick up the bottle"
↓
VLM Backbone: PaLM-E (12B) or PaLI-X (55B)
↓
Output: "1 128 91 241 5 127 100" (7 action dims)

RT-2 Action Tokenization

See how a continuous 7-DOF action is tokenized into text. Each dimension maps to a bin number.

Why this works: VLMs pretrained on internet data already understand spatial concepts, object relationships, and goals. RT-2 showed that this knowledge transfers: "pick up the bottle near the apple" works even if the robot never practiced that exact arrangement.

The training data mix is crucial. RT-2 co-trains on web-scale vision-language data (image captioning, VQA) alongside ~130K robot episodes. The VLM data keeps the model's language understanding intact while the robot data teaches it to output actions. Without the VLM co-training, the model forgets how to understand language. Without the robot data, it can't produce valid actions. The balance matters: typically ~50% VLM data, ~50% robot data per batch.
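
A fixed mixing ratio can be enforced at the batch level. This is a minimal sketch with placeholder sample names. The function `mixed_batch` and its parameters are hypothetical, and real pipelines typically use weighted dataset samplers instead.

```python
import random


def mixed_batch(robot_data, web_data, batch_size, robot_fraction=0.5, seed=None):
    """Draw a batch with a fixed share of robot samples and the rest from web data."""
    rng = random.Random(seed)
    n_robot = round(batch_size * robot_fraction)
    batch = [("robot", rng.choice(robot_data)) for _ in range(n_robot)]
    batch += [("web", rng.choice(web_data)) for _ in range(batch_size - n_robot)]
    rng.shuffle(batch)          # avoid a fixed robot-then-web ordering inside the batch
    return batch


robot_data = [f"episode_{i}" for i in range(1000)]     # placeholder robot samples
web_data = [f"vqa_{i}" for i in range(1000)]           # placeholder web samples
batch = mixed_batch(robot_data, web_data, batch_size=32, seed=0)
n_robot = sum(1 for source, _ in batch if source == "robot")
print(n_robot, len(batch) - n_robot)
```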

Emergent capabilities: RT-2 can follow instructions involving concepts never seen in robot training data: "move the banana to the country that has this flag" (showing a Brazilian flag). The VLM backbone recognizes flags from internet pre-training; fine-tuning on robot data taught the same backbone general pick-and-place. Composing the two gives zero-shot reasoning about novel semantic concepts, grounded in physical action.

Check: How does RT-2 represent robot actions?

(a) As continuous vectors from a separate action head
(b) As image patches
(c) As discretized tokens in the language model's vocabulary

🔨 Derivation: Loss Weighting: Language Tokens vs Action Tokens

RT-2 co-trains on VLM data (image captioning, VQA) and robot data (observation → action tokens). The total loss is a weighted sum: L = λ_VLM · L_VLM + λ_action · L_action. Both are cross-entropy over the same vocabulary.

Your task: Explain why naive equal weighting (λ_VLM = λ_action = 1) destroys performance, and derive what ratio the field actually uses.

A VLM caption might be 50-200 tokens. An action output is exactly 7 tokens (one per dimension). If you weight per-sample equally, the gradient from one VLM sample is 50/7 ≈ 7x stronger than one action sample (more tokens = more gradient). The action signal gets drowned out.

If λ_action is too high, the model catastrophically forgets language understanding. It can output action tokens perfectly but no longer understands instructions like "pick up the red cup near the banana." The language grounding that makes VLAs powerful — semantic understanding from internet-scale pre-training — erodes.

RT-2 uses ~50/50 data mixing (half robot, half VLM) but applies the loss only to the output tokens appropriate to each sample type. For VLM data, loss is on text continuation. For robot data, loss is on the 7 action tokens only (not the instruction tokens). This way both losses contribute roughly equally to the gradient despite the token count mismatch.

The problem with equal weighting:

Per-token cross-entropy loss for a batch: L = (1/N) Σ -log p(t_i | context). If a VLM sample has 100 output tokens and a robot sample has 7 output tokens, equal mixing means VLM gradients are 14x larger per sample.

The RT-2 solution:

1. Mix batches 50/50 (half robot episodes, half VLM examples)

2. For robot samples: compute loss ONLY on the 7 action tokens (mask out the instruction tokens from loss computation)

3. For VLM samples: compute loss on text continuation tokens as normal

4. Normalize each loss by its token count before summing

This gives: L = (1/7) Σ_action CE + (1/N_text) Σ_text CE

Effectively: per-token loss is equal across modalities, but the 50/50 data ratio ensures roughly equal gradient magnitude.
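
The masking and per-sample normalization steps can be sketched with plain lists. The per-token loss values below are made up for illustration, and `masked_mean_nll` is a hypothetical helper. In a real training loop the same masking is applied to the tensor of per-token cross-entropy values.

```python
def masked_mean_nll(token_nlls, loss_mask):
    """Average negative log-likelihood over the tokens where loss_mask is 1."""
    if len(token_nlls) != len(loss_mask):
        raise ValueError("token_nlls and loss_mask must have the same length")
    kept = [nll for nll, keep in zip(token_nlls, loss_mask) if keep]
    if not kept:
        raise ValueError("loss_mask selects no tokens")
    return sum(kept) / len(kept)


# Robot sample: 15 instruction tokens (masked out) followed by 7 action tokens.
robot_nlls = [2.0] * 15 + [1.4] * 7
robot_mask = [0] * 15 + [1] * 7

# VLM sample: 10 prompt tokens (masked out) followed by 100 text-continuation tokens.
vlm_nlls = [2.0] * 10 + [1.4] * 100
vlm_mask = [0] * 10 + [1] * 100

loss_action = masked_mean_nll(robot_nlls, robot_mask)
loss_text = masked_mean_nll(vlm_nlls, vlm_mask)
total = loss_action + loss_text

# Without per-sample normalization the summed losses differ by the token-count ratio.
unnormalized_ratio = sum(m * n for m, n in zip(vlm_mask, vlm_nlls)) / sum(
    m * n for m, n in zip(robot_mask, robot_nlls))
print(loss_action, loss_text, round(unnormalized_ratio, 1))
```

After normalization both samples contribute the same loss when their per-token quality is the same, even though the text sample has about 14 times as many supervised tokens.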

The key insight: You're not weighting losses — you're weighting data. The 50/50 mix ratio IS the loss balance. This is why OpenVLA and subsequent work report data ratios, not loss weights.

Chapter 4: Diffusion Policies

What if instead of predicting a single action, the robot could sample from a distribution over actions? Diffusion Policy applies the same denoising process used in image generation to robot action prediction.

Starting from random noise, the model iteratively refines an action trajectory. This naturally handles multimodal action distributions — when there are multiple valid ways to do something (reach from the left or right), the diffusion model can represent all of them.

a₀ ~ N(0, I) → a₁ → ... → a_T = denoised action

Concretely, the input is noise of shape [k, action_dim], say [50, 7] for 50 future timesteps of 7-DOF actions. The model runs 10-20 denoising steps. Each step predicts a velocity field and applies an Euler update: a_{t+1} = a_t + v(a_t, t) · dt. At ~2ms per step × 10 steps, the total denoising takes ~20ms, fast enough for real-time control.

Flow matching vs DDPM: π₀ (Physical Intelligence) uses flow matching instead of DDPM-style diffusion. Flow matching learns a straight-line velocity field from noise to data, requiring fewer steps (10 vs 50+). Each step is simpler: predict velocity v, then apply the Euler update a_{t+1} = a_t + v · dt. Same multimodal benefits, 5× faster inference.

python
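# Flow matching action denoising (simplified)
# Assumes model, obs, and instruction are defined elsewhere
import torch

k, action_dim, n_steps = 50, 7, 10
dt = 1.0 / n_steps
a = torch.randn(k, action_dim)     # [50, 7] random noise
for i in range(n_steps):
    t = torch.tensor(i * dt)       # t = 0.0, 0.1, ..., 0.9 (step starts, not endpoints)
    v = model(a, t, obs, instruction)  # predict velocity
    a = a + v * dt                 # Euler step, ~2ms
# a is now [50, 7] clean action trajectory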
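
To see the Euler loop work without a trained network, the sketch below replaces the model with an oracle velocity that points along the straight line from the current sample to a known target action. This oracle is an assumption made purely for illustration: a real policy has to learn the velocity field from data.

```python
import random


def oracle_velocity(a, t, target):
    """Velocity of the straight-line path from the current point to a known target.

    Only available in this toy setting; a trained model would predict it instead.
    """
    return [(x1 - x) / (1.0 - t) for x, x1 in zip(a, target)]


def euler_denoise(noise, target, n_steps=10):
    """Integrate da/dt = v(a, t) from t=0 to t=1 with fixed-size Euler steps."""
    dt = 1.0 / n_steps
    a = list(noise)
    for i in range(n_steps):
        t = i * dt                               # 0.0, 0.1, ..., 0.9 (never reaches 1.0)
        v = oracle_velocity(a, t, target)
        a = [x + vx * dt for x, vx in zip(a, v)]
    return a


rng = random.Random(0)
target = [0.01, -0.005, 0.02, 0.0, 0.0, 0.0, 1.0]       # one 7-dim action
noise = [rng.gauss(0.0, 1.0) for _ in target]
denoised = euler_denoise(noise, target)
print(max(abs(d - x1) for d, x1 in zip(denoised, target)))
```

Evaluating the velocity at the start of each step (t = 0.0 to 0.9) is what keeps the division by 1 − t well defined. Stepping through t = 1.0 would divide by zero.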

Diffusion Denoising for Actions

Watch random noise get denoised into a clean action trajectory. Each step removes noise. The green path is the final clean trajectory.

Why not just regression? Imagine a robot reaching for a cup. Half the demos go around the left side of an obstacle, half go right. An MSE-trained policy averages them: it predicts a straight-through-the-obstacle trajectory. Diffusion doesn't average — it samples. Each denoising run commits to one mode (left or right), producing a valid trajectory every time. This is why flow matching matters: it captures multimodal action distributions.

Check: What problem does Diffusion Policy solve that regression can't?

(a) Multimodal action distributions (multiple valid solutions)
(b) Faster inference
(c) Less training data needed

🔗 Pattern Recognition

Three Roads to Robot Actions: Discrete Tokens vs Diffusion vs RL

Discrete Tokenization (RT-2)

Actions as categorical tokens. Cross-entropy loss. Autoregressive generation. Multimodal via mixture of bins.

Diffusion Policy (π₀)

Actions as continuous vectors. Flow matching loss. Iterative denoising. Multimodal via sampling.

Both solve the same problem (multimodal action distributions) with different tools. Tokenization reuses the LM's existing machinery; diffusion adds a new generation mechanism but keeps full continuous precision. The third road, RL fine-tuning, goes further: instead of imitating demonstrations, the robot optimizes a reward signal online. Current VLAs use offline IL (behavioral cloning + architecture tricks); the field is moving toward hybrid IL+RL where a VLA pre-trained on demos is then refined with online reward.

The same "discrete vs continuous" tradeoff appears in image generation (VQ-VAE tokens vs diffusion pixels). Why did images converge on diffusion while VLAs still use both approaches?

Chapter 5: Action Chunking

Predicting one action at a time is reactive and jerky. Action Chunking (from ACT — Action Chunking with Transformers) predicts an entire sequence of future actions at once. Instead of "what should I do now?", the model answers "what should I do for the next k steps?"

The model output shape is [k, action_dim] — for example, [50, 7] means 50 future actions at 50Hz, covering exactly 1 second of motion. The robot executes these actions open-loop (without re-reading the camera), then re-plans from the new observation.

Why chunk instead of predicting one step at a time? Two big reasons:

Temporal consistency: A single prediction covers a smooth trajectory. Single-step predictions can oscillate between competing modes at every timestep, causing jitter.

Compute amortization: One VLM forward pass (~50ms) produces 1 second of actions. That's 50 actions from 1 inference call, vs 50 inference calls for single-step. The robot acts at 50Hz while the model only thinks at 1Hz.

Single-Step vs Chunked Actions

Compare single-step (reactive) control with chunked (planned) control. Notice how chunking is smoother.

Temporal ensemble: ACT averages overlapping chunks with exponential weights to further smooth the executed trajectory. The robot doesn't wait for a full chunk to finish before replanning: it queries the policy again before the current chunk is exhausted (at every timestep in the ACT paper), creating overlapping predictions. At timestep t, multiple chunks have opinions about what to do: the chunk planned 3 steps ago, the chunk from 6 steps ago, etc. ACT blends them with weights w_i = exp(−m·i), where w_0 belongs to the oldest prediction and m sets how quickly new observations are incorporated. This removes jitter at chunk boundaries.

python
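# Action chunking with temporal ensemble (simplified, two overlapping chunks)
# prev_chunk was predicted replan_interval steps earlier, so with
# offset = replan_interval, prev_chunk[i + offset] targets the same timestep as chunk[i]
chunk = model(observation)           # [k, 7] = 50 future actions
for i in range(replan_interval):     # execute 5-10 steps
    a_blend = w_new * chunk[i] + w_old * prev_chunk[i + offset]  # w_new + w_old = 1
    robot.execute(a_blend)           # 50Hz motor commands
prev_chunk = chunk                   # keep for blending with the next plan
# Then replan from new observation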
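
A fuller version of the ensemble keeps every prediction that targets the current timestep and averages them. The sketch below assumes the exponential scheme w_i = exp(−m·i) with i = 0 for the oldest prediction, and assumes a new chunk is predicted at every timestep. The value of m, the 1-dim actions, and the helper names are illustrative assumptions.

```python
import math


def temporal_ensemble(predictions, m=0.1):
    """Blend several predictions for the same timestep.

    predictions: list of action vectors ordered oldest first.
    Weights follow w_i = exp(-m * i) with i = 0 for the oldest prediction.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dims = len(predictions[0])
    return [
        sum(w * p[d] for w, p in zip(weights, predictions)) / total
        for d in range(dims)
    ]


def run_with_ensemble(chunks, m=0.1):
    """chunks[s] is the chunk predicted at timestep s; chunks[s][j] targets timestep s + j."""
    executed = []
    for t in range(len(chunks)):
        # Collect every prediction that targets timestep t, oldest chunk first.
        candidates = [chunks[s][t - s] for s in range(t + 1) if t - s < len(chunks[s])]
        executed.append(temporal_ensemble(candidates, m))
    return executed


# Three chunks of length 3 over 1-dim actions; later chunks disagree slightly.
chunks = [[[0.0], [1.0], [2.0]],
          [[1.2], [2.2], [3.2]],
          [[2.1], [3.1], [4.1]]]
executed = run_with_ensemble(chunks)
print([round(a[0], 3) for a in executed])
```

A smaller m flattens the weights toward a plain average. Reversing the order of `predictions` would make the newest chunk dominate instead, which reacts faster but smooths less.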

The chunk size k is a critical hyperparameter. Too small (k=1) and you're back to single-step prediction with all its jitter. Too large (k=200) and the robot is flying blind for 4 seconds before it re-observes the world — if something changes (a human moves an object), the robot won't notice. In practice, k=10-50 at 50Hz (0.2-1.0 seconds) works well for manipulation. Fast tasks (catching) need smaller chunks; slow tasks (assembly) tolerate larger ones.
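
The trade-off is easy to tabulate. This sketch assumes the full chunk is executed before replanning (no temporal ensemble) and uses the 50Hz control rate from the text. `chunk_timing` is a hypothetical helper.

```python
def chunk_timing(k: int, control_hz: float):
    """Return (open-loop window in seconds, policy calls per second) for chunk size k.

    Assumes the full chunk is executed before replanning.
    """
    window_s = k / control_hz
    calls_per_s = control_hz / k
    return window_s, calls_per_s


for k in (1, 10, 50, 200):
    window_s, calls_per_s = chunk_timing(k, control_hz=50.0)
    print(f"k={k:3d}: blind for {window_s:4.2f}s, {calls_per_s:5.2f} policy calls/s")
```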

Check: What does action chunking predict?

(a) A single next action
(b) The reward for the current state
(c) A sequence of k future actions at once

🔨 Derivation: Why Chunking Reduces Compounding Errors

Single-step BC makes T predictions for a T-step task, each conditioned on the (potentially drifted) current state. Action chunking with chunk size k makes T/k predictions, each producing k actions executed open-loop.

Your task: Model the compounding error for both approaches. Assume each prediction has independent error ε per step. Show that chunking reduces the number of error-injection points from T to T/k, and explain the tradeoff.

In single-step BC, each prediction is made from a (possibly corrupted) state. If the per-step error is ε and errors simply add up, after T steps the total drift is proportional to T · ε. But it's worse than linear: each error shifts the state, causing the next prediction to be made from an OOD state, amplifying error. With a drift amplification factor α, the drift grows like ε · ((1 + α)^T − 1) / α, which is exponential in T.

With chunk size k, the model makes T/k prediction calls. Within each chunk, actions are executed open-loop (no re-observation), so intra-chunk errors don't compound through the policy. Only at chunk boundaries does the model re-observe and potentially amplify drift. So the "compounding opportunities" drop from T to T/k.

Within a chunk, the robot is blind — if the environment changes (human moves an object, unexpected collision), the robot won't react for up to k/Hz seconds. Larger k = fewer compounding opportunities but slower reaction. The sweet spot depends on task dynamics: fast-changing environments need small k, static scenes tolerate large k.

Single-step compounding model:

Let s_{t+1} = f(s_t, π(s_t)) where π has per-step error ε. At each step, the policy sees a state s_t that may be δ_t away from the expert's state. If the policy's error grows with state deviation: error(s_t) ≈ ε + αδ_t. This gives δ_{t+1} = δ_t + ε + αδ_t = (1+α)δ_t + ε.

Solving: δ_T = ε · ((1+α)^T - 1) / α. For α > 0, this is exponential in T.

Chunked compounding model:

With chunk size k, the model makes predictions at steps 0, k, 2k, ..., T-k. Only at these T/k points does the feedback loop re-engage. Within each chunk, errors are additive (open-loop): intra-chunk drift = k · ε. Between chunks, the compounding formula applies but with only T/k steps:

δ_T = kε · ((1+α)^(T/k) − 1) / α

For T=100, k=10, α=0.05: Single-step: δ = ε · (1.05¹⁰⁰ − 1) / 0.05 ≈ 2610ε. Chunked: δ = 10ε · (1.05¹⁰ − 1) / 0.05 ≈ 10ε · 12.6 ≈ 126ε. Roughly a 20x reduction in compounding under this simplified model.
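
The two closed forms can be evaluated directly, and the single-step one can be cross-checked by iterating the recurrence δ ← (1+α)δ + ε. This is a sketch of the simplified model in this derivation only, and the function names are hypothetical.

```python
def drift_single_step(eps, alpha, T):
    """Closed form of delta_{t+1} = (1 + alpha) * delta_t + eps with delta_0 = 0."""
    return eps * ((1 + alpha) ** T - 1) / alpha


def drift_chunked(eps, alpha, T, k):
    """Same recurrence applied at T/k chunk boundaries with k*eps injected per chunk."""
    return k * eps * ((1 + alpha) ** (T / k) - 1) / alpha


def drift_by_iteration(eps, alpha, T):
    """Iterate the recurrence directly as a cross-check on the closed form."""
    delta = 0.0
    for _ in range(T):
        delta = (1 + alpha) * delta + eps
    return delta


single = drift_single_step(1.0, 0.05, 100)
chunked = drift_chunked(1.0, 0.05, 100, 10)
print(f"single-step: {single:.1f} eps, chunked: {chunked:.1f} eps, ratio: {single / chunked:.1f}x")
```

The ratio depends strongly on α: as α approaches 0 both expressions tend to T · ε and the advantage of chunking under this model disappears.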

The key insight: Chunking doesn't eliminate per-step error — it eliminates the feedback loop that amplifies it. Fewer observations = fewer chances for the policy to see OOD states and produce amplified errors. The cost: reduced reactivity during each chunk window.

💥 Break-It Lab: What Dies When You Remove VLA Components?

A complete VLA has three critical components beyond the base model: action chunking (smooth trajectories), proprioception input (joint state feedback), and language conditioning (task specification). Each serves a distinct purpose. Toggle them off to see what breaks.

Remove Action Chunking

Failure mode: Without chunking, the policy predicts one action per timestep. At each step it re-observes and re-decides, creating temporal jitter — the robot oscillates between competing strategies frame-to-frame. A reaching motion becomes a series of jerky micro-corrections instead of a smooth arc. For manipulation, this causes dropped objects (the gripper oscillates between open/close near the grasp point).

Remove Proprioception Input

Failure mode: Without joint state feedback, the robot doesn't know where its arm currently is. It must infer arm position purely from the camera image — but the arm is often out of frame or self-occluded. The result: spatial unawareness. The robot overshoots targets (it can't tell when it's arrived), collides with objects it can't see behind its own gripper, and fails at precision tasks where sub-centimeter accuracy requires knowing the current end-effector pose.

Remove Language Conditioning

Failure mode: Without a language instruction, the model has no task specification. Given an image of a table with multiple objects, it doesn't know which object to pick up or where to place it. The policy collapses to the average behavior across all tasks in the training data — a generic "reach toward the center of the scene" motion that accomplishes nothing specific. This demonstrates that language is not decoration; it's the steering signal that selects among the model's learned behaviors.

Checkpoint — Before you move on

You're designing a VLA for a task that involves both fast reactive motions (catching a thrown object) and slow precise motions (threading a needle). How would you handle action chunking? Can you use a single fixed chunk size, or do you need something more sophisticated?

Model Answer

A single fixed chunk size cannot serve both tasks. At 50Hz, catching requires k=1-3 (react within 20-60ms) while threading can use k=20-50 (smooth, deliberate motion over 0.4-1s). Solutions: (1) Adaptive chunking: predict a confidence/uncertainty score alongside the action chunk; when confidence is low (novel situation, fast dynamics), shorten the chunk and replan sooner. (2) Hierarchical control: a high-level planner decides chunk size per phase, while the action model generates chunks of that size. (3) Variable execution horizon: always predict a long chunk (k=50), but only execute the first few steps before replanning. The ratio of executed/predicted steps can vary by task phase. This is what ACT's temporal ensemble does implicitly: it replans well before a chunk is exhausted, regardless of chunk size, using overlapping predictions to maintain smoothness at boundaries. The key insight: chunk size should be a function of environmental dynamics, not a fixed hyperparameter.

Chapter 6: OpenVLA Architecture

OpenVLA is the open-source counterpart to RT-2: a 7B-parameter VLA built on Llama 2 with a fused SigLIP + DINOv2 vision encoder, trained on the Open X-Embodiment dataset. The walkthrough below shows only the SigLIP path for simplicity. OpenVLA demonstrated that smaller, open models can rival proprietary giant VLAs.

The data flow in concrete shapes: a 224×224 image enters SigLIP, which splits it into a 16×16 grid of 14-pixel patches, giving 256 visual tokens; an MLP then projects each token to dimension 4096 to match the LLM. The text instruction is tokenized into ~20-30 tokens. Llama 2 reads all ~280 prompt tokens and then generates 7 action tokens autoregressively, one per action dimension. Each action token is one of 256 possible values (bins), which is mapped linearly back to a continuous float in [-1, 1] and then scaled to the robot's action range.

python
# OpenVLA inference (simplified pseudocode; every name is a placeholder, not a real API)
image = camera.capture()                          # [3, 224, 224]
patch_features = siglip(image)                    # [256, 1024] one feature vector per patch
visual_tokens = mlp_proj(patch_features)          # [256, 4096] (match LLM dim)
text_ids = tokenizer("pick up the cup")           # [22] integer token ids
text_tokens = llama2.embed(text_ids)              # [22, 4096] embedding lookup
input_seq = concat(visual_tokens, text_tokens)    # [278, 4096]
action_tokens = llama2.generate(input_seq, n=7)   # 7 ints in [0, 255]
action = undiscretize(action_tokens)              # [7] floats in [-1, 1]
robot.execute(action * action_scale)              # send to robot
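
The de-tokenization step is easy to get subtly wrong, so here is a minimal sketch of the 256-bin round trip. It assumes uniform bins over [-1, 1] decoded to bin centers; the released OpenVLA code may place its bin edges differently, and the function names here are illustrative only.

```python
N_BINS = 256


def discretize(value: float, n_bins: int = N_BINS) -> int:
    """Map a normalized action in [-1, 1] to a bin index in [0, n_bins - 1]."""
    clipped = max(-1.0, min(1.0, value))
    index = int((clipped + 1.0) / 2.0 * n_bins)
    return min(index, n_bins - 1)  # value == 1.0 would otherwise land in bin n_bins


def undiscretize(index: int, n_bins: int = N_BINS) -> float:
    """Map a bin index back to the center of its bin in [-1, 1]."""
    return (index + 0.5) / n_bins * 2.0 - 1.0


def to_robot_units(normalized: float, low: float, high: float) -> float:
    """Scale a normalized action from [-1, 1] to the robot's range [low, high]."""
    return low + (normalized + 1.0) / 2.0 * (high - low)


# A 7-dim normalized action: 6-DoF delta pose + gripper (values are made up).
normalized_action = [0.10, -0.25, 0.0, 0.0, 0.0, 0.5, 1.0]
tokens = [discretize(v) for v in normalized_action]
decoded = [undiscretize(t) for t in tokens]
max_error = max(abs(a - b) for a, b in zip(normalized_action, decoded))
print(tokens, max_error)
```

With uniform bins the quantization error is bounded by half a bin width, i.e. 1/256 in normalized units, which is the precision price paid for treating actions as tokens.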

SigLIP Vision Encoder

224×224 image → vision tokens

↓

Projection MLP

Align vision features to LLM space

↓

Llama 2 (7B)

[vision tokens] + "pick up the cup" → action tokens

↓

De-tokenize

256-bin tokens → 7-DOF continuous action

OpenVLA Token Flow

Trace the path from image + instruction to robot action. Each color represents a different processing stage.

Training data: OpenVLA was trained on 970K episodes from Open X-Embodiment spanning 22 robot types and hundreds of tasks. This diversity is the main source of its generalization: the model has seen enough variety to cope with many situations that differ from any single training episode.

The backbone problem: Pre-trained backbones (VLMs built on LLMs such as Llama, PaLM, or Qwen) know language and vision but NOT motor control. How do you add action capability without destroying existing knowledge? Three strategies exist:

Strategy	How	Pro	Con
Full fine-tune	Update all VLM weights on robot data	Maximum expressivity	Expensive, catastrophic forgetting of language/vision
Frozen VLM + action head	Freeze VLM, add small MLP for actions	Cheap, preserves VLM	Limited — VLM features may not capture motor-relevant info
Mixture-of-Transformers	Separate expert params for vision/language vs action tokens	Best of both — preserves VLM, expressive for actions	Complex architecture (π₀'s approach)

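OpenVLA takes the full fine-tune path: all weights, including the vision encoder, are updated on robot data. Its authors also show that LoRA-style parameter-efficient tuning is enough to adapt the trained model to a new robot or task at a fraction of the cost. π₀ (Physical Intelligence) takes the third path: action tokens are routed to a separate action expert with its own parameters, while vision and language tokens continue to use the pre-trained VLM weights. The sketch below simplifies this to one feed-forward expert per modality.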

python
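# Mixture-of-Transformers layer (pi0-style, simplified; residuals and norms omitted)
import torch
import torch.nn as nn
VISION, TEXT, ACTION = 0, 1, 2  # integer modality ids (a tensor can't be compared to a string)

class MoTLayer(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One feed-forward expert per modality, indexed by modality id
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(3)])

    def forward(self, x, modality_ids):  # x: [B, T, dim], modality_ids: [B, T] ints
        # Shared attention across all tokens
        x = self.shared_attn(x, x, x, need_weights=False)[0]
        # Separate FFN per modality, routed with a boolean mask
        out = torch.zeros_like(x)
        for modality, ffn in enumerate(self.ffns):
            mask = modality_ids == modality
            out[mask] = ffn(x[mask])
        return out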

Check: What is OpenVLA's language model backbone?

a) GPT-4
b) Llama 2 (7B)
c) PaLM-E (55B)

Chapter 7: Cross-Embodiment

A human can watch someone else cook and learn the recipe, even though their arms are different lengths. Can robots do the same? Cross-embodiment learning trains on data from many different robot types, hoping that high-level task knowledge transfers even when the hardware differs.

The Open X-Embodiment (OXE) dataset combined demonstrations from 22+ robot embodiments: single-arm manipulators, dual-arms, quadrupeds, dexterous hands. The finding: a single policy trained on all this data outperforms robot-specific policies on most tasks.

The challenge: different robots have different action spaces. A Franka Panda has 7 joints. A WidowX has 6. A quadruped has 12 (3 per leg). How does a single model handle this? For manipulator arms, the common approach is to normalize to end-effector deltas: whatever the joint layout, the model predicts Δxyz + Δrotation + gripper, and the robot's own IK solver converts this to joint commands. The VLA therefore never needs to know joint configurations; it thinks in task space.

Cross-Embodiment Transfer

Different robots contribute demonstrations to a shared policy. Toggle embodiments on/off.

The positive transfer hypothesis: Even though a WidowX arm and a Franka Panda have different kinematics, they share the same visual semantics ("what is a cup?") and task structures ("pick, then place"). It's this shared structure that enables transfer. The OXE experiments reported that policies trained on the pooled multi-robot data often outperform policies trained on a single robot's data, even when tested on that robot, with the clearest gains on robots that have small datasets of their own. The diverse data acts as regularization: it forces the model to learn task-general features rather than robot-specific shortcuts.

python
# Cross-embodiment action normalization
import numpy as np

def normalize_action(action, embodiment_config):
    """Map a raw action to [-1, 1] using the robot's own action range."""
    # Different robots have different action ranges
    # Franka: dx in [-0.05, 0.05] m, WidowX: dx in [-0.03, 0.03] m
    low, high = (np.asarray(b, dtype=float) for b in embodiment_config['action_range'])
    return (np.asarray(action, dtype=float) - low) / (high - low) * 2 - 1  # map to [-1, 1]

# During training: normalize all actions to [-1, 1]
# During inference: denormalize back to robot-specific range
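
The inference-time half of that recipe is the inverse mapping. The sketch below is a minimal illustration for a single action dimension; the `ACTION_RANGES` table and the function names are hypothetical, reusing the two x-delta ranges quoted in the comment above.

```python
from typing import Dict, Tuple

# Hypothetical per-embodiment ranges for the x-translation delta, in metres.
ACTION_RANGES: Dict[str, Tuple[float, float]] = {
    "franka": (-0.05, 0.05),
    "widowx": (-0.03, 0.03),
}


def normalize(value: float, embodiment: str) -> float:
    """Robot units -> [-1, 1] (used when building training targets)."""
    low, high = ACTION_RANGES[embodiment]
    return (value - low) / (high - low) * 2.0 - 1.0


def denormalize(value: float, embodiment: str) -> float:
    """[-1, 1] -> robot units (used on the model output at inference time)."""
    low, high = ACTION_RANGES[embodiment]
    value = max(-1.0, min(1.0, value))  # clip so the command stays inside the range
    return low + (value + 1.0) / 2.0 * (high - low)


# The same normalized model output means a different physical step on each robot.
model_output = 0.5
franka_dx = denormalize(model_output, "franka")
widowx_dx = denormalize(model_output, "widowx")
print(franka_dx, widowx_dx)
```

The clipping step is a design choice rather than a requirement: it trades a small loss of fidelity on out-of-range predictions for a guarantee that commands stay within the configured limits.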

Check: Why does cross-embodiment training help?

a) Robots share high-level task understanding even with different hardware
b) All robots have the same motors
c) It reduces the model size

🏗 Design Challenge (You're the Architect): Bimanual VLA for Kitchen Tasks

You're designing a VLA for a bimanual robot that must perform kitchen tasks like "unscrew the jar lid and pour the contents into the bowl." The robot has two 7-DOF arms with parallel-jaw grippers and two stereo camera pairs (2 viewpoints, 4 images), and it receives natural language commands. You need to hit a 30Hz control rate for smooth bimanual coordination.

Constraint	Value
Robot	2 × 7-DOF arms + 2 grippers = 16-dim action space
Control rate	30Hz (33ms per action)
Cameras	2 × stereo (4 images total, 224×224 each)
Language	Natural language commands, variable length
Latency budget	VLM forward pass ~50ms (exceeds 33ms per-step!)
GPU	Single A100 (80GB) for inference

1. Action space: How do you tokenize 16 dimensions? Independent bins per dim (16 tokens × 256 bins)? Or group left/right arms separately?

2. Latency: The VLM takes 50ms but you need 30Hz (33ms). How do you reconcile this? (Hint: chunking isn't just for smoothness.)

3. Coordination: The two arms must coordinate (one holds the jar while the other unscrews). How do you ensure the action output captures bimanual dependencies?

4. Visual input: 4 images × 196 patches = 784 visual tokens. That's expensive. How do you reduce the visual token count without losing spatial information needed for bimanual reaching?

Real-world solution (based on ALOHA 2 + π₀ architecture):

1. Action space: Predict 16-dim action chunks using flow matching (continuous, not tokenized). All 16 dims are predicted jointly so left/right arm coordination is implicit in the denoising process. Shape: [chunk_k, 16] — typically k=50 giving 50/30Hz = 1.67 seconds of coordinated bimanual motion per inference call.

2. Latency: Action chunking solves this directly. One forward pass takes ~50ms but produces k=50 actions, so the amortized compute is ~1ms per action, far below the 33ms per-step budget. At 30Hz the chunk takes 1.67 seconds to execute, which leaves ample time to compute the next chunk while the robot is still executing the current one (pipelining).

3. Coordination: Flow matching over the joint [left_arm, right_arm] space naturally captures correlations. During denoising, the velocity field v(a_t, t) for the left arm's x-position depends on where the right arm is going. Joint training on bimanual demos teaches these coordination patterns implicitly.

4. Visual compression: Use a perceiver/resampler (like Q-Former) to compress 784 visual tokens down to 64-128 learned queries. Alternatively, process each camera independently and concatenate reduced representations. Note that π₀'s Mixture-of-Transformers design (shared attention, separate FFNs per modality) does not shorten the sequence by itself, so it complements token reduction rather than replacing it.
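
The latency argument in point 2 is plain arithmetic, and writing it out makes the pipelining constraint explicit. This is a back-of-the-envelope sketch using the numbers from the challenge; `chunk_timing` is a hypothetical helper, not part of any framework.

```python
import math


def chunk_timing(inference_ms: float, control_hz: float, chunk_size: int) -> dict:
    """Timing budget when one inference call yields `chunk_size` actions."""
    step_ms = 1000.0 / control_hz
    # Control steps that elapse while the next chunk is being computed.
    steps_during_inference = math.ceil(inference_ms / step_ms)
    return {
        "step_ms": step_ms,
        "chunk_duration_ms": chunk_size * step_ms,
        "compute_per_action_ms": inference_ms / chunk_size,
        "steps_during_inference": steps_during_inference,
        # Latest step of the current chunk at which the next call must start
        # so that the robot never runs out of actions.
        "latest_start_step": chunk_size - steps_during_inference,
        # Smallest chunk that still covers the inference time.
        "min_chunk_size": steps_during_inference,
    }


timing = chunk_timing(inference_ms=50.0, control_hz=30.0, chunk_size=50)
for name, value in timing.items():
    print(f"{name}: {value}")
```

Under these numbers even a chunk of 2 actions would cover the 50ms forward pass; the long k=50 chunk is about smooth coordinated motion, not about meeting the deadline.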

⚔ Adversarial: Your cross-embodiment VLA was trained on data from 20 robot types. When deployed on a new 6-DOF arm (not in training), it works for pick-and-place but fails catastrophically on insertion tasks (peg-in-hole). The action normalization is correct. What's the real issue?

The new robot's 6-DOF arm has a different kinematic structure than any training robot — its wrist has limited rotation range (±90 degrees vs ±180 for training robots). The VLA predicts end-effector delta actions that get converted to joint commands via IK.

a) The vision encoder hasn't seen this robot type
b) The language model doesn't understand "insertion"
c) The VLA predicts rotations the robot physically can't achieve, and the IK solver fails or clips them, producing incorrect orientations for precise insertion
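
To see how joint limits alone can break a correctly normalized action, consider the wrist limits from the scenario above. The sketch assumes, as a deliberate simplification, that the commanded end-effector roll maps directly onto a single wrist joint; the 135° command and the function names are hypothetical.

```python
def clip_to_joint_limit(angle_deg: float, limit_deg: float) -> float:
    """Clamp a commanded joint angle to a symmetric limit of +/- limit_deg."""
    return max(-limit_deg, min(limit_deg, angle_deg))


def orientation_error_deg(commanded_deg: float, limit_deg: float) -> float:
    """Angle lost when the controller clips the command to the joint limit."""
    return abs(commanded_deg - clip_to_joint_limit(commanded_deg, limit_deg))


commanded_wrist_deg = 135.0  # a rotation the training robots could reach

error_training_robot = orientation_error_deg(commanded_wrist_deg, limit_deg=180.0)
error_new_robot = orientation_error_deg(commanded_wrist_deg, limit_deg=90.0)
print(error_training_robot, error_new_robot)
```

An orientation error of this size may be tolerable for pick-and-place, where the gripper only has to close around an object, but it is far outside the tolerance of a peg-in-hole insertion.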

Chapter 8: Sim-to-Real Transfer

Real robot data is expensive and slow to collect. A single human demonstrator might produce 10 episodes per hour — at that rate, 100K episodes takes 10,000 hours of human labor. Simulation is cheap and infinitely scalable: a GPU cluster can generate 100K episodes in hours. But policies trained in sim often fail in the real world — the reality gap. Visual differences (lighting, textures), physics mismatches (friction, contact), and sensor noise all contribute.

The Reality Gap

See how domain randomization helps bridge sim and real. Each refresh randomizes the simulated environment.

Technique	How	What it randomizes
Visual DR	Change appearance every episode	Textures, colors, lighting direction, camera position (±5cm), shadows
Physics DR	Change dynamics every episode	Friction (0.3-1.0), mass (±20%), motor delay (0-50ms), gravity (±2%)
System ID	Calibrate sim to match real	Measured real-world physics parameters
Sim + Real co-train	Mix sim and real data in training	Nothing — real data fills remaining gap
Foundation Model encoder	Use VLM vision backbone	Nothing — pre-trained features bridge visual gap
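
The physics and camera rows of the table translate almost directly into a per-episode sampler. The sketch below is one possible reading: uniform distributions, a nominal gravity of 9.81 m/s², and the `EpisodeParams` fields are all assumptions made for illustration, and a real simulator would expose its own parameter names.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class EpisodeParams:
    friction: float                              # coefficient, 0.3 to 1.0
    mass_scale: float                            # multiplier, +/-20%
    motor_delay_ms: float                        # 0 to 50 ms
    gravity: float                               # m/s^2, nominal +/-2%
    camera_offset_m: Tuple[float, float, float]  # +/-5 cm per axis


def sample_episode_params(rng: random.Random) -> EpisodeParams:
    """Draw one fresh set of simulation parameters for a new episode."""
    return EpisodeParams(
        friction=rng.uniform(0.3, 1.0),
        mass_scale=rng.uniform(0.8, 1.2),
        motor_delay_ms=rng.uniform(0.0, 50.0),
        gravity=9.81 * rng.uniform(0.98, 1.02),
        camera_offset_m=(
            rng.uniform(-0.05, 0.05),
            rng.uniform(-0.05, 0.05),
            rng.uniform(-0.05, 0.05),
        ),
    )


rng = random.Random(0)  # seeded so a run can be reproduced
episodes = [sample_episode_params(rng) for _ in range(1000)]
print(episodes[0])
```

Sampling once per episode, rather than once per training run, is what forces the policy to succeed across the whole range instead of overfitting to one simulated world.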

VLAs as a bridge: VLMs pretrained on internet images already understand real-world visual features. When a VLA uses a pretrained vision encoder (SigLIP trained on 10B+ image-text pairs), the reality gap narrows dramatically — the encoder has already seen millions of real kitchens, tables, and cups. The visual features it extracts from a simulated scene are close to what it would extract from the real scene, because it learned to represent semantics, not pixel statistics.

This is a fundamental shift from pre-VLA sim-to-real. Traditional approaches had to randomize textures and lighting until the sim "covered" the real distribution. With a foundation-model vision encoder, the representations are already far less sensitive to these variations. The remaining gap is mostly in physics: contact dynamics, friction, deformable objects. Domain randomization still helps here, but the visual gap, historically a major obstacle, is greatly reduced by pre-training.

The practical workflow for a VLA team today: (1) collect ~50-100 real demos for a new task (~5-10 hours of teleop at the 10 episodes/hour rate above), (2) optionally augment with 10K sim demos using domain randomization, (3) fine-tune the pre-trained VLA on this combined dataset for a few hours on 8 GPUs, (4) deploy and iterate. The pre-trained VLA already knows general manipulation from OXE data; the fine-tuning teaches it the specific task and environment. This recipe can get a new pick-and-place task working in under 24 hours.
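
One practical detail of step (3) is that 100 real demos pooled with 10K sim demos make up under 1% of the data. A common remedy is to fix the share of real data per batch. The sketch below illustrates that idea; the 50/50 ratio and the function names are assumptions, not a recipe taken from any specific VLA.

```python
import random
from typing import List, Sequence


def real_fraction_if_pooled(n_real: int, n_sim: int) -> float:
    """Share of real demos when both datasets are simply concatenated."""
    return n_real / (n_real + n_sim)


def sample_batch(
    real: Sequence[str],
    sim: Sequence[str],
    batch_size: int,
    real_fraction: float,
    rng: random.Random,
) -> List[str]:
    """Pick the source dataset first, then a demo from it, for every batch slot."""
    batch = []
    for _ in range(batch_size):
        source = real if rng.random() < real_fraction else sim
        batch.append(rng.choice(source))
    return batch


real_demos = [f"real_{i}" for i in range(100)]
sim_demos = [f"sim_{i}" for i in range(10_000)]

pooled_share = real_fraction_if_pooled(len(real_demos), len(sim_demos))
batch = sample_batch(real_demos, sim_demos, 10_000, real_fraction=0.5, rng=random.Random(0))
real_share = sum(name.startswith("real_") for name in batch) / len(batch)
print(f"pooled: {pooled_share:.2%}, reweighted: {real_share:.2%}")
```

Oversampling the real demos this heavily means each one is seen many times, so the right ratio is an empirical question to settle on held-out real episodes.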

Check: What is domain randomization?

a) Training in multiple real environments
b) Randomly choosing which robot to use
c) Randomizing visual/physical properties in simulation to improve transfer

Chapter 9: The Embodied Future

We're at the beginning of the VLA era. Current models can follow simple instructions in controlled settings, but the gap to human-level dexterity and generalization remains vast. Let's ground the current state in real numbers.

Real-world latency budget: A VLA control loop in production adds up as follows: camera capture (~5ms) + image preprocessing (~2ms) + VLM forward pass (~50ms) + action denoising (~20ms) + motor command (~5ms) = ~82ms total, or ~12 Hz if every action needs its own inference call. With action chunking (k=50 at 50Hz), one inference call covers 1 second of motion. Most manipulation tasks need only 5-10 Hz, so current VLAs are fast enough; the bottleneck is not speed but generalization.
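
The budget is simple enough to check by hand, but putting it in code makes it easy to swap in measurements from a real system. The stage names and timings below are the ones quoted above; the variable names are illustrative.

```python
# Per-stage latencies in milliseconds, as quoted in the text.
STAGES_MS = {
    "camera capture": 5,
    "image preprocessing": 2,
    "VLM forward pass": 50,
    "action denoising": 20,
    "motor command": 5,
}

total_ms = sum(STAGES_MS.values())
loop_hz = 1000 / total_ms                        # rate if every action needs a full pass
slowest_stage = max(STAGES_MS, key=STAGES_MS.get)

# With chunking: k actions are played back at the control rate per inference call.
chunk_size, control_hz = 50, 50
chunk_seconds = chunk_size / control_hz
inference_share = (total_ms / 1000) / chunk_seconds  # fraction of the chunk spent computing

print(f"total: {total_ms} ms -> {loop_hz:.1f} Hz, slowest stage: {slowest_stage}")
print(f"one chunk covers {chunk_seconds:.1f} s; inference takes {inference_share:.1%} of it")
```

The VLM forward pass accounts for more than half of the loop, so it is the first place to look when a task needs faster reactions than chunked playback can give.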

Key open challenges:

Challenge	Current State	What's Needed
Dexterity	Basic grasping	In-hand manipulation, tool use
Long-horizon	1-2 step tasks	Multi-step planning, error recovery
Safety	Lab environments	Real-world safety guarantees
Data scale	~1M episodes	Internet-scale robot data
Speed	~12Hz inference loop (chunking raises the command rate, not the reaction rate)	Reactive control for dynamic tasks (100Hz+)

VLA Timeline

Key milestones in the journey from pure VLMs to embodied agents.

Camera capture

~5ms

↓

Image preprocess

~2ms (resize, normalize)

↓

VLM forward pass

~50ms (ViT + LLM)

↓

Action denoising

~20ms (10 flow steps)

↓

Motor command

~5ms (to robot controller)

The scaling hypothesis for robotics: Just as language models improved dramatically with scale, the bet is that robot foundation models will too, given enough data, diverse embodiments, and compute. Projects like DROID (about 76K demonstration trajectories collected across 13 institutions) and OXE are building this data flywheel.

"The last grand challenge of AI is to give minds a body."

— Fei-Fei Li

You now understand the path from seeing to acting — the data flow from camera pixels to motor commands, the engineering choices of action representations and chunking, and the architectural decisions that bridge billion-parameter language models to 7-DOF robot arms. VLAs are teaching machines to reach out and touch the world.

Check: What is the biggest bottleneck for scaling VLAs?

a) Lack of good vision encoders
b) Scarcity of diverse, high-quality robot demonstration data
c) Language models are too slow
