The Complete Beginner's Path

Understand Vision-Language-Action Models

How foundation models learned to move robots — translating pixels and instructions into physical actions in the real world.

Prerequisites: Basic ML intuition + Curiosity about robotics. That's it.
10 Chapters · 8+ Simulations · 0 Robotics Background Needed

Chapter 0: From Seeing to Acting

Vision-Language Models (VLMs) can look at an image and answer questions. But what if instead of answering with words, the model answered with motor commands? That's the leap from VLM to VLA (Vision-Language-Action): a model that sees the world, understands a language instruction, and outputs physical actions.

The insight is deceptively simple: if a VLM can generate the token sequence "pick up the red cup," why can't it instead generate the action sequence [move_to(0.3, 0.5, 0.2), close_gripper()]? A VLA treats actions as just another modality — another kind of output token.

The big idea: A VLA is a VLM whose output vocabulary has been extended to include robot actions. See + Understand + Act, all in one model.

VLM vs VLA Pipeline

Watch how the same backbone produces different outputs: text for VLMs, actions for VLAs.

Check: What is the key difference between a VLM and a VLA?

Chapter 1: Behavioral Cloning

The simplest way to teach a robot: show it what to do. A human demonstrates the task (teleoperation), recording observations and actions. Then we train a neural network to predict the expert's action given the current observation. This is behavioral cloning (BC) — supervised learning for robotics.

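$$\theta^* = \arg\min_\theta \sum_t \lVert f_\theta(o_t) - a_t \rVert^2$$

Here f_θ is the policy network, o_t is the observation at step t, and a_t is the expert's action at that step.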

BC is simple but fragile. If the robot drifts slightly off the demonstrated path, it encounters states it has never seen before — the distribution shift problem. Small errors compound, and the robot spirals into failure.
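As a minimal sketch of BC as plain supervised regression, assume a hypothetical 1-D task where the expert's action is a linear function of the observation. The expert, the function name `fit_linear_policy`, and the data are all illustrative, not taken from any VLA codebase.

```python
import random

def expert_action(obs):
    # Hypothetical expert: steer back toward the origin.
    return -0.5 * obs

def fit_linear_policy(observations, actions):
    """Closed-form least squares for a = w * o + b (minimizes squared error)."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(actions) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observations, actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    w = cov / var
    b = mean_a - w * mean_o
    return w, b

# "Teleoperation" data: observations paired with the expert's actions.
rng = random.Random(0)
demo_obs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
demo_act = [expert_action(o) for o in demo_obs]

w, b = fit_linear_policy(demo_obs, demo_act)
print(f"learned policy: a = {w:.3f} * o + {b:.3f}")
```

The fit is only trustworthy for observations inside the demonstrated range of [-1, 1]. Nothing in the training data constrains what the policy does once the robot drifts outside it.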

Behavioral Cloning: Compounding Errors

The teal path is the expert demo. The red path is the BC agent. Watch errors compound as it drifts away.

Why it still matters: Despite its flaws, BC is the training backbone of the VLAs covered here. RT-1, RT-2, and OpenVLA all learn from human demonstrations. The trick is combining BC with architectures powerful enough to generalize.
Check: What is the main failure mode of behavioral cloning?

Chapter 2: Action Representations

How do you represent "move the arm"? This choice is critical. Common representations include:

Representation | Format | Pros | Cons
Joint angles | [θ1,...,θ7] | Direct motor control | Robot-specific
End-effector pose | [x, y, z, rx, ry, rz, grip] | Intuitive, transferable | Needs IK solver
Discrete bins | Token index per dim | Works with language models | Loses precision
Continuous vectors | Raw floats | Full precision | Harder for LLMs

Action Discretization

Drag the slider to change the number of bins. More bins = more precision but larger vocabulary. The green dot shows the discretized position.

The RT-2 trick: Discretize each action dimension into 256 bins, then map each bin to an existing text token (like the numbers 0-255). This lets a language model output actions without any architectural changes!
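
The precision trade-off can be sketched with uniform binning. This assumes each action dimension is already normalized to a known range [low, high]; real systems choose those ranges from their data, and the function names here are illustrative.

```python
def discretize(value, low, high, num_bins=256):
    """Map a continuous value in [low, high] to an integer bin index."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The top edge (fraction == 1.0) belongs to the last bin.
    return min(int(fraction * num_bins), num_bins - 1)

def undiscretize(bin_index, low, high, num_bins=256):
    """Map a bin index back to the center of its bin."""
    bin_width = (high - low) / num_bins
    return low + (bin_index + 0.5) * bin_width

x = 0.3
for bins in (16, 256):
    idx = discretize(x, -1.0, 1.0, bins)
    recon = undiscretize(idx, -1.0, 1.0, bins)
    print(f"{bins} bins: index {idx}, round-trip error {abs(recon - x):.5f}")
```

With bin-center decoding, the round-trip error is at most half a bin width, so going from 16 to 256 bins cuts the worst-case error by a factor of 16.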
Check: Why do VLA models often discretize continuous actions?

Chapter 3: RT-2 — Language as Action

RT-2 (Robotics Transformer 2) from Google DeepMind is the landmark model showing that VLMs can directly control robots. The key insight: fine-tune a VLM so that instead of generating text, it generates action tokens.

RT-2 takes a PaLM-E or PaLI-X vision-language model, discretizes robot actions into 256 bins per dimension, maps them to string tokens ("128", "64", "255"), and co-fine-tunes on robot demonstrations alongside web-scale vision-language data.

Input: [Camera image] + "Pick up the bottle"
VLM Backbone: PaLM-E (12B) or PaLI-X (55B)
Output: "1 128 91 241 5 127 100" (7 action dims)

RT-2 Action Tokenization

See how a continuous 7-DOF action is tokenized into text. Each dimension maps to a bin number.

Why this works: VLMs pretrained on internet data already understand spatial concepts, object relationships, and goals. RT-2 showed that this knowledge transfers: "pick up the bottle near the apple" works even if the robot never practiced that exact arrangement.
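
The round trip between a continuous action and an action string can be sketched as follows. This is a simplified illustration, not RT-2's actual code: it assumes every dimension is normalized to [-1, 1] and uses a made-up 7-dimensional action, whereas the real model uses per-dimension ranges.

```python
NUM_BINS = 256

def action_to_text(action, low=-1.0, high=1.0):
    """Encode a continuous action vector as space-separated bin numbers."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        index = min(int((clipped - low) / (high - low) * NUM_BINS), NUM_BINS - 1)
        tokens.append(str(index))
    return " ".join(tokens)

def text_to_action(text, low=-1.0, high=1.0):
    """Decode bin numbers back to bin-center continuous values."""
    width = (high - low) / NUM_BINS
    return [low + (int(tok) + 0.5) * width for tok in text.split()]

action = [0.0, 0.5, -0.25, 1.0, -1.0, 0.1, 0.9]  # hypothetical 7-dim action
text = action_to_text(action)      # this is what the language model is trained to emit
decoded = text_to_action(text)     # this is what gets sent to the robot
print(text)
print([round(v, 3) for v in decoded])
```

Because the action is just a short string of numbers, the training target looks like any other text completion, which is why no architectural change is needed.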
Check: How does RT-2 represent robot actions?

Chapter 4: Diffusion Policies

What if instead of predicting a single action, the robot could sample from a distribution over actions? Diffusion Policy applies the same denoising process used in image generation to robot action prediction.

Starting from random noise, the model iteratively refines an action trajectory. This naturally handles multimodal action distributions — when there are multiple valid ways to do something (reach from the left or right), the diffusion model can represent all of them.

$$a_0 \sim \mathcal{N}(0, I) \;\to\; a_1 \to \dots \to a_T = \text{denoised action}$$

Diffusion Denoising for Actions

Watch random noise get denoised into a clean action trajectory. Each step removes noise. The green path is the final clean trajectory.
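
A minimal sketch of the iterative refinement idea, assuming a toy 1-D action with two valid modes (-1 for "go left", +1 for "go right"). The score function below is written analytically for that toy distribution and stands in for the neural network a real diffusion policy would learn from demonstrations. The sampler is an annealed Langevin-style loop, a close relative of the denoising process rather than the exact algorithm from the Diffusion Policy paper. All names and constants are illustrative.

```python
import math
import random

MODE = 1.0        # two valid actions: -1.0 (go left) and +1.0 (go right)
DATA_STD = 0.1    # spread of demonstrations around each mode

def score(x, noise_std):
    """Gradient of the log-density of the two-mode mixture blurred by noise_std."""
    var = DATA_STD ** 2 + noise_std ** 2
    # For equal-weight modes, tanh gives P(right mode | x) - P(left mode | x).
    expected_mode = MODE * math.tanh(MODE * x / var)
    return (expected_mode - x) / var

def sample_action(rng, noise_levels, steps_per_level=20):
    """Start from pure noise and refine it step by step into an action."""
    x = rng.gauss(0.0, noise_levels[0])
    for noise_std in noise_levels:          # noise levels shrink over time
        step = 0.1 * (DATA_STD ** 2 + noise_std ** 2)
        for _ in range(steps_per_level):
            x += step * score(x, noise_std) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
levels = [2.0, 1.0, 0.5, 0.25, 0.1, 0.05, 0.0]
samples = [sample_action(rng, levels) for _ in range(200)]
left = sum(1 for s in samples if s < 0)
print(f"{left} samples near -1, {len(samples) - left} samples near +1")
```

Each sample commits to one of the two modes instead of landing between them, and across many samples both modes appear.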

Why not just regression? Regression with a squared-error loss predicts the average action. When two valid paths exist (go left or go right), the average is the middle, which might be invalid (crashing into the obstacle). Diffusion can represent multiple modes.
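
The averaging problem can be checked with a few lines of arithmetic. This sketch assumes a hypothetical set of demonstrations for a single observation, half steering left (-1) and half steering right (+1).

```python
def best_constant_prediction(actions):
    """The constant that minimizes mean squared error is the arithmetic mean."""
    return sum(actions) / len(actions)

def mean_squared_error(prediction, actions):
    return sum((prediction - a) ** 2 for a in actions) / len(actions)

# Hypothetical demos for one observation: half steer left, half steer right.
demo_actions = [-1.0] * 50 + [1.0] * 50

prediction = best_constant_prediction(demo_actions)
print("regression output:", prediction)
print("loss at the midpoint:", mean_squared_error(prediction, demo_actions))
print("loss at a valid mode:", mean_squared_error(1.0, demo_actions))
```

The squared-error loss is lower at the midpoint than at either valid action, so a regressor is actively pushed toward the one choice no expert ever made.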
Check: What problem does Diffusion Policy solve that regression can't?

Chapter 5: Action Chunking

Predicting one action at a time is reactive and jerky. Action Chunking (from ACT — Action Chunking with Transformers) predicts an entire sequence of future actions at once. Instead of "what should I do now?", the model answers "what should I do for the next H steps?"

This turns control into a trajectory prediction problem. The robot executes a chunk of actions, then re-plans. This produces smoother motion and handles temporal correlation in the demonstrations.
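
The execute-then-re-plan loop can be sketched as below. The `predict_chunk` function is a hypothetical stand-in for a learned policy (it simply plans a straight line to a target at 0), and the 1-D position is a toy state.

```python
def predict_chunk(observation, horizon):
    """Hypothetical stand-in for a learned policy: plan `horizon` steps toward 0."""
    step = -observation / horizon
    return [step] * horizon

def run_chunked_control(start, horizon, execute_steps, total_steps):
    """Predict a chunk, execute the first `execute_steps` actions, then re-plan."""
    position = start
    trajectory = [position]
    policy_calls = 0
    while len(trajectory) - 1 < total_steps:
        chunk = predict_chunk(position, horizon)
        policy_calls += 1
        for action in chunk[:execute_steps]:
            if len(trajectory) - 1 >= total_steps:
                break
            position += action
            trajectory.append(position)
    return trajectory, policy_calls

# Chunked: execute all 8 predicted actions before asking the policy again.
trajectory, calls = run_chunked_control(start=1.0, horizon=8, execute_steps=8, total_steps=16)
# Single-step: re-plan after every action.
single_traj, single_calls = run_chunked_control(start=1.0, horizon=8, execute_steps=1, total_steps=16)
print("chunked policy calls:", calls)
print("single-step policy calls:", single_calls)
```

Executing fewer than H actions per chunk is a middle ground: the robot still commits to a short plan but reacts sooner to new observations.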

Single-Step vs Chunked Actions

Compare single-step (reactive) control with chunked (planned) control. Notice how chunking is smoother.

Temporal ensemble: ACT can query the policy at every timestep, so each timestep receives predictions from several overlapping chunks. Blending them with an exponentially weighted average further smooths the executed trajectory and reduces jitter.
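
A minimal sketch of that blending step, assuming the weighting w_i = exp(-m * i) with i = 0 for the oldest prediction, as described for ACT. The value of m and the example chunks are arbitrary illustrations.

```python
import math

def temporal_ensemble(predictions, m=0.1):
    """Blend predictions for one timestep, ordered oldest first.

    Weights follow w_i = exp(-m * i), so the oldest prediction (i = 0)
    gets the largest weight; m = 0 reduces to a plain average.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total

# Hypothetical chunks of H = 3 actions predicted at timesteps 0, 1, and 2.
chunks = {0: [0.0, 0.1, 0.2], 1: [0.12, 0.22, 0.3], 2: [0.18, 0.31, 0.4]}
t = 2
# Collect every chunk's prediction for timestep t, oldest chunk first.
candidates = [chunk[t - start] for start, chunk in sorted(chunks.items())
              if 0 <= t - start < len(chunk)]
blended = temporal_ensemble(candidates)
print("candidates for t=2:", candidates)
print("executed action:", round(blended, 4))
```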
Check: What does action chunking predict?

Chapter 6: OpenVLA Architecture

OpenVLA is the open-source counterpart to RT-2: a 7B-parameter VLA built on Llama 2 with a SigLIP vision encoder (fused with DINOv2 features in the released model), trained on the Open X-Embodiment dataset. It demonstrated that smaller, open models can rival much larger proprietary VLAs.

SigLIP (+ DINOv2) Vision Encoder: 224×224 image → vision tokens
Projection MLP: Align vision features to LLM space
Llama 2 (7B): [vision tokens] + "pick up the cup" → action tokens
De-tokenize: 256-bin tokens → 7-DOF continuous action

OpenVLA Token Flow

Trace the path from image + instruction to robot action. Each color represents a different processing stage.

Training data: OpenVLA was trained on 970K episodes from Open X-Embodiment spanning 22 robot types and hundreds of tasks. This diversity is what drives its generalization: it has seen enough variety to handle many novel situations.
Check: What is OpenVLA's language model backbone?

Chapter 7: Cross-Embodiment

A human can watch someone else cook and learn the recipe, even though their arms are different lengths. Can robots do the same? Cross-embodiment learning trains on data from many different robot types, hoping that high-level task knowledge transfers even when the hardware differs.

The Open X-Embodiment (OXE) dataset combined demonstrations from 22+ robot embodiments: single-arm manipulators, dual-arms, quadrupeds, dexterous hands. The finding: a single policy trained on all this data outperforms robot-specific policies on most tasks.

Cross-Embodiment Transfer

Different robots contribute demonstrations to a shared policy. Toggle embodiments on/off.

The positive transfer hypothesis: Even though a WidowX arm and a Franka Panda have different kinematics, they share the same visual semantics ("what is a cup?") and task structures ("pick, then place"). It's this shared structure that enables transfer.
Check: Why does cross-embodiment training help?

Chapter 8: Sim-to-Real Transfer

Real robot data is expensive and slow to collect. Simulation is cheap and infinitely scalable. But policies trained in sim often fail in the real world — the reality gap. Visual differences (lighting, textures), physics mismatches (friction, contact), and sensor noise all contribute.

The Reality Gap

See how domain randomization helps bridge sim and real. Each refresh randomizes the simulated environment.

Technique | How
Domain Randomization | Randomize textures, lighting, physics in sim
System Identification | Calibrate sim to match real physics
Progressive Nets | Train in sim, fine-tune with little real data
Foundation Models | VLMs pretrained on real images reduce the visual gap
VLAs as a bridge: VLMs pretrained on internet images already understand real-world visual features. When a VLA uses a pretrained vision encoder, the visual part of the reality gap narrows considerably, because the encoder has already seen millions of real kitchens, tables, and cups.
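
Domain randomization itself is simple to sketch: before each simulated episode, draw the environment's parameters from a range instead of fixing them. The parameter names and ranges below are hypothetical and not tied to any particular simulator. The `strength` argument is an illustrative knob for how much of each range is used.

```python
import random

# Hypothetical parameter ranges; real values depend on the simulator and task.
RANDOMIZATION_RANGES = {
    "light_intensity": (0.3, 1.5),
    "table_friction": (0.4, 1.2),
    "camera_jitter_deg": (-5.0, 5.0),
    "object_mass_kg": (0.05, 0.5),
}

def sample_sim_params(rng, strength=1.0):
    """Draw one randomized environment.

    strength in [0, 1] scales each range around its midpoint:
    0 gives the fixed nominal sim, 1 gives full randomization.
    """
    params = {}
    for name, (low, high) in RANDOMIZATION_RANGES.items():
        mid = (low + high) / 2.0
        half = (high - low) / 2.0 * strength
        params[name] = rng.uniform(mid - half, mid + half)
    return params

rng = random.Random(0)
episodes = [sample_sim_params(rng, strength=0.5) for _ in range(3)]
for params in episodes:
    print({name: round(value, 3) for name, value in params.items()})
```

A policy trained across many such draws cannot rely on any single lighting or friction value, which is what makes the real world look like just another sample.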
Check: What is domain randomization?

Chapter 9: The Embodied Future

We're at the beginning of the VLA era. Current models can follow simple instructions in controlled settings, but the gap to human-level dexterity and generalization remains vast. Key open challenges:

Challenge | Current State | What's Needed
Dexterity | Basic grasping | In-hand manipulation, tool use
Long-horizon | 1-2 step tasks | Multi-step planning, error recovery
Safety | Lab environments | Real-world safety guarantees
Data scale | ~1M episodes | Internet-scale robot data
Speed | ~3 Hz control | Real-time reactivity (100 Hz+)

VLA Timeline

Key milestones in the journey from pure VLMs to embodied agents.

The scaling hypothesis for robotics: Just as language models improved dramatically with scale, the bet is that robot foundation models will too, given enough data, diverse embodiments, and compute. Projects like DROID (76K demonstration trajectories collected across 13 institutions) and OXE are building this data flywheel.
"The last grand challenge of AI is to give minds a body."
— Fei-Fei Li

You now understand the path from seeing to acting. VLAs are teaching machines to reach out and touch the world.

Check: What is the biggest bottleneck for scaling VLAs?