8-Part Technical Series

Vision-Language-Action Models

Where perception meets physical action: from the fundamentals of imitation learning to the foundation models that teach robots to see, understand language, and act in the real world. Covers RT-2, OpenVLA, and the path to generalist embodied intelligence. Built for engineers who want to understand these systems, not just use them.

8 Articles
~280 min total read
30+ Interactive demos
01
Foundations

Embodied AI Foundations

The robot learning problem from first principles: observation and action spaces, Markov decision processes (MDPs), simulation environments, the reality gap, and why embodied intelligence is fundamentally different from language modeling.

02
Core Theory

Imitation Learning & Behavioral Cloning

Learning from demonstrations, the behavioral cloning objective, distribution shift and compounding errors, DAgger, and when imitation beats reinforcement learning.

03
Architecture

Vision Encoders for Robotics

Spatial representations for manipulation: depth sensing, point clouds, 3D scene understanding, pretrained visual features, and what robots need to see that ImageNet pretraining doesn't teach.

04
Core Theory

Language-Conditioned Policies

Task specification through natural language, grounding instructions to motor commands, language embeddings as goal representations, and multi-task policies.

05
Architecture

VLA Architectures

RT-1's tokenized actions, RT-2's transfer of web-scale vision-language knowledge to robot control, Octo's modular design, OpenVLA's open-source recipe, and how foundation models become robot policies.

06
Training

Training Data & Pipelines

Open X-Embodiment and DROID datasets, teleoperation and data collection, sim-to-real transfer, domain randomization, data scaling laws, and cross-embodiment generalization.

07
Capabilities

Planning & Reasoning

SayCan and affordance-based planning, Inner Monologue, Code as Policies, LLM-driven task decomposition, world models, and chain-of-thought reasoning for robots.

08
Applications

Deployment & Frontiers

Real-world robustness and safety, dexterous manipulation, humanoid robots, generalist policies, multi-robot coordination, and the path to foundation models that act.