Introduction

In 2023, large language models crossed a threshold: they could write essays, debug code, pass medical exams, and hold open-ended conversations indistinguishable from human output. But ask the most capable LLM to pick up a coffee mug and it can only describe the process. It cannot do it.

This is not a minor gap. It is the central unsolved problem of artificial intelligence. Intelligence evolved in animals not to answer questions, but to act in the world — to find food, avoid predators, manipulate objects, navigate terrain. Language is a recent evolutionary invention; sensorimotor control is ancient. A crow that uses a stick to extract insects from a log is solving a problem that no language model can solve, despite the crow having roughly 1.5 billion neurons compared to GPT-4's estimated 1.8 trillion parameters.

Embodied AI is the subfield of artificial intelligence concerned with agents that perceive and act in physical (or physically-simulated) environments. Unlike language models, which operate on static datasets of text, embodied agents face a fundamentally different learning problem: actions have consequences, consequences are often irreversible, and the feedback loop between perception, decision, and action must operate in real time.

This article lays the mathematical and conceptual groundwork for the entire VLA series. We will formalize the robot learning problem using Markov Decision Processes, explore the observation and action spaces that define what a robot can sense and do, survey the simulation environments where most robot learning happens today, and confront the reality gap — the persistent challenge of transferring learned behaviors from simulation to the physical world.

ℹ What this article covers

We start from first principles: what makes embodied AI fundamentally different from language modeling, how to formalize the robot learning problem as an MDP/POMDP, the observation and action spaces that define a robot's interface to the world, major simulation platforms, the reality gap and approaches to bridging it, and the hardware platforms that anchor current VLA research. No prior robotics knowledge is assumed.

What Makes Embodied AI Different

A language model processes a sequence of tokens and produces the next token. The input is a static snapshot of text. The output is a probability distribution over a fixed vocabulary. The model never changes the world it reads from — your prompt is the same before and after the model responds.

An embodied agent is fundamentally different in at least four ways, and understanding these differences is essential before we can meaningfully discuss how to build systems that bridge language understanding and physical action.

Closed-loop interaction

When a robot reaches for an object, its action changes the state of the world. The object moves. The robot's arm is now in a different configuration. The next observation reflects these changes. The agent must then decide its next action based on the new state, not the original one. This creates a closed loop: perception → action → new perception → new action.

In a language model, each token prediction is conditioned on the full preceding context, but that context was fixed before the model began generating. In an embodied agent, the "context" is the physical world, and every action the agent takes modifies that context. This means:

  • Errors compound. A slightly wrong grasp angle at timestep 5 may cause the object to slip at timestep 20, leading to a completely different trajectory. There is no "backspace."
  • Exploration is costly. In language, generating a bad token costs nothing — you discard it and try again. In robotics, a bad action can damage hardware, break objects, or injure people.
  • Timing matters. Actions must be executed at specific frequencies (typically 5–20 Hz for manipulation, 50–1000 Hz for locomotion). A policy that takes too long to compute is useless, regardless of its quality.

Physical constraints

Language models operate in an unconstrained symbolic space. A model can "move" from discussing quantum physics to writing Python to composing poetry with no friction. Physical agents face hard constraints:

  • Kinematics. A 7-degree-of-freedom robot arm can only reach a subset of positions and orientations (its workspace). Some poses are physically impossible; others cause self-collision.
  • Dynamics. Objects have mass, friction, and elasticity. A policy must account for the weight of an object when deciding grasp force. Too little force and the object slips; too much and it deforms or breaks.
  • Irreversibility. If a robot drops a glass, it shatters. If a robot pushes a block off a table, the block falls. Many actions cannot be undone. This contrasts sharply with the reversibility of token generation.
  • Partial observability. A camera sees only one view. Objects occlude each other. The robot cannot see inside a closed drawer. Tactile sensors feel only what they touch. The agent must act under persistent uncertainty about the true world state.
💡 The Moravec paradox

Hans Moravec observed in 1988 that "it is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility." This insight — now called Moravec's Paradox — remains remarkably relevant. We have LLMs that pass the bar exam but no robot that can reliably fold laundry. The high-level reasoning that seems hard to humans is computationally cheap; the sensorimotor skills that seem effortless are computationally expensive.

The MDP Framework

To reason formally about robot learning, we need a mathematical framework that captures the essential structure: an agent observes the world, takes an action, the world transitions to a new state, and the agent receives feedback. The standard formalization is the Markov Decision Process (MDP).

Formal definition

An MDP is defined by a tuple (S, A, T, R, γ):

S — State space: the set of all possible world states
A — Action space: the set of all actions the agent can take
T(s′ | s, a) — Transition function: probability of reaching state s′ given state s and action a
R(s, a) — Reward function: scalar feedback after taking action a in state s
γ ∈ [0, 1) — Discount factor: how much future rewards are worth relative to immediate ones

A policy π(a | s) maps states to actions (or distributions over actions). The goal is to find the policy that maximizes the expected cumulative discounted reward:

J(π) = E_π[ ∑_{t=0}^{∞} γ^t R(s_t, a_t) ]

The Markov property is the key assumption: the future state depends only on the current state and action, not on the history. Formally: T(s_{t+1} | s_0, a_0, …, s_t, a_t) = T(s_{t+1} | s_t, a_t). This is a strong assumption — in practice, robots often violate it because the "state" is defined by observations that don't capture everything relevant — but it makes the mathematics tractable.

For a concrete robotics example: consider a robot arm reaching for a cup on a table. The state s includes the joint angles of the arm, joint velocities, the position and orientation of the cup, and physical properties like friction coefficients. The action a might be a vector of torques applied to each joint motor. The transition function T is determined by physics (Newton's laws, contact mechanics). The reward R could be -1 at each timestep until the robot grasps the cup (encouraging speed), or it could be a shaped function based on the distance between the gripper and the cup.
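To make the objective concrete, here is a minimal sketch that evaluates the discounted return for a single episode's reward sequence, using the sparse "-1 per step until success" reward described above. The function name `discounted_return` is ours, not from any library.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode's reward sequence."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))  # 1, gamma, gamma^2, ...
    return float(np.sum(discounts * rewards))

# Sparse reach reward: -1 at each step until the grasp succeeds at step 4
rewards = [-1, -1, -1, -1, 0]
print(discounted_return(rewards, gamma=0.9))  # -(1 + 0.9 + 0.81 + 0.729) = -3.439
```

Note how the discount factor makes early failures costlier than late ones, which is exactly what pushes the optimal policy toward reaching the cup quickly.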

Partial observability (POMDP)

In practice, robots never have access to the full state. A camera image does not reveal the mass of an object, the friction coefficient of a surface, or what is behind an occluding wall. This motivates the Partially Observable MDP (POMDP), which extends the MDP with:

Ω — Observation space: what the agent actually perceives
O(o | s′, a) — Observation function: probability of seeing observation o after transitioning to s′ via action a

Now the agent receives observations o ∈ Ω rather than states s ∈ S. The policy must map observations (or observation histories) to actions: π(a | o_1, o_2, …, o_t). In theory, the optimal POMDP policy maintains a belief state — a probability distribution over the true state given all past observations and actions. In practice, this is intractable for continuous state spaces, and modern approaches use neural networks to learn representations that implicitly maintain relevant state information.
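For intuition, the exact belief update is tractable in the discrete case: b′(s′) ∝ O(o | s′) · ∑_s T(s′ | s, a) · b(s). Below is a small NumPy sketch; the two-state "cup upright vs. tipped" example and all probabilities are invented for illustration, and the observation model is simplified to depend on s′ only.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes filter: b'(s') ∝ O(o | s') * sum_s T(s' | s, a) * b(s).

    b: (S,) prior belief; T: (A, S, S) transitions T[a, s, s'];
    O: (S, n_obs) observation likelihoods O[s', o].
    """
    predicted = b @ T[a]             # prediction step: sum_s b(s) T(s'|s,a)
    unnorm = O[:, o] * predicted     # correction step: weight by likelihood
    return unnorm / unnorm.sum()     # renormalize to a distribution

# Toy example: state 0 = "cup upright", state 1 = "cup tipped over"
T = np.array([[[0.9, 0.1],           # one action; tipping is absorbing
               [0.0, 1.0]]])
O = np.array([[0.8, 0.2],            # upright: usually "looks upright"
              [0.3, 0.7]])           # tipped: usually "looks tipped"
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=1, T=T, O=O)  # we observed "looks tipped"
print(b)                                  # belief shifts heavily toward tipped
```

The intractability mentioned above comes from the fact that for continuous states, `b` is a density over an infinite set rather than a small vector.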

This is where the connection to VLAs begins to crystallize. A VLA takes an image observation (from a camera) and a language instruction (the task specification), and outputs an action. The image is a partial observation of the scene. The language instruction can be thought of as specifying the reward function implicitly — "pick up the red cup" defines what counts as success without requiring an explicit mathematical reward. The VLA's learned representations serve as an implicit belief state, aggregating information from visual features, language embeddings, and potentially past observations.

ℹ MDP vs. POMDP in practice

Most robotics papers describe their setup using MDP notation for simplicity, even though the real system is a POMDP. The implicit assumption is that the observation (e.g., a camera image plus proprioceptive state) is "close enough" to the full state for practical purposes. This works surprisingly well for many manipulation tasks, but fails when objects are occluded, when physical properties (mass, friction) matter, or when the task requires memory of past events. VLAs that process only the current frame operate as reactive policies; those that process observation histories (e.g., through recurrence or attention over past frames) can handle partial observability more gracefully.
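One common way to give a reactive policy the limited memory mentioned above is frame stacking: feed the last k observations instead of one. A minimal sketch, where the class name and the pad-with-first-frame reset strategy are our choices:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the k most recent observations as the policy input."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        for _ in range(self.k):       # pad history with the first frame
            self.frames.append(obs)
        return self.observation()

    def step(self, obs):
        self.frames.append(obs)       # oldest frame is dropped automatically
        return self.observation()

    def observation(self):
        return np.stack(self.frames)  # shape: (k, *obs_shape)

stack = FrameStack(k=4)
x = stack.reset(np.zeros((224, 224, 3)))
print(x.shape)  # (4, 224, 224, 3)
```

Attention over past frames generalizes this idea: instead of a fixed window concatenated at the input, the model learns which parts of the history to keep.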

MDP — Agent-Environment Loop Interactive

Click Step to advance the agent through the MDP loop. Watch how the state changes after each action and reward is received.


Observation Spaces

The observation space defines what information the robot has access to at each timestep. This is perhaps the most important design decision in the entire robot learning pipeline, because it determines the upper bound on what the policy can possibly learn. A policy cannot react to information it never sees.

Proprioception

Proprioception is the robot's sense of its own body configuration. In humans, proprioceptive sensors in muscles and joints tell you the position of your arm even with your eyes closed. In robots, encoders on each joint motor provide precise measurements of joint angles and velocities.

For a typical 7-DOF robot arm (like the Franka Emika Panda, widely used in research), the proprioceptive observation is a vector of:

  • Joint positions: 7 angles in radians, q ∈ ℝ⁷
  • Joint velocities: 7 angular velocities, q̇ ∈ ℝ⁷
  • Joint torques: 7 measured torques (if available), τ ∈ ℝ⁷
  • Gripper state: opening width (continuous) or open/closed (binary)

Proprioception is high-bandwidth (typically available at 1 kHz), low-latency, and far less noisy than vision. Many simple control policies operate on proprioception alone (e.g., reaching to a known target position). But proprioception tells the robot nothing about the external world — where objects are, what they look like, or how the scene is changing.
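As a concrete illustration, the proprioceptive observation is just the concatenation of the vectors listed above. The numbers below are placeholders, not real sensor readings:

```python
import numpy as np

# Hypothetical proprioceptive reading for a 7-DOF arm with a 1-DOF gripper
q     = np.zeros(7)        # joint positions (rad)
q_dot = np.zeros(7)        # joint velocities (rad/s)
tau   = np.zeros(7)        # measured joint torques (N·m)
grip  = np.array([0.04])   # gripper opening width (m)

# The policy typically consumes a single flat vector
proprio = np.concatenate([q, q_dot, tau, grip])
print(proprio.shape)  # (22,)
```

Compare this 22-dimensional vector to the 150,528 dimensions of a single 224×224 RGB image: proprioception is tiny, precise, and cheap to process.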

Visual observations

Vision is the richest sensory channel for most manipulation tasks. A single camera image provides information about object identities, positions, orientations, colors, textures, and spatial relationships — all crucial for deciding how to act. But visual observations are also high-dimensional, noisy, and ambiguous.

Common visual observation types in robotics:

RGB Images

Standard Camera

Typically 224×224 or 256×256 pixels, 3 color channels. The workhorse of VLA research. Most pretrained vision encoders (ViT, ResNet) expect this format. Provides appearance information but no explicit depth.

Depth Maps

Depth Camera (RGBD)

A single-channel image where each pixel value encodes distance from the camera. Intel RealSense and Microsoft Azure Kinect are common depth sensors. Crucial for 3D reasoning about object positions and grasping geometry.

Point Clouds

3D Point Sets

Unordered sets of 3D points (x, y, z), often with color (r, g, b) or normal vectors. Derived from depth cameras or LiDAR. Processed by networks like PointNet or PointNet++. Provide explicit 3D geometry.

Multi-view

Multiple Cameras

Many setups use 2–4 cameras: a third-person view for scene context, a wrist-mounted camera for close-up manipulation detail, and sometimes additional side views. Multi-view input provides partial 3D understanding by covering the scene from complementary viewpoints, reducing occlusion and depth ambiguity.

A key design decision is how many cameras to use and where to place them. Google's RT-1 and RT-2 models use a single third-person RGB camera. Many research setups add a wrist camera for fine manipulation. The Bridge V2 dataset (a major multi-task manipulation dataset) uses a single third-person camera. The choice between single-view and multi-view involves a fundamental tradeoff: more views provide richer spatial information but increase the observation dimensionality and require the policy to fuse information across viewpoints.
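Depth maps and point clouds are linked by the pinhole camera model: a pixel (u, v) with depth z back-projects to x = (u − cx)·z/fx, y = (v − cy)·z/fy. A minimal sketch; the intrinsics fx, fy, cx, cy below are made up for a toy 4×4 image:

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), in meters, to an (H*W, 3) point
    cloud in the camera frame using the pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

depth = np.full((4, 4), 0.5)  # a flat surface 0.5 m from the camera
pts = depth_to_pointcloud(depth, fx=600.0, fy=600.0, cx=2.0, cy=2.0)
print(pts.shape)  # (16, 3); all points share z = 0.5
```

Real depth sensors add noise and holes (invalid pixels), which is why point-cloud pipelines usually filter zero-depth and out-of-range points before processing.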

Tactile and force sensing

Touch provides information that vision cannot: contact forces, surface texture, object stiffness, and slip detection. A robot grasping a paper cup needs to apply enough force to hold it but not so much that it crushes — this requires force feedback, not just visual confirmation of contact.

Force/torque sensors mounted at the wrist measure the 6-DOF wrench (3 forces + 3 torques) at the end-effector. These are standard on most research robots and essential for contact-rich tasks like insertion, polishing, and assembly.

Tactile sensors like GelSight, DIGIT, and BioTac provide high-resolution contact geometry and force distribution. GelSight, for instance, uses an elastomeric gel with embedded markers and a camera to produce a dense tactile image at each fingertip — essentially giving the robot a "sense of touch" with spatial resolution comparable to human fingertips. While promising, tactile sensing remains less standardized than vision, and most current VLA research focuses primarily on visual observations.

Observation Space Explorer Interactive

Toggle different observation modalities to see what information they provide to the robot policy.


Action Spaces

If observation spaces define what the robot sees, action spaces define what it can do. The action space design determines the granularity of control, the difficulty of the learning problem, and the kinds of behaviors the policy can express.

Joint-level control

The most direct form of control: the policy outputs a target value for each joint. For a 7-DOF arm with a 1-DOF gripper, this is an 8-dimensional continuous action vector.

Joint-level actions come in several flavors:

  • Joint position targets: the policy specifies desired joint angles, and a low-level PD controller tracks them. This is the most common in learning-based approaches because it abstracts away dynamics and is relatively safe (commands are bounded positions rather than raw torques, and the tracking controller's finite gains limit how aggressively they are pursued).
  • Joint velocity targets: the policy specifies desired joint angular velocities. Useful for continuous motion but requires more careful safety constraints.
  • Joint torques: the policy directly specifies motor torques. This gives maximum control authority but is the hardest to learn (the policy must implicitly solve the dynamics) and the most dangerous (unbounded torques can damage hardware).
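The position-target flavor relies on a low-level tracking controller; the simplest version is a PD law, τ = Kp·(q* − q) − Kd·q̇. A sketch with illustrative gains (the Kp, Kd values are arbitrary, not tuned for any real arm):

```python
import numpy as np

def pd_torque(q, q_dot, q_target, kp=80.0, kd=4.0):
    """PD control law: torque proportional to position error,
    damped by joint velocity."""
    return kp * (q_target - q) - kd * q_dot

q        = np.zeros(7)          # current joint angles (rad)
q_dot    = np.zeros(7)          # current joint velocities (rad/s)
q_target = np.full(7, 0.1)      # policy's position command (rad)

tau = pd_torque(q, q_dot, q_target)
print(tau)  # 8.0 N·m toward the target on every joint
```

The learned policy emits `q_target` at 3–10 Hz while this inner loop runs at hundreds of Hz, which is how slow neural policies coexist with fast hardware.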

Cartesian (end-effector) control

Instead of controlling individual joints, the policy specifies the desired position and orientation of the end-effector (gripper) in Cartesian space. The action is typically a 7-dimensional vector:

a = [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]

where the first six components are translational and rotational velocity (or position delta) commands, and the last is a gripper action (open/close or continuous width). An inverse kinematics (IK) solver converts these Cartesian targets to joint commands.

Cartesian control is widely used in VLA research because it is more intuitive — "move the gripper 2 cm to the right" is easier to learn from demonstrations than "rotate joint 3 by 0.05 radians" — and it generalizes better across robots with different kinematics. Google's RT-1 and RT-2 both use 7-dimensional Cartesian action spaces. OpenVLA uses a similar formulation.
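As a sketch of how a delta-pose action is consumed downstream (before the IK solver), here is the translational part only. The per-step clipping threshold is an invented safety bound, and rotation composition is omitted because it requires proper quaternion or rotation-matrix arithmetic:

```python
import numpy as np

def apply_cartesian_delta(eef_pos, action, max_step=0.02):
    """Apply the translational part of a 7-D delta action
    [dx, dy, dz, droll, dpitch, dyaw, gripper] to the current
    end-effector position, clipping per-step motion for safety."""
    delta = np.clip(action[:3], -max_step, max_step)
    return eef_pos + delta

eef = np.array([0.4, 0.0, 0.3])                  # current gripper position (m)
a = np.array([0.05, 0.0, -0.01, 0, 0, 0, 1.0])   # move right and down, close
print(apply_cartesian_delta(eef, a))             # [0.42, 0.  , 0.29]
```

The resulting target pose would then go through inverse kinematics to produce joint commands, as described above.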

Discrete vs. continuous actions

A key design decision is whether to treat actions as continuous vectors or discretize them into bins. This choice has profound implications for the policy architecture:

Continuous

Direct regression

The policy outputs a continuous action vector, typically through a Gaussian head (predicting mean and variance) or a deterministic head (predicting the mean only). Natural for robotics but can suffer from multimodal action distributions — when multiple valid actions exist, a Gaussian policy may average them, producing an invalid intermediate action.

Discretized

Tokenized actions

Each continuous action dimension is binned into discrete tokens (e.g., 256 bins per dimension). The policy then predicts a categorical distribution over bins, which naturally handles multimodality. This is the approach used by RT-2 and other VLA models that leverage language model architectures — actions become "tokens" predicted autoregressively.

The discretization approach has become dominant in VLA research because it allows direct reuse of language model architectures and training procedures. RT-2 discretizes each action dimension into 256 bins, mapping each bin to a unique token in the model's vocabulary. The action [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper] becomes a sequence of 7 tokens, predicted autoregressively just like text tokens. This elegant reformulation turns the robot control problem into a sequence prediction problem — exactly the problem that large language models are designed to solve.
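The binning scheme is easy to state precisely. Below is a sketch of the round trip: uniform bins over per-dimension bounds on the way in, bin-center decoding on the way out. The exact bounds and the mapping from bins to vocabulary tokens vary by model.

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Map each continuous action dimension to one of n_bins integer tokens."""
    norm = (action - low) / (high - low)                 # normalize to [0, 1]
    return np.clip((norm * n_bins).astype(int), 0, n_bins - 1)

def detokenize_action(tokens, low, high, n_bins=256):
    """Map integer tokens back to bin-center continuous values."""
    return low + (tokens + 0.5) / n_bins * (high - low)

low, high = np.full(7, -1.0), np.full(7, 1.0)            # per-dimension bounds
a = np.array([0.0, 0.5, -0.5, 0.1, -0.1, 0.99, 1.0])     # a 7-D action
tokens = tokenize_action(a, low, high)
recon = detokenize_action(tokens, low, high)
print(tokens)                                            # 7 integers in [0, 255]
print(np.max(np.abs(recon - a)))                         # error within one bin width
```

The quantization error (at most one bin width, here 2/256 ≈ 0.008) is the price paid for turning regression into classification; in practice it is well below the robot's own tracking error.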

💡 Action frequency and latency

Most manipulation policies run at 3–10 Hz (3–10 actions per second). RT-1 runs at 3 Hz; many research systems operate at 5 Hz. This is far slower than traditional robot controllers (which run at 100–1000 Hz) but fast enough for most pick-and-place and tabletop manipulation tasks. The inference latency of the policy — the time to process an observation and produce an action — is a hard constraint. A VLA based on a 7B-parameter language model may take 100–300 ms per forward pass on a high-end GPU, limiting the control frequency to 3–10 Hz. This is one reason why action chunking (predicting multiple future actions at once) has become important — it allows the robot to continue executing while the next chunk is computed.
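Action chunking can be sketched in a few lines: the policy returns a chunk of future actions, and the runner drains that buffer before querying the (slow) policy again. The policy below is a stand-in stub, not a real model, and the class name is ours:

```python
import numpy as np

class ChunkedPolicyRunner:
    """Execute actions from a precomputed chunk, querying the slow
    policy only when the buffer runs dry."""

    def __init__(self, policy, chunk_size=8):
        self.policy = policy          # callable: obs -> (chunk_size, action_dim)
        self.chunk_size = chunk_size
        self.buffer = []

    def act(self, obs):
        if not self.buffer:                    # buffer empty: run the policy
            chunk = self.policy(obs)
            self.buffer = list(chunk)
        return self.buffer.pop(0)              # otherwise replay the chunk

# Stub policy: returns 8 copies of the first 7 observation dims as "actions"
fake_policy = lambda obs: np.tile(obs[:7], (8, 1))
runner = ChunkedPolicyRunner(fake_policy, chunk_size=8)
obs = np.arange(10.0)
first = runner.act(obs)
print(len(runner.buffer))  # 7 actions remain before the policy runs again
```

Real systems refine this with asynchronous inference (computing the next chunk while the current one executes) and smoothing across chunk boundaries, but the buffering idea is the core.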

Action Space Comparison — Continuous vs. Discretized Interactive

Compare how a continuous Gaussian policy and a discretized (tokenized) policy represent the same action distribution. Notice how discretization naturally captures multimodal distributions.


Simulation Environments

Training a robot policy requires thousands to millions of episodes of experience. Collecting this data on physical hardware is slow (real-time execution), expensive (hardware wear, human supervision), and dangerous (crashes, drops, collisions). Simulation offers a compelling alternative: episodes run faster than real time, hardware never breaks, and the environment can be reset instantly.

This is why the vast majority of robot learning research begins in simulation. A policy trained in simulation can then be transferred to a real robot — but this transfer is far from trivial, as we'll discuss in the next section.

Physics engines

A robotics simulator must solve two problems: physics simulation (computing how objects move, collide, and interact under forces) and rendering (generating visual observations that approximate real camera images).

Physics simulation involves integrating Newton's equations of motion at discrete timesteps, detecting and resolving collisions between rigid and deformable bodies, and simulating contact forces (friction, restitution, compliance). The most widely used physics engines in robot learning are:

  • MuJoCo (Multi-Joint dynamics with Contact): Originally developed by Emo Todorov at the University of Washington, acquired by DeepMind in 2021 and made open-source. MuJoCo uses a convex optimization-based contact solver that provides smooth, differentiable contact dynamics. It is the most popular physics engine in robot learning research, used by the majority of manipulation and locomotion benchmarks.
  • PhysX (NVIDIA): A GPU-accelerated physics engine originally developed for gaming. NVIDIA's Isaac Sim builds on PhysX to provide photorealistic rendering and massive parallelism — thousands of environments can run simultaneously on a single GPU.
  • Bullet (PyBullet): An open-source physics engine with a Python interface. Widely used in earlier robot learning research (pre-2022) and in benchmark suites like pybullet-gym. Less accurate than MuJoCo for contact-rich manipulation but easier to set up.
  • Drake (MIT/Toyota Research Institute): Focuses on mathematical rigor and provides tools for formal verification and trajectory optimization alongside simulation. Used primarily in planning-centric research.

Major simulation platforms

Building on these physics engines, several integrated simulation platforms provide complete environments for robot learning research:

NVIDIA Isaac Sim
GPU-Accelerated

Photorealistic rendering via RTX ray tracing, GPU-parallel physics, synthetic data generation (domain randomization built in). Supports thousands of parallel environments on a single GPU. Used for large-scale RL and sim-to-real transfer.

MuJoCo + dm_control
Research Standard

DeepMind's control suite provides standardized benchmarks (cartpole, humanoid, manipulation) atop MuJoCo. Fast CPU-based simulation with accurate contact dynamics. The default choice for locomotion and manipulation research.

Meta-World / robosuite
Manipulation Benchmarks

Meta-World provides 50 manipulation tasks for multi-task and meta-learning research. robosuite (Stanford) offers modular task composition with the Franka Panda robot. Both build on MuJoCo and are standard benchmarks for policy learning papers.

ℹ Simulation speed

Modern GPU-accelerated simulators like Isaac Sim can run 4,000+ environments in parallel at real-time or faster, generating millions of transitions per hour on a single A100 GPU. MuJoCo on CPU typically runs a single environment at 10–100x real time. This speed difference is why GPU-based simulation has become dominant for large-scale RL training, while CPU-based simulators remain preferred for smaller-scale experiments and debugging.
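The throughput argument is easy to see in miniature: if the state of N environments is stored as a batch of arrays, one vectorized step advances all N at once. Below is a toy batched point-mass integrator, not a real physics engine, purely to illustrate why batched simulation scales so well:

```python
import numpy as np

# Batched Euler integration for N parallel point-mass "environments".
# State is (position, velocity); the action is a constant force per env.
N, dt = 4096, 0.01
pos = np.zeros((N, 3))
vel = np.zeros((N, 3))
force = np.random.randn(N, 3)          # one random action per environment

def step(pos, vel, force, mass=1.0):
    """Advance all N environments by one timestep with a single
    vectorized update (no Python loop over environments)."""
    vel = vel + (force / mass) * dt
    pos = pos + vel * dt
    return pos, vel

for _ in range(100):                   # 100 timesteps across 4096 envs
    pos, vel = step(pos, vel, force)
print(pos.shape)  # (4096, 3)
```

GPU simulators apply the same pattern to contact-rich rigid-body dynamics, which is far harder, but the key property is identical: per-step cost grows sublinearly in N because the work is batched.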

Simulation Platforms — Feature Comparison Interactive

Hover over each platform to see its strengths across key dimensions.

The Reality Gap

A policy that achieves 95% success rate in simulation may achieve 30% on a real robot. This discrepancy — the reality gap — is the central challenge of simulation-based robot learning. It arises because simulations are imperfect models of the real world:

  • Visual gap. Simulated images look different from real camera images. Lighting, textures, reflections, shadows, sensor noise, lens distortion, and color balance all differ. A policy trained on rendered images may fail when it encounters real-world visual complexity.
  • Dynamics gap. Simulated physics are approximations. Contact dynamics (friction, deformation, slipping) are particularly hard to model accurately. A simulated grasp that succeeds may fail in reality because the real object has slightly different friction or mass.
  • Actuator gap. Simulated motors respond instantly and perfectly to commands. Real motors have latency, backlash, compliance, and current limits. The gap between commanded and actual joint positions can be significant, especially during fast motions or contact.

Sim-to-real transfer

Several strategies have been developed to bridge the reality gap. They can be broadly categorized into three approaches:

System identification: Measure the physical parameters of the real system (object masses, friction coefficients, motor gains) and configure the simulator to match as closely as possible. This works for well-characterized systems but fails when the relevant parameters are unknown or vary across objects (e.g., different cups have different friction).

Domain adaptation: Train a model to transform simulated observations to look like real observations (or vice versa). CycleGAN and other image-to-image translation methods can map rendered images to photorealistic versions. However, this adds another learned component that can introduce its own errors.

Domain randomization: Instead of making the simulator match reality exactly, randomize the simulation parameters across a wide range during training. The hope is that the real world is just "another randomized instance" that falls within the distribution the policy has been trained on. This is the most widely used approach in current practice.

Domain randomization in detail

Domain randomization was introduced by Tobin et al. (2017) at OpenAI for transfer of object detection models and has since become standard practice in sim-to-real robot learning. The idea is simple but powerful: during training, randomize everything that might differ between simulation and reality.

Common randomization parameters include:

Category | Parameters Randomized | Typical Range
Visual | Lighting position, intensity, color; texture colors and patterns; camera position and angle; background | Wide uniform ranges
Physical | Object mass, friction, damping; table height; gripper friction | ±30–50% of nominal
Dynamics | Action delay, control frequency, motor gains; observation noise | ±20–40% of nominal
Geometric | Object size, shape variation; initial position distribution; obstacle placement | Task-dependent

The key insight of domain randomization is that it trades simulation fidelity for policy robustness. A policy trained in a single, carefully-calibrated simulator may overfit to that specific simulator's quirks. A policy trained across thousands of randomized variations must learn features and behaviors that are invariant to the specifics of any one setting — and real-world conditions are just another setting in this distribution.

OpenAI's landmark 2019 demonstration of Rubik's cube solving by a dexterous robotic hand (the Dactyl project) exemplified this approach: the policy was trained entirely in simulation with aggressive domain randomization, including randomizing gravity, object size, friction, actuator gains, and visual appearance. Despite the enormous reality gap for dexterous manipulation, the policy transferred successfully to the physical robot hand.

💡 The shift away from sim-to-real

While sim-to-real remains important, the VLA revolution has shifted the field's emphasis toward learning from real-world demonstrations. RT-1 was trained on 130,000 real robot episodes collected over 17 months. The Open X-Embodiment dataset aggregates over 1 million real episodes from 22 robot types. This shift is driven by two factors: (1) large-scale data collection has become more efficient through fleet learning (Google used a fleet of 13 identical robots), and (2) the visual and dynamics gaps remain difficult to close for diverse, unstructured environments. Current VLA research typically uses real demonstration data for policy learning and reserves simulation for initial prototyping and ablation studies.

Robot Platforms

VLA research is anchored to specific hardware platforms. The choice of robot determines the action space dimensions, the achievable manipulation capabilities, and the cost of data collection. A few platforms dominate current research:

Robot | Type | DOF | Key Users | Notable For
Google Robot | Mobile manipulator | 7+1 arm + base | Google DeepMind | RT-1, RT-2 training platform; fleet of 13 robots for large-scale data
Franka Emika Panda | Desktop arm | 7+1 | Most university labs | Research workhorse; used in DROID and many benchmarks
WidowX 250 | Desktop arm | 6+1 | UC Berkeley, Stanford | Low cost (~$3,000); used in Bridge V2 dataset
xArm | Desktop arm | 6+1 / 7+1 | Various labs | Cost-effective alternative to Panda with good precision
ALOHA (Bi-arm) | Dual-arm teleoperation | 2×(6+1) | Stanford, Google | Low-cost bimanual platform for demonstration collection

An important trend is the development of low-cost, open-source robot platforms designed specifically for data collection. Stanford's ALOHA system, which uses two ViperX 300 follower arms puppeteered through a matched pair of smaller leader arms for bimanual teleoperation, costs under $20,000 and has enabled several groups to collect large-scale bimanual manipulation datasets. The Mobile ALOHA variant adds a mobile base and whole-body teleoperation, extending the platform to household tasks like cooking and cleaning.

The diversity of robot platforms creates a challenge for generalization. A policy trained on the Google Robot's 7-DOF arm with a specific gripper may not transfer to a Franka Panda with a different gripper geometry and different dynamics. This is the cross-embodiment generalization problem — one of the central motivations for foundation model approaches to robotics, which we'll explore in detail in Article 06 on training data and pipelines.

ℹ A brief history of embodied AI milestones
1961
Unimate — first industrial robot arm deployed at GM, performing die casting. Purely programmed, no learning.
1970
Shakey the Robot (SRI) — first mobile robot integrating perception, planning, and action. Could navigate rooms and push objects.
2013
DQN (DeepMind) — deep RL achieves superhuman performance on Atari games, demonstrating that neural networks can learn policies from pixels.
2016
Levine et al. — large-scale robotic grasping with deep learning. 800,000 grasp attempts across 14 robots over 2 months.
2019
OpenAI Dactyl — Rubik's cube solving with a dexterous hand, trained entirely in simulation with domain randomization.
2022
RT-1 (Google) — Robotics Transformer trained on 130K real episodes. First demonstration of transformer-scale learning for real robot control.
2023
RT-2 (Google) — first vision-language-action model. A VLM that directly outputs robot actions as tokens, demonstrating emergent reasoning capabilities.
2024
OpenVLA, Octo, π0 — open-source VLA models, cross-embodiment pretraining, and flow-matching action heads push the field toward generalist robot policies.

Code Examples

Let's ground these concepts with working code. We'll set up a basic manipulation environment in MuJoCo, inspect the observation and action spaces, and run a simple policy.

Setting up a MuJoCo environment

python
import gymnasium as gym
import numpy as np

# Create the Franka Kitchen manipulation environment from gymnasium-robotics
env = gym.make(
    "FrankaKitchen-v1",
    tasks_to_complete=["microwave", "kettle"],
    render_mode="human",
)

# Inspect the observation space
obs, info = env.reset()
print(f"Observation keys: {list(obs.keys())}")
print(f"  observation shape: {obs['observation'].shape}")  # robot + scene state
# 'desired_goal' and 'achieved_goal' are dictionaries keyed by task name
for task, goal in obs["desired_goal"].items():
    print(f"  desired_goal[{task}] shape: {goal.shape}")

# Inspect the action space
print(f"\nAction space: {env.action_space}")
print(f"  shape: {env.action_space.shape}")  # (9,): 7 arm joints + 2 gripper fingers
print(f"  low:  {env.action_space.low[:3]}...")
print(f"  high: {env.action_space.high[:3]}...")

# Run a random policy for one episode
obs, info = env.reset()
total_reward = 0
for step in range(200):
    action = env.action_space.sample()  # random action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break

print(f"\nEpisode finished after {step+1} steps")
print(f"Total reward: {total_reward:.2f}")
env.close()
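The reset/step loop above is the universal interface for embodied agents, so it is worth factoring into a reusable rollout helper that records (observation, action, reward) transitions. The sketch below is hedged: `ToyEnv` is a hypothetical stand-in that mimics the Gymnasium step API, used here only so the snippet runs without MuJoCo installed; `rollout` itself works unchanged with any Gymnasium-style environment.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in with the Gymnasium reset/step API (no MuJoCo needed)."""
    def __init__(self, horizon=10, obs_dim=4, act_dim=2):
        self.horizon, self.obs_dim, self.act_dim = horizon, obs_dim, act_dim

    def reset(self):
        self.t = 0
        return np.zeros(self.obs_dim), {}

    def step(self, action):
        self.t += 1
        obs = np.random.randn(self.obs_dim)
        reward = -float(np.linalg.norm(action))  # penalize large actions
        truncated = self.t >= self.horizon       # time limit, not task failure
        return obs, reward, False, truncated, {}

def rollout(env, policy, max_steps=200):
    """Collect one episode of (obs, action, reward) transitions."""
    obs, info = env.reset()
    traj = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        traj.append((obs, action, reward))
        obs = next_obs
        if terminated or truncated:
            break
    return traj

traj = rollout(ToyEnv(), policy=lambda obs: np.zeros(2))
print(f"Collected {len(traj)} transitions")  # 10
```

The same helper is the backbone of both data collection for imitation learning and evaluation of trained policies.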

Inspecting a robosuite environment

python
import robosuite as suite
import numpy as np

# Create a cube-lifting environment (the Lift task) with a Panda arm
env = suite.make(
    env_name="Lift",
    robots="Panda",
    has_renderer=True,
    has_offscreen_renderer=True,
    use_camera_obs=True,          # include camera images
    camera_names="agentview",     # third-person camera
    camera_heights=224,
    camera_widths=224,
    control_freq=20,              # 20 Hz control
    controller_configs=suite.load_controller_config(
        default_controller="OSC_POSE"  # Operational Space Control
    ),
)

obs = env.reset()

# The observation dictionary contains everything the policy sees
print("Observation keys:")
for key, val in obs.items():
    if isinstance(val, np.ndarray):
        print(f"  {key:30s} shape={val.shape} dtype={val.dtype}")

# Key observations:
#   robot0_joint_pos             (7,)    - joint angles
#   robot0_joint_vel             (7,)    - joint velocities
#   robot0_eef_pos               (3,)    - end-effector position
#   robot0_eef_quat              (4,)    - end-effector orientation
#   robot0_gripper_qpos          (2,)    - gripper finger positions
#   agentview_image              (224, 224, 3) - camera image
#   cube_pos                     (3,)    - object position (privileged)

# Action space for OSC_POSE: [dx, dy, dz, dax, day, daz, gripper]
print(f"\nAction dimension: {env.action_dim}")  # 7
action = np.zeros(env.action_dim)
action[2] = -0.5   # move down
action[6] = 1.0    # close gripper
obs, reward, done, info = env.step(action)
print(f"Reward: {reward:.4f}")
env.close()
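Because OSC_POSE actions are end-effector deltas, a scripted reach can be expressed as a proportional controller on the observed `robot0_eef_pos` and `cube_pos`. The helper below is an illustrative sketch, not part of robosuite: `osc_reach_action` and its gain are made up, and the deltas are clipped to the normalized [-1, 1] action range.

```python
import numpy as np

def osc_reach_action(eef_pos, target_pos, kp=5.0, close_gripper=False):
    """Build an OSC_POSE action [dx, dy, dz, dax, day, daz, gripper]
    that moves the end-effector toward target_pos (illustrative gain)."""
    action = np.zeros(7)
    # Proportional position delta, clipped to the normalized action range
    delta = kp * (np.asarray(target_pos, dtype=float) - np.asarray(eef_pos, dtype=float))
    action[:3] = np.clip(delta, -1.0, 1.0)
    # Gripper convention: 1 closes, -1 opens
    action[6] = 1.0 if close_gripper else -1.0
    return action

# Example: end-effector 15 cm above the cube -> move down, gripper open
a = osc_reach_action(eef_pos=[0.0, 0.0, 1.00], target_pos=[0.0, 0.0, 0.85])
print(a[:3], a[6])  # downward z-delta, gripper = -1.0
```

In a control loop, you would call this every step with the current observation until the end-effector is within a small threshold of the cube, then switch `close_gripper` to `True`.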

Domain randomization example

python
import numpy as np

class DomainRandomizer:
    """Randomize simulation parameters for sim-to-real transfer."""

    def __init__(self, env, seed=42):
        self.env = env
        self.rng = np.random.RandomState(seed)

    def randomize(self):
        """Apply randomization before each episode.

        Assumes `self.env.model` exposes the raw MuJoCo model; for wrapped
        Gymnasium environments, pass `env.unwrapped` to the randomizer.
        """
        # Visual randomization: light positions and colors
        self.env.model.light_pos[:] = self.rng.uniform(
            low=[-2, -2, 1], high=[2, 2, 3]
        )
        self.env.model.light_diffuse[:] = self.rng.uniform(0.3, 1.0, size=3)

        # Physics randomization
        for geom_id in range(self.env.model.ngeom):
            # Randomize friction: (sliding, torsional, rolling)
            nominal_friction = np.array([1.0, 0.005, 0.0001])
            scale = self.rng.uniform(0.5, 1.5)
            self.env.model.geom_friction[geom_id] = nominal_friction * scale

        # Randomize object mass (within +/- 30%)
        for body_id in range(self.env.model.nbody):
            if self.env.model.body_mass[body_id] > 0.01:  # skip ground
                nominal = self.env.model.body_mass[body_id]
                self.env.model.body_mass[body_id] = nominal * self.rng.uniform(0.7, 1.3)

        # Randomize action delay (0-2 timesteps)
        self.action_delay = self.rng.randint(0, 3)

        # Randomize observation noise
        self.obs_noise_std = self.rng.uniform(0.0, 0.02)

    def add_observation_noise(self, obs):
        """Add Gaussian noise to proprioceptive observations."""
        return obs + self.rng.normal(0, self.obs_noise_std, size=obs.shape)

# Usage in training loop
randomizer = DomainRandomizer(env)
num_episodes = 100
for episode in range(num_episodes):
    randomizer.randomize()  # new randomization each episode
    obs = env.reset()
    obs = randomizer.add_observation_noise(obs)
    # ... run policy ...
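Note that the randomizer samples an `action_delay` but the snippet above never applies it; doing so requires a small buffer between the policy and the environment. One way to implement it is sketched below (`ActionDelayBuffer` is a hypothetical helper, not part of any library): actions are queued, the environment receives the action issued `delay` steps earlier, and a neutral zero action is emitted until the queue fills, which is a simple model of actuation latency.

```python
from collections import deque
import numpy as np

class ActionDelayBuffer:
    """Delay actions by a fixed number of control steps.

    Until `delay` actions have been issued, the environment receives a
    neutral (zero) action -- a crude model of actuation latency.
    """

    def __init__(self, delay: int, action_dim: int):
        self.queue = deque(np.zeros(action_dim) for _ in range(delay))

    def step(self, action: np.ndarray) -> np.ndarray:
        """Enqueue the policy's action; return the action the env should see."""
        self.queue.append(np.asarray(action, dtype=float))
        return self.queue.popleft()

# With delay=2, the policy's first action reaches the env on step 3
buf = ActionDelayBuffer(delay=2, action_dim=3)
print(buf.step(1 * np.ones(3)))  # neutral (zeros)
print(buf.step(2 * np.ones(3)))  # neutral (zeros)
print(buf.step(3 * np.ones(3)))  # the first action (ones)
```

In the training loop, you would rebuild the buffer each episode with the freshly sampled `randomizer.action_delay` and route every policy action through `buf.step` before calling `env.step`.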

MDP formalization in code

python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDPComponents:
    """Concrete example of MDP components for a reaching task."""

    # State space: joint angles (7) + joint velocities (7) + target position (3)
    state_dim: int = 17

    # Action space: joint velocity commands (7)
    action_dim: int = 7

    # Discount factor
    gamma: float = 0.99

    def reward(self, state: np.ndarray, action: np.ndarray) -> float:
        """Reward = negative distance to target - action penalty."""
        eef_pos = self.forward_kinematics(state[:7])  # simplified
        target_pos = state[14:17]
        distance = np.linalg.norm(eef_pos - target_pos)

        # Dense reward: closer is better, small action penalty
        return -distance - 0.01 * np.linalg.norm(action)

    def is_terminal(self, state: np.ndarray) -> bool:
        """Episode ends when gripper is close to target."""
        eef_pos = self.forward_kinematics(state[:7])
        target_pos = state[14:17]
        return np.linalg.norm(eef_pos - target_pos) < 0.02  # 2cm threshold

    def forward_kinematics(self, joint_angles: np.ndarray) -> np.ndarray:
        """Compute end-effector position from joint angles (simplified)."""
        # In practice this uses the robot's kinematic chain (e.g. DH parameters).
        # Here we use a planar-chain placeholder with Panda-like link lengths.
        link_lengths = np.array([0.333, 0.316, 0.0825, 0.384, 0.0, 0.088, 0.107])
        pos = np.zeros(3)
        angle_sum = 0.0
        for i, (l, q) in enumerate(zip(link_lengths, joint_angles)):
            angle_sum += q
            pos[0] += l * np.cos(angle_sum)
            pos[2] += l * np.sin(angle_sum)
        return pos

# The optimal policy maximizes: J(pi) = E[sum_{t=0}^T gamma^t * R(s_t, a_t)]
mdp = MDPComponents()
print(f"State dim: {mdp.state_dim}, Action dim: {mdp.action_dim}")
print(f"Discount factor: {mdp.gamma}")
print(f"Effective horizon: ~{round(1/(1-mdp.gamma))} steps")  # ~100
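The objective in the comment above can be estimated directly from sampled reward sequences. A minimal sketch (`discounted_return` is an illustrative helper): accumulating the return backwards through time avoids ever forming the powers of gamma explicitly.

```python
import numpy as np

def discounted_return(rewards, gamma: float) -> float:
    """Compute G = sum_t gamma^t * r_t via a backward recursion:
    G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Sanity check against the geometric series:
# sum_{t=0}^{T-1} gamma^t = (1 - gamma^T) / (1 - gamma)
T, gamma = 50, 0.99
g = discounted_return([1.0] * T, gamma)
print(g, (1 - gamma**T) / (1 - gamma))  # both ~39.50
```

Averaging `discounted_return` over many rollouts of a policy gives a Monte Carlo estimate of J(pi), which is exactly what policy-gradient methods differentiate.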

References

Seminal papers and key works referenced in this article.

  1. Sutton & Barto. "Reinforcement Learning: An Introduction." MIT Press, 2018.
  2. Todorov et al. "MuJoCo: A physics engine for model-based control." IROS, 2012.
  3. Makoviychuk et al. "Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning." NeurIPS, 2021.
  4. Tobin et al. "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." IROS, 2017.
  5. Levine et al. "Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection." IJRR, 2018.