Introduction
Consider the instruction: "Make me a coffee." A human understands this involves dozens of substeps — locating the mug, checking if it's clean, finding the coffee machine, filling the water reservoir, inserting a pod, placing the mug, pressing the brew button, waiting, and delivering the result. Each substep requires perception, reasoning about physical constraints, and motor execution. The planning problem in robotics is about bridging this gap between a high-level intent expressed in natural language and the low-level joint torques that make it happen.
For decades, robot planning meant STRIPS-style symbolic planners operating over hand-defined predicates: on(cup, table), gripper_empty(), reachable(cup). These systems were brittle. They required exhaustive specification of every object, predicate, and action schema. They couldn't handle novel objects or ambiguous instructions. And they failed catastrophically when the world didn't match their symbolic model.
The emergence of large language models changed this calculus fundamentally. LLMs encode vast knowledge about the world: what objects are used for, how tasks are typically structured, what steps come before and after others. The critical insight of recent work is that this knowledge, while not grounded in physical reality, can be combined with perceptual and motor primitives to produce remarkably capable robot planners — without hand-engineering a single predicate.
This article traces the arc from SayCan's affordance-grounded scoring through inner monologue's closed-loop feedback and code-as-policy's programmatic approach, to world models and chain-of-thought reasoning. Each method represents a different answer to the same question: how do we get the knowledge inside an LLM out into the physical world?
This article covers: task decomposition and hierarchical planning; SayCan's affordance scoring; inner monologue for closed-loop replanning; code-as-policy and spatial reasoning through generated programs; Voyager and DEPS for open-ended exploration; VoxPoser's 3D value maps; world models and video prediction for planning; chain-of-thought reasoning in robot policies (RT-2); and the key limitations of LLM-driven planning systems — with interactive visualizations and code examples.
Task Decomposition
Hierarchical planning
The fundamental challenge in long-horizon manipulation is combinatorial explosion. If a robot has 100 primitive actions and must plan 20 steps ahead, the search space is 100^20 = 10^40 — far beyond what any planner can enumerate. Hierarchical planning addresses this by decomposing problems into multiple levels of abstraction.
Classical hierarchical task networks (HTNs) decompose abstract tasks into concrete subtasks using hand-authored decomposition rules. A task like serve_coffee decomposes into find_mug → pick_mug → place_under_spout → press_brew → deliver_mug. Each of these can further decompose: pick_mug becomes navigate_to_mug → open_gripper → move_to_grasp_pose → close_gripper → lift.
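The hand-authored rules described above can be sketched as a recursive lookup table. This is a minimal, hypothetical HTN in plain Python — the method table and task names are illustrative, not taken from any specific planner:

```python
# Hypothetical hand-authored HTN decomposition rules.
# Abstract tasks map to ordered subtasks; anything without
# a rule is treated as an executable primitive.
METHODS = {
    "serve_coffee": ["find_mug", "pick_mug", "place_under_spout",
                     "press_brew", "deliver_mug"],
    "pick_mug": ["navigate_to_mug", "open_gripper",
                 "move_to_grasp_pose", "close_gripper", "lift"],
}

def decompose(task: str) -> list[str]:
    """Recursively expand a task into a flat sequence of primitives."""
    if task not in METHODS:          # no rule: treat as a primitive
        return [task]
    plan = []
    for subtask in METHODS[task]:
        plan.extend(decompose(subtask))
    return plan

print(decompose("serve_coffee"))
```

Every new task, object, or environment means another entry in METHODS — exactly the specification burden described next.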
The problem is specification. Defining decomposition rules for every possible task in every possible environment is a Sisyphean labor. Every new object, every new task, every new environment requires new rules. This is where LLMs offer a qualitative leap: they can perform task decomposition zero-shot, drawing on the implicit procedural knowledge encoded in their training data.
LLMs as planners
Huang et al. (2022, "Language Models as Zero-Shot Planners") demonstrated that large language models can decompose high-level instructions into ordered sequences of plausible substeps with no task-specific training. Given the instruction "Make breakfast," GPT-3 produces:
- Go to the kitchen
- Open the refrigerator
- Take out the eggs
- Close the refrigerator
- Turn on the stove
- Get a pan from the cabinet
- Put the pan on the stove
- Crack the eggs into the pan
This is remarkable: the model has never operated a kitchen, yet it produces a plausible, correctly-ordered plan. But "plausible" is not "executable." The LLM doesn't know whether the kitchen actually has eggs, whether the stove is gas or electric, or whether the robot's gripper can crack an egg. This gap between linguistic plausibility and physical feasibility is the central tension in LLM-based robot planning.
LLMs are trained on internet text, not robot experience. They know that "pick up the cup" is a reasonable action, but they do not know whether the cup is within reach, whether the gripper can fit around it, or whether the cup is too heavy. Every approach in this article is, at its core, a different strategy for grounding LLM knowledge in physical reality — through affordance models, perception feedback, generated code, or learned dynamics.
Click on different high-level instructions to see how an LLM decomposes them into hierarchical subtask trees. Each level adds physical detail.
SayCan: Grounding Language in Affordances
SayCan (Ahn et al., 2022) is the foundational work on grounding LLM planning in physical affordances. The key insight is elegant: an LLM knows what actions are useful for a task (semantic knowledge), while a learned value function knows what actions are possible in the current state (physical knowledge). Combining both yields actions that are both useful and feasible.
Affordance grounding
The term "affordance" comes from ecological psychology (Gibson, 1979): an affordance is what the environment offers the agent. A flat surface affords placing; a handle affords grasping; an open drawer affords reaching inside. In SayCan, affordances are operationalized as the probability that a low-level skill will succeed in the current state.
SayCan maintains a library of 551 short-horizon manipulation skills, each trained with reinforcement learning in the real world (using a fleet of mobile manipulators at Everyday Robots). Each skill is described by a natural language label: "pick up the red bull can," "go to the counter," "place on the table." For each skill, there is a corresponding affordance function (a learned value function) that estimates the probability of successful execution from the current state.
The scoring mechanism
Given a natural language instruction (e.g., "I spilled my drink, can you help?"), SayCan scores each candidate action a_i by combining two probabilities:

score(a_i) = P_LLM(a_i | instruction, history) × P_affordance(a_i | state)

where:
P_LLM = language model probability that a_i is a useful next step
P_affordance = value function estimate that a_i will succeed
The LLM provides P_LLM by scoring the log-likelihood of each skill's text description as a continuation of the prompt. The prompt includes the instruction and the history of actions taken so far. For example, after "I spilled my drink, can you help?" the LLM assigns high probability to "find a sponge" and low probability to "pick up the apple."
The affordance model provides P_affordance by evaluating the current observation with each skill's value function. If the sponge is visible and reachable, P_affordance("find a sponge") is high. If the sponge is in another room, it's low — even though the LLM rates it as semantically useful.
The action with the highest combined score is selected and executed. After execution, the new state is observed, the action is appended to the history, and the process repeats until the LLM assigns high probability to a termination token.
| Candidate Action | P_LLM | P_affordance | Combined Score |
|---|---|---|---|
| Find a sponge | 0.38 | 0.85 | 0.323 |
| Pick up the towel | 0.30 | 0.72 | 0.216 |
| Go to the counter | 0.12 | 0.91 | 0.109 |
| Pick up the apple | 0.02 | 0.88 | 0.018 |
| Open the drawer | 0.04 | 0.15 | 0.006 |
The table above illustrates the critical contribution of affordance grounding. Without affordances, the LLM might select actions that are semantically reasonable but physically impossible. With affordances, the system selects the best action the robot can actually execute. SayCan achieved 84% end-to-end planning success on long-horizon tasks in a real kitchen, compared to 14% for an LLM planner without affordance grounding.
The multiplicative combination is crucial. If the LLM says an action is useful (P_LLM = 0.9) but the affordance model says it will fail (P_affordance = 0.01), the combined score is 0.009 — effectively vetoed. Addition would give 0.91, potentially still selecting the infeasible action. Multiplication ensures that both relevance and feasibility must be present. This can also be interpreted as a joint probability under a conditional independence assumption.
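A toy calculation makes the veto effect concrete. The numbers below are illustrative, chosen so that addition misranks the infeasible action while multiplication does not:

```python
# (P_LLM, P_affordance) for two hypothetical candidate actions
candidates = {
    "pick up the apple": (0.90, 0.01),  # semantically tempting, infeasible
    "find a sponge":     (0.40, 0.40),  # relevant and feasible
}

product = {a: l * f for a, (l, f) in candidates.items()}
total   = {a: l + f for a, (l, f) in candidates.items()}

print(max(product, key=product.get))  # find a sponge (0.16 vs 0.009)
print(max(total, key=total.get))      # pick up the apple (0.91 vs 0.80)
```

The product vetoes the action the robot cannot execute; the sum lets a high semantic score paper over near-certain physical failure.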
Adjust LLM probability (semantic relevance) and affordance probability (physical feasibility) to see how their product selects grounded actions. Hover over bars to inspect values.
Inner Monologue: Closed-loop Language Feedback
Closing the loop with language
SayCan plans in an open-loop fashion: after selecting and executing an action, it re-scores all candidates but has no mechanism for detecting whether the previous action succeeded or what went wrong if it failed. Inner Monologue (Huang et al., 2022) closes this loop by incorporating language-based feedback from multiple perception modules.
The architecture adds three feedback channels to the LLM planner:
- Scene description: an object detector or VLM describes what is currently visible ("I see a sponge on the counter, a red cup on the table, and a puddle on the floor").
- Success detection: after each action, a classifier reports whether it succeeded ("The robot successfully picked up the sponge" or "The pick attempt failed — the sponge was not grasped").
- Human feedback: optionally, a human can provide corrective language ("No, I meant the blue cup, not the red one").
All feedback is injected into the LLM prompt as text, maintaining a running dialogue:
# Inner Monologue prompt trace (simplified)
User: "I spilled my drink, can you clean it up?"
Scene: I see a red cup tipped over on the table, a puddle
on the floor, a sponge on the counter, paper towels
next to the sink.
Robot plan: 1. Pick up the sponge.
Robot action: Attempting to pick up the sponge...
Success detector: FAILED. Gripper did not close on object.
Scene: The sponge is still on the counter. The gripper
appears to have missed by ~3cm to the left.
Robot plan: 1. Pick up the sponge. (retry, adjust position)
Robot action: Attempting to pick up the sponge...
Success detector: SUCCESS. Sponge is in gripper.
Robot plan: 2. Go to the puddle on the floor.
Robot action: Navigating to the puddle...
Failure recovery through reasoning
The critical advantage of inner monologue is replanning after failure. Without language feedback, a robot that fails to grasp a sponge has limited recovery options — it might retry the same failed motion or continue to the next step without the sponge. With language feedback, the LLM can reason about why the failure occurred and adjust:
- If a grasp fails, the scene description reveals the object's updated position, enabling a corrected grasp attempt.
- If the target object is missing ("I don't see a sponge"), the LLM can substitute ("Use paper towels instead").
- If a human corrects ("Not that cup"), the LLM can revise the entire plan.
Inner Monologue improved long-horizon task success from 46% (open-loop SayCan) to 66% in kitchen manipulation tasks, with most of the gain coming from failure recovery. The system could handle perturbations that would be catastrophic for open-loop planners: objects moved during execution, failed grasps, and ambiguous instructions resolved through dialogue.
A deep insight from Inner Monologue is that language serves as a universal interface between heterogeneous modules. Object detectors, success classifiers, human operators, and the LLM planner all communicate through text. This avoids the integration nightmare of connecting modules with incompatible representations. Any perception module that can produce a text description can participate in the reasoning loop.
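The text-as-interface loop can be sketched in a few lines of Python. The `llm`, `describe_scene`, `execute`, and `detect_success` callables here are hypothetical stand-ins for the modules the paper describes, not its actual implementation:

```python
def inner_monologue(instruction, llm, describe_scene, execute,
                    detect_success, max_steps=20):
    """Closed-loop planning: all feedback is appended to one text prompt."""
    prompt = f'User: "{instruction}"\n'
    for _ in range(max_steps):
        prompt += f"Scene: {describe_scene()}\n"
        action = llm(prompt + "Robot plan:")       # next step, as text
        if action.strip().lower() == "done":
            break
        prompt += f"Robot plan: {action}\n"
        execute(action)
        ok = detect_success(action)
        prompt += f"Success detector: {'SUCCESS' if ok else 'FAILED'}.\n"
    return prompt                                   # full dialogue trace
```

Because every module communicates through the same growing text prompt, swapping in a different detector or adding human corrections requires no interface changes — they just append more lines.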
Code-as-Policy
Spatial reasoning through code
Code-as-Policies (Liang et al., 2023) takes a fundamentally different approach to LLM-based robot control. Instead of having the LLM select from a fixed library of skills, it generates executable Python code that composes perception API calls with control commands to solve novel tasks.
The key insight is that code is a far more expressive output format than natural language plans. Code supports variables, loops, conditionals, arithmetic, and function composition. This makes it natural to express spatial reasoning, quantitative constraints, and iterative behaviors that are awkward or impossible to specify as text plans.
Consider the instruction: "Put the red block to the left of the blue block." A text planner might generate "pick up red block, place it left of blue block" — but this doesn't specify how far left, how to compute the target position, or how to handle edge cases. Code-as-Policy generates:
# LLM-generated code for: "Put the red block to the left of the blue block"
blue_pos = detect_object("blue block") # returns (x, y, z)
target_pos = (blue_pos[0] - 0.10, # 10cm to the left
blue_pos[1], # same y
blue_pos[2]) # same height
pick("red block")
place(target_pos)
The LLM is provided with an API specification — functions like detect_object(), pick(), place(), get_position(), move_to() — and a set of in-context examples. Given a new instruction, it generates a program that composes these primitives. The generated code is executed directly on the robot.
Code-as-Policy can express behaviors that are extremely difficult for text planners:
- Spatial relations: "Place the forks evenly spaced between the plates" requires computing positions arithmetically from the plate locations.
- Iteration: "Stack all the blocks by size" requires sorting detected objects and looping through them.
- Conditionals: "If there's a cup on the table, bring it to me; otherwise, get one from the cabinet" requires branching based on perception results.
- Numerical reasoning: "Move 15cm to the right" requires understanding metric units and translating them to coordinates.
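The spatial-relations case is the clearest illustration. A sketch of what a generated program for "place the forks evenly spaced between the plates" might compute — the positions are invented, and the arithmetic is the point:

```python
import numpy as np

# Hypothetical detected positions (x, y, z) in metres
left_plate  = np.array([0.20, 0.50, 0.02])
right_plate = np.array([0.80, 0.50, 0.02])
n_forks = 3

# Evenly spaced targets strictly between the plates:
# fractions 1/4, 2/4, 3/4 along the connecting segment.
targets = [left_plate + (i + 1) / (n_forks + 1) * (right_plate - left_plate)
           for i in range(n_forks)]

for t in targets:
    print(np.round(t, 3))
```

No fixed skill library contains "place evenly spaced"; the interpolation falls out of ordinary arithmetic over detected coordinates, which is exactly what code gives the planner for free.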
Step through the code-as-policy pipeline: from instruction to LLM code generation to execution and feedback. Click each stage to see details.
Voyager and DEPS: open-ended exploration
Voyager (Wang et al., 2023) extends code-as-policy into open-ended exploration in Minecraft. It uses GPT-4 to propose exploration goals, generate code to accomplish them, and build a persistent skill library of verified programs that can be retrieved and composed for future tasks. This creates a curriculum of escalating complexity: the agent first learns to punch trees, then craft planks, then build tools, then mine ores — each new skill building on previously mastered ones.
Three components make Voyager work:
- Automatic curriculum: the LLM proposes the next exploration task based on the agent's current inventory, discovered biomes, and skill library. It avoids tasks that are too easy (already solved) or too hard (prerequisites not met).
- Skill library: successfully executed code is indexed by its description and stored for retrieval. When a new task resembles a previous one, the relevant skill is retrieved and used as a starting point or subroutine.
- Iterative refinement: if generated code fails (runtime error or task not completed), the error message and environment state are fed back to the LLM, which produces a corrected version. This self-debugging loop is critical for complex behaviors.
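The retrieval-plus-refinement loop can be sketched as follows. The `llm` and `env` interfaces are hypothetical (Voyager's real skills are JavaScript programs executed in Minecraft); the structure — retrieve, generate, run, feed errors back, store on success — is the point:

```python
def refine_until_success(task, llm, env, skill_library, max_attempts=4):
    """Generate code for a task, execute it, and feed errors back."""
    context = skill_library.get(task, "")       # retrieve a similar skill
    prompt = f"Task: {task}\nRelevant skill:\n{context}\n"
    for attempt in range(max_attempts):
        code = llm(prompt)
        ok, feedback = env.run(code)            # error message or state diff
        if ok:
            skill_library[task] = code          # store the verified program
            return code
        prompt += f"\nAttempt {attempt + 1} failed: {feedback}\nFix the code:"
    return None                                 # give up; curriculum moves on
```

Stored skills compound: later tasks retrieve earlier programs as context, which is what turns isolated successes into an escalating curriculum.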
DEPS (Wang et al., 2023) introduces "Describe, Explain, Plan, Select" — a structured prompting framework that improves LLM planning by requiring the model to first describe the current state, explain what has been tried, plan candidate next steps, and select the best one. This structured approach reduces hallucinated plans and improves coherence over long horizons.
VoxPoser: 3D Value Maps from Language
VoxPoser (Huang et al., 2023) tackles a limitation of both SayCan and code-as-policy: neither produces dense spatial information about where the robot should move in 3D space. SayCan selects from discrete skills; code-as-policy computes point targets. Neither generates the rich spatial cost functions that motion planners need.
VoxPoser uses LLMs (and VLMs) to compose 3D voxel value maps that define where the robot's end-effector should go and what it should avoid. Given the instruction "Open the top drawer," VoxPoser generates:
- An affordance map: high values at the drawer handle, indicating where to grasp.
- An avoidance map: high costs near the vase on top of the dresser, indicating what not to hit.
- A rotation map: specifying the gripper orientation for a pulling motion.
These maps are composed by the LLM generating code that calls perception APIs (open-vocabulary object detectors like OWL-ViT, depth estimation) and constructs 3D cost volumes. A motion planner (model-predictive control) then optimizes a trajectory through the resulting value landscape.
Trajectory: τ* = argmin_τ Σ_t [ −V_total(τ_t) + α ‖τ_t − τ_{t−1}‖² ]
The key advantage is zero-shot generalization. Because value maps are generated from language (not learned from demonstrations of the specific task), VoxPoser can handle novel instructions and novel objects without additional training. It achieved competitive performance on RLBench tasks with zero demonstrations, while demonstration-based methods required 100+ examples per task.
VoxPoser represents a key transition in LLM-based planning: from selecting among discrete actions (SayCan) to generating continuous spatial objectives (value maps). This enables handling tasks that require precise spatial reasoning — like navigating around obstacles or approaching objects from specific angles — that discrete action selection cannot express. The cost is computational: generating and optimizing over dense 3D volumes is significantly more expensive than scoring a list of candidate actions.
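Composing value maps reduces to elementwise arithmetic over voxel grids. The toy numpy sketch below illustrates the idea — the grid size, Gaussian bumps, object positions, and weighting are all invented for illustration, not VoxPoser's actual map construction:

```python
import numpy as np

N = 20                                   # 20x20x20 voxel workspace in [0,1]^3
grid = np.stack(np.meshgrid(*[np.linspace(0, 1, N)] * 3,
                            indexing="ij"), axis=-1)

def gaussian_bump(center, sigma=0.1):
    """Unit-height Gaussian around a 3D point, evaluated on the whole grid."""
    d2 = ((grid - np.array(center)) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

affordance = gaussian_bump([0.8, 0.5, 0.4])   # drawer handle: go here
avoidance  = gaussian_bump([0.5, 0.5, 0.6])   # vase: stay away

total = affordance - 2.0 * avoidance          # weighted composition
best = np.unravel_index(np.argmax(total), total.shape)
print(np.round(grid[best], 2))                # voxel near the handle
```

A motion planner then optimizes a trajectory through `total` rather than toward a single point, which is what lets the same machinery express both goals and obstacles.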
World Models for Robotics
Learned dynamics models
Planning requires prediction: to choose an action, the robot must anticipate its consequences. Classical model-based approaches use analytical dynamics equations (rigid-body physics, contact models), but these fail for deformable objects, liquids, granular materials, and the complex contact interactions common in manipulation.
Learned world models replace analytical dynamics with neural networks trained to predict state transitions: given the current state s_t and action a_t, predict the next state s_{t+1}. The model can then be used for planning by simulating action sequences forward and evaluating which sequence best achieves the goal.
Planning: a*_{0:H} = argmax_{a_{0:H}} Σ_{t=0}^{H} R(ŝ_t, a_t)
subject to ŝ_{t+1} = f_θ(ŝ_t, a_t)
The state representation matters enormously. Early work used low-dimensional state vectors (object positions, orientations). More recent approaches operate directly on images or latent representations:
| Approach | State Repr. | Strengths | Challenges |
|---|---|---|---|
| Analytic dynamics | Object poses | Exact for rigid bodies | Fails on deformables, contacts |
| Latent dynamics (Dreamer) | Learned latent | Compact, differentiable | Hard to interpret, train |
| Video prediction | Pixel space | Rich, general | Expensive, compounding error |
| 3D scene graphs | Object graph | Compositional | Requires perception pipeline |
Video prediction as world modeling
UniSim (Yang et al., 2023) and Genie (Bruce et al., 2024) represent a paradigm shift: using large-scale video generation models as world simulators. UniSim is a diffusion model that generates future video frames conditioned on the current frame and a specified action or text description. The model learns physics implicitly from internet video — objects fall downward, liquids flow, rigid objects maintain shape.
Genie takes this further by learning an action-controllable world model from unlabeled internet video. It infers a latent action space from video sequences (without action labels), enabling interactive generation: a user or policy can "play" the generated world by specifying actions at each step. This opens the possibility of training robot policies in generated environments rather than physical or simulated ones.
The promise of video world models for planning is immense. Instead of hand-building a simulator, the robot can "imagine" the consequences of its actions by generating future video frames and evaluating whether the imagined future matches the goal. Model-based RL with video prediction has shown early success: Dreamer (Hafner et al., 2020, 2023) learns latent dynamics from pixel observations and plans in latent space, achieving strong performance on continuous control tasks with high-dimensional observations.
Learned dynamics models accumulate prediction error over time. A 1% position error per step becomes 20% error after 20 steps. This limits the effective planning horizon. Mitigation strategies include: replanning at every step (model-predictive control), learning in latent space where errors are more contained (Dreamer), using the model only for short-horizon lookahead combined with a value function for long-horizon estimation (Dyna-style), and training with noise injection to improve robustness to distributional shift.
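Replanning at every step looks like this in skeleton form: a generic random-shooting MPC loop over a learned model f. This is a textbook sketch, not any particular paper's method; in practice the sampling is vectorized and often guided (e.g., CEM):

```python
import numpy as np

def mpc_action(state, f, reward, horizon=10, n_samples=256, action_dim=4):
    """Return the first action of the best sampled sequence under model f."""
    best_return, best_first = -np.inf, None
    for _ in range(n_samples):
        seq = np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in seq:                  # roll the learned model forward
            s = f(s, a)                # s_{t+1} = f_theta(s_t, a_t)
            total += reward(s, a)
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first                  # execute one action, then replan
```

Only the first action is executed before replanning from the newly observed state, so model error beyond the short horizon never has a chance to compound in the executed trajectory.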
Chain-of-Thought for Robot Actions
RT-2 (Brohan et al., 2023) demonstrated something unexpected: when a VLM is fine-tuned on robot data, it develops emergent reasoning capabilities that were not explicitly trained. Asked to "pick up the object that is not a fruit," RT-2 can examine a scene with an apple, a banana, and a water bottle, reason that the water bottle is not a fruit, and execute the grasp — despite never being trained on this kind of logical inference in a robot context.
This emergent reasoning comes from the VLM backbone (PaLI-X, with 55B parameters). The model retains its language reasoning abilities even after fine-tuning on robot trajectories. Chain-of-thought prompting further amplifies this: by prefixing the action output with a reasoning trace, the model is encouraged to "think" before acting.
# RT-2 chain-of-thought example (simplified)
Instruction: "Move the coke can to the correct bin"
[Scene: a coke can, a recycling bin, a trash bin]
Chain of thought:
"The object is a coke can, which is made of aluminum.
Aluminum is recyclable. The correct bin is the
recycling bin, which is on the right."
Action tokens: [x: 0.73, y: 0.45, z: 0.12, rx: 0, ry: 0,
rz: 0.1, gripper: close] → navigate → place
The implications are significant. Traditional robot learning separates perception, reasoning, and action into distinct modules with hand-designed interfaces. RT-2 collapses these into a single forward pass of a VLM. The model perceives the scene, reasons about semantics and physics, and produces motor commands — all as token prediction.
However, chain-of-thought for robots remains limited. RT-2's reasoning is shallow compared to what LLMs can do in pure text. The model struggles with multi-step logical chains, quantitative reasoning (counting objects, estimating distances), and temporal reasoning (what will happen if I do X before Y). These limitations motivate the hybrid approaches described earlier in this article, where specialized LLMs handle planning while VLAs handle execution.
Single model, end-to-end
One VLM handles perception, reasoning, and action. Simpler architecture, but reasoning depth is limited by the model and training data.
Specialized modules, language interface
Separate LLM planner, perception system, and low-level controller. Stronger reasoning, but requires integration engineering and adds latency.
Compare reactive, hierarchical, and LLM-based planning across different time horizons and task complexity. Toggle between approaches to see tradeoffs.
Limitations and Open Problems
LLM-based planning for robots has made remarkable progress, but significant limitations remain. Understanding these is essential for assessing where the field stands and where it needs to go.
Hallucination in planning
LLMs can generate plausible-sounding plans that are physically impossible. "Pick up the table and place it in the drawer" is syntactically valid and follows the pick-place pattern, but violates basic physical constraints. More subtle hallucinations are harder to catch: "Slide the glass across the table to the other person" might damage the glass, spill its contents, or fail on a high-friction surface. Affordance grounding (SayCan) mitigates this for known skills, but novel compositions remain vulnerable.
Latency and real-time control
LLM inference takes 0.5–5 seconds per query, depending on model size and hardware. This is acceptable for high-level planning (selecting the next task) but far too slow for reactive control (adjusting grip force during a grasp). Current systems handle this with a two-tier architecture: slow LLM planning at the task level, fast neural policy execution at the control level. But this creates a disconnect — the planner cannot react to events faster than its inference rate.
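The two-tier split can be sketched as a slow planning loop feeding a fast control loop through a shared subgoal. The rates and callables below are illustrative, not from any specific system:

```python
# Illustrative two-tier loop: the planner runs once per second,
# the controller at 50 Hz, always reacting to the latest subgoal.
CONTROL_HZ, PLAN_HZ = 50, 1

def run(seconds, plan_step, control_step):
    subgoal = None
    for tick in range(seconds * CONTROL_HZ):
        if tick % (CONTROL_HZ // PLAN_HZ) == 0:
            subgoal = plan_step()      # slow: LLM call (~1 s budget)
        control_step(subgoal)          # fast: reactive policy (~20 ms)
```

The disconnect in the text is visible here: between planner ticks, 49 control steps run against a possibly stale subgoal, so any event that invalidates the plan goes unnoticed for up to a full planning period.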
| Planning Approach | Latency | Horizon | Generalization | Grounding |
|---|---|---|---|---|
| SayCan | ~1s per step | Medium (5–15 steps) | Limited to skill library | Strong (value functions) |
| Inner Monologue | ~2s per step | Medium (5–15 steps) | Better (replanning) | Strong (perception feedback) |
| Code-as-Policy | ~3s for generation | Long (arbitrary programs) | Strong (novel compositions) | Moderate (depends on APIs) |
| VoxPoser | ~5s for map generation | Short (single actions) | Strong (zero-shot) | Strong (3D value maps) |
| RT-2 (end-to-end) | ~0.2s per action | Short (1–3 steps) | Moderate | Implicit (learned) |
Grounding failures
Even with affordance models, LLM planners can fail to ground language in the correct physical referents. "The cup next to the plate" might be ambiguous when multiple cups are near multiple plates. Spatial language ("to the left of," "behind," "between") is particularly challenging because it depends on the frame of reference. Code-as-policy partially addresses this with explicit coordinate computation, but requires accurate perception to begin with.
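Explicit coordinate computation makes "the cup next to the plate" checkable. A toy resolver — the object list, naming scheme, and nearest-neighbour criterion are invented for illustration:

```python
import math

# Hypothetical detections: name and 2D tabletop position in metres
objects = [
    {"name": "cup_1",   "pos": (0.30, 0.50)},
    {"name": "cup_2",   "pos": (0.75, 0.20)},
    {"name": "plate_1", "pos": (0.32, 0.55)},
]

def nearest(kind, anchor_name, objs):
    """Resolve 'the <kind> next to <anchor>' as a nearest-neighbour search."""
    anchor = next(o for o in objs if o["name"] == anchor_name)
    candidates = [o for o in objs if o["name"].startswith(kind)]
    return min(candidates,
               key=lambda o: math.dist(o["pos"], anchor["pos"]))["name"]

print(nearest("cup", "plate_1", objects))  # cup_1 (distance ~0.054 m)
```

The failure mode in the text also shows up here: when two cups are nearly equidistant from the plate, the argmin is arbitrary, and only accurate perception (or a clarifying question) can break the tie.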
Open problems
- Continuous improvement: current systems don't learn from their planning failures. A robot that discovers its plan was wrong should update its planning strategy, but LLM weights are frozen at deployment. Fine-tuning on robot experience is expensive and risks catastrophic forgetting.
- Multi-agent coordination: scaling LLM planning to multiple robots cooperating on a shared task (e.g., two arms assembling furniture) remains largely unexplored.
- Safety and reliability: a hallucinated plan in a kitchen can break dishes. In industrial or medical settings, the stakes are far higher. Formal verification of LLM-generated plans is an open research direction.
- Efficiency: current approaches require cloud-scale LLMs (70B+ parameters) for strong planning performance. Running these on-robot at low latency requires significant hardware or distillation into smaller, specialized models.
Code Examples
SayCan-style affordance scoring
import numpy as np
class SayCanPlanner:
"""Simplified SayCan: LLM scoring × affordance scoring."""
def __init__(self, skills, llm, affordance_model):
self.skills = skills # list of (name, policy, value_fn)
self.llm = llm # language model for scoring
self.affordance = affordance_model
def score_actions(self, instruction, history, observation):
"""Score all candidate actions and return the best one."""
scores = []
for name, policy, value_fn in self.skills:
# P(action | instruction, history) from LLM
prompt = self._build_prompt(instruction, history, name)
p_llm = self.llm.score(prompt) # log-likelihood
# P(success | state, action) from affordance model
p_afford = value_fn(observation) # value in [0, 1]
combined = np.exp(p_llm) * p_afford
scores.append((name, combined, p_llm, p_afford))
scores.sort(key=lambda x: x[1], reverse=True)
return scores
def plan(self, instruction, get_observation, max_steps=10):
"""Execute a full SayCan planning loop."""
history = []
for step in range(max_steps):
obs = get_observation()
scores = self.score_actions(instruction, history, obs)
best_action = scores[0][0]
if best_action == "done":
break
# Execute the selected skill
policy = self._get_policy(best_action)
policy.execute()
history.append(best_action)
return history
def _build_prompt(self, instruction, history, candidate):
steps = "\n".join(f"{i+1}. {h}" for i, h in enumerate(history))
return (f"Task: {instruction}\n"
f"Steps so far:\n{steps}\n"
f"Next step: {candidate}")
def _get_policy(self, name):
for n, p, _ in self.skills:
if n == name:
return p
raise ValueError(f"Unknown skill: {name}")
Code-as-policy template
"""Code-as-Policy: LLM generates executable robot programs."""
# ── Robot perception/control API (provided to the LLM) ──────
def detect_object(description: str) -> tuple[float, float, float]:
"""Detect object matching description. Returns (x, y, z) position."""
...
def pick(target: str | tuple) -> bool:
"""Pick up object by name or at (x, y, z) position."""
...
def place(position: tuple[float, float, float]) -> bool:
"""Place held object at (x, y, z) position."""
...
def get_objects_on(surface: str) -> list[dict]:
"""Get all objects on a surface with names and positions."""
...
def say(message: str):
"""Speak to the user."""
...
# ── Example: LLM-generated code for complex tasks ────────────
# Instruction: "Sort the blocks by color into the matching bowls"
blocks = get_objects_on("table")
bowls = {
"red": detect_object("red bowl"),
"blue": detect_object("blue bowl"),
"green": detect_object("green bowl"),
}
for block in blocks:
color = block["color"]
if color in bowls:
pick(block["name"])
place(bowls[color])
else:
say(f"No bowl found for {color} block, skipping.")
# Instruction: "Place cups in a line, 10cm apart, starting from the left"
cups = get_objects_on("table")
cups_only = [c for c in cups if "cup" in c["name"].lower()]
start_x = 0.2 # left side of workspace
y, z = 0.5, 0.02 # fixed y, z
for i, cup in enumerate(cups_only):
target = (start_x + i * 0.10, y, z)
pick(cup["name"])
place(target)
Task decomposition with LLM
"""Hierarchical task decomposition using an LLM."""
from openai import OpenAI
client = OpenAI()
DECOMPOSE_PROMPT = """You are a robot task planner. Given a high-level task,
decompose it into a sequence of low-level actions the robot can execute.
Available primitives:
- navigate_to(location)
- pick(object)
- place(object, location)
- open(container)
- close(container)
- pour(source, target)
- press(button)
- wait(seconds)
- check(condition) -> bool
Output a numbered list of primitive actions. Be specific about objects
and locations. If a step requires a check, use if/else branching.
Task: {task}
Scene: {scene}
"""
def decompose_task(task: str, scene: str) -> list[str]:
"""Use LLM to decompose a high-level task into primitives."""
response = client.chat.completions.create(
model="gpt-4",
messages=[{
"role": "user",
"content": DECOMPOSE_PROMPT.format(task=task, scene=scene)
}],
temperature=0.2,
max_tokens=500
)
plan_text = response.choices[0].message.content
# Parse numbered list into action strings
steps = []
for line in plan_text.strip().split("\n"):
line = line.strip()
if line and line[0].isdigit():
# Remove number prefix: "1. navigate_to(...)" -> "navigate_to(...)"
action = line.split(".", 1)[1].strip()
steps.append(action)
return steps
# Example usage
task = "Make a cup of coffee and bring it to the living room"
scene = ("Kitchen: coffee machine (off) on counter, clean mug in cabinet, "
"coffee pods in drawer, sugar on counter. Living room: table, couch.")
plan = decompose_task(task, scene)
for i, step in enumerate(plan, 1):
print(f" {i}. {step}")
# Expected output:
# 1. navigate_to(kitchen_cabinet)
# 2. open(cabinet)
# 3. pick(clean_mug)
# 4. close(cabinet)
# 5. place(clean_mug, coffee_machine)
# 6. open(pod_drawer)
# 7. pick(coffee_pod)
# 8. close(pod_drawer)
# 9. place(coffee_pod, coffee_machine)
# 10. press(brew_button)
# 11. wait(45)
# 12. pick(mug_with_coffee)
# 13. navigate_to(living_room_table)
# 14. place(mug_with_coffee, living_room_table)
References
Seminal papers and key works referenced in this article.
- Huang et al. "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents." ICML, 2022. arXiv
- Ahn et al. "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." CoRL, 2022. arXiv
- Huang et al. "Inner Monologue: Embodied Reasoning through Planning with Language Models." CoRL, 2022. arXiv
- Liang et al. "Code as Policies: Language Model Programs for Embodied Control." ICRA, 2023. arXiv
- Wang et al. "Voyager: An Open-Ended Embodied Agent with Large Language Models." 2023. arXiv
- Wang et al. "Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents." NeurIPS, 2023. arXiv
- Huang et al. "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models." CoRL, 2023. arXiv
- Yang et al. "Learning Interactive Real-World Simulators." ICLR, 2024. arXiv
- Bruce et al. "Genie: Generative Interactive Environments." ICML, 2024. arXiv
- Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL, 2023. arXiv
- Hafner et al. "Mastering Diverse Domains through World Models." 2023. arXiv