SayCan — Veanors

Chapter 0: The Problem

You walk into your kitchen and say: "I just spilled my drink, can you help?" A large language model responds: "You could try using a vacuum cleaner." Reasonable advice for a human, perhaps. Completely useless for a robot that has no vacuum, cannot leave the kitchen, and can only pick, place, and navigate.

This is the grounding problem. LLMs absorb enormous knowledge about everyday tasks from web text — how to cook, clean, organize — but they have never done any of it. They have never opened a drawer, felt the weight of a sponge, or checked whether an apple is within arm’s reach. Their plans are written for an idealized agent with unlimited capabilities in an infinite environment.

Meanwhile, robots are the opposite. A mobile manipulator in a kitchen can pick up a sponge, navigate to the counter, and place objects in the trash with high reliability. It knows exactly what it can do. But if you tell it "I spilled my drink, can you help?" in raw natural language, it has no idea how to decompose that into its known skills.

The fundamental mismatch: LLMs know what to do but not what a robot can do. Robots know what they can do but not what to do. SayCan fuses both: the LLM provides task knowledge ("Say") and learned affordance functions provide feasibility knowledge ("Can").

Before SayCan, there were two dead ends. You could feed the raw instruction into a language-conditioned policy — but these policies only understand short, primitive commands like "pick up the can," not abstract requests like "help me clean up." Or you could use the LLM to generate a step-by-step plan — but without grounding, it might suggest skills the robot doesn’t have, objects that aren’t present, or actions that are physically impossible from the current state.

Why can't an LLM alone produce reliable plans for a specific robot in a specific environment?

The LLM has never interacted with the physical world: it doesn't know the robot's capabilities, the objects present, or which actions are feasible from the current state
LLMs can only generate code, not natural language plans
LLMs are too slow for real-time robotics

Chapter 1: The Key Insight

SayCan’s insight is beautifully simple. Instead of asking the LLM to generate a plan (which might hallucinate impossible actions), ask it to score a fixed set of skills the robot already knows. Then multiply that score by a second score from the robot’s own value functions that captures "can I actually do this right now?"

score(π) = p(ℓ_π | i) × p(c_π | s, ℓ_π)

Where:

p(ℓ_π | i) — the LLM probability that skill π with language description ℓ_π is a useful next step for instruction i. This is the "Say" — task grounding.
p(c_π | s, ℓ_π) — the value function probability that skill π will complete successfully from state s. This is the "Can" — world grounding.

The robot selects the skill with the highest combined score, executes it, appends it to the LLM context, and repeats until a termination token is selected.

Why multiplication? Because we want both conditions to be true simultaneously. A skill that is useful but infeasible (high Say, low Can) gets suppressed. A skill that is feasible but irrelevant (low Say, high Can) also gets suppressed. Only skills that are both useful and doable rise to the top. This is a probabilistic AND — the product of two independent probabilities.
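
The suppression effect can be checked with a few numbers. The sketch below is illustrative only: the skill names and the (Say, Can) values are made up for the example and are not taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical (Say, Can) pairs; none of these numbers come from the paper.
scores: Dict[str, Tuple[float, float]] = {
    "pick up the sponge": (0.60, 0.05),  # useful but infeasible right now
    "go to the table":    (0.05, 0.95),  # feasible but irrelevant
    "find a sponge":      (0.30, 0.90),  # both useful and feasible
}

def combined_score(say: float, can: float) -> float:
    """Say x Can: high only when both factors are high."""
    return say * can

def select_skill(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Return the skill with the highest combined score."""
    return max(candidates, key=lambda name: combined_score(*candidates[name]))

for name, (say, can) in scores.items():
    print(f"{name}: {combined_score(say, can):.4f}")
print("selected:", select_skill(scores))  # selected: find a sponge
```

Neither the skill with the highest Say score nor the one with the highest Can score is selected. The product favors the only skill that does reasonably well on both.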

This factorization has a clean probabilistic interpretation. We want p(c_i | i, s, ℓ_π) — the probability that executing skill π successfully makes progress toward completing instruction i from state s. Assuming a skill that fails makes zero progress and a skill that succeeds makes progress with probability p(ℓ_π | i), we get the factorization above.
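
The step from the two assumptions to the product can be written out with the law of total probability. This is a sketch of the reasoning, not a derivation quoted from the paper: the symbol \bar{c}_\pi (the skill fails) is introduced here, and the sketch further assumes that, once success or failure of the skill is known, progress toward the instruction no longer depends on the state s.

```latex
\begin{align*}
p(c_i \mid i, s, \ell_\pi)
  &= p(c_i \mid c_\pi, i, \ell_\pi)\, p(c_\pi \mid s, \ell_\pi)
   + p(c_i \mid \bar{c}_\pi, i, \ell_\pi)\, p(\bar{c}_\pi \mid s, \ell_\pi) \\
  &= p(\ell_\pi \mid i)\, p(c_\pi \mid s, \ell_\pi)
   + 0 \cdot p(\bar{c}_\pi \mid s, \ell_\pi) \\
  &= p(\ell_\pi \mid i)\, p(c_\pi \mid s, \ell_\pi)
\end{align*}
```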

Figure: Say × Can Scoring. Each bar shows the LLM score (Say), the value function score (Can), and their product; the highest combined score wins.

In the SayCan scoring formula, what does multiplying the LLM probability by the value function probability achieve?

It ensures the selected skill is both semantically useful for the task AND physically feasible in the current state
It makes the LLM run faster
It trains the value function end-to-end

Chapter 2: LLMs for Planning

How do you get an LLM to break "I spilled my drink, can you help?" into a sequence of robot primitives? SayCan uses two techniques: prompt engineering and scoring mode.

Prompt engineering

The LLM is prompted with examples of a dialog between a user and a robot. The prompt shows the format: the user gives a high-level instruction, and the robot responds with "I would: 1. [skill], 2. [skill], ... done." This few-shot structure teaches the LLM the expected output format without any fine-tuning.

Scoring mode vs. generation mode

A critical design choice: instead of letting the LLM generate arbitrary text (which might produce skills the robot doesn’t have), SayCan uses the LLM in scoring mode. Given the instruction and dialog history so far, the LLM evaluates every skill description in the robot’s repertoire and assigns a probability to each one. This constrains the output to valid skills by construction.

Scoring vs. generative: The paper compares scoring mode against a generative baseline that freely generates text and then projects it to the nearest skill via cosine similarity with USE embeddings. The generative approach achieves 74% planning success vs. SayCan’s 84%. The gap comes from cases where the generated text doesn’t cleanly map to any available skill.
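
A minimal sketch of what scoring mode could look like: for each skill description, sum the per-token log-probabilities the language model assigns to it after the prompt, then compare across the fixed skill set. Everything here is an assumption made for illustration: `token_logprob` is a hypothetical stand-in for a real LLM call, the whitespace tokenizer is a simplification, and normalizing with a softmax over the skill set is a choice of this sketch rather than something the text above specifies.

```python
import math
from typing import Callable, Dict, List

def score_skills(prompt: str, skills: List[str],
                 token_logprob: Callable[[str, str], float]) -> Dict[str, float]:
    """Score each skill description by its log-likelihood under the LM."""
    logps: Dict[str, float] = {}
    for skill in skills:
        context, total = prompt, 0.0
        for token in skill.split():          # whitespace tokens: a simplification
            total += token_logprob(context, token)
            context += " " + token
        logps[skill] = total
    # Softmax over the fixed skill set (an assumption of this sketch).
    top = max(logps.values())
    norm = sum(math.exp(v - top) for v in logps.values())
    return {k: math.exp(v - top) / norm for k, v in logps.items()}

def toy_logprob(context: str, token: str) -> float:
    """Toy stand-in for an LLM: a few spill-related tokens are more likely."""
    likely = {"find", "a", "sponge"}
    return math.log(0.5) if token in likely else math.log(0.05)

skills = ["find a sponge", "find a coke can", "go to the table"]
prompt = "User: I spilled my drink, can you help? Robot: I would: 1."
probs = score_skills(prompt, skills, toy_logprob)
best = max(probs, key=probs.get)
print(best)  # find a sponge
```

Whatever the model would have written freely, the result is always one of the entries in `skills`, which is the property the paragraph above relies on.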

Iterative planning

SayCan doesn’t generate the full plan at once. It selects one skill, executes it, appends it to the dialog, and queries the LLM again. This means each step conditions on the current state of the world (via updated affordances) and the history of what’s been done (via the growing dialog). The LLM effectively maintains a chain of thought through the dialog context.
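
The growing dialog can be pictured as a prompt that is rebuilt before every query. The sketch below follows the "I would: 1. [skill], 2. [skill], ..." format described earlier, but the exact wording, the `User:`/`Robot:` labels, and the `build_prompt` helper are hypothetical.

```python
from typing import List

def build_prompt(few_shot: str, instruction: str, history: List[str]) -> str:
    """Rebuild the dialog so the LLM scores the next numbered step."""
    steps = "".join(f" {n}. {skill}," for n, skill in enumerate(history, start=1))
    next_index = len(history) + 1
    return f"{few_shot}\nUser: {instruction}\nRobot: I would:{steps} {next_index}."

few_shot = "(few-shot example dialogs go here)"
instruction = "I spilled my drink, can you help?"

first = build_prompt(few_shot, instruction, [])
later = build_prompt(few_shot, instruction, ["find a sponge", "pick up the sponge"])
print(first.splitlines()[-1])  # Robot: I would: 1.
print(later.splitlines()[-1])  # Robot: I would: 1. find a sponge, 2. pick up the sponge, 3.
```

Each executed skill lengthens the prompt by one numbered step, so the next round of scoring is conditioned on everything done so far.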

Figure: LLM Scoring Over Skills. The LLM scores every skill in the robot's repertoire; each bar shows p(ℓ_π | instruction), and the distribution shifts as skills are executed and appended to the dialog.

Why does SayCan use the LLM in scoring mode rather than generation mode?

Scoring mode constrains the output to valid skills the robot actually has, while generation mode can produce arbitrary text that may not map to any available skill
Generation mode is too slow
Scoring mode requires less GPU memory

Chapter 3: Affordance Functions

The word "affordance" comes from ecological psychology — it means what the environment affords (allows) an agent to do. A cup affords grasping. An empty table affords placing. A closed drawer affords opening. SayCan operationalizes this concept through value functions.

Value functions as affordances

For each skill π with language description ℓ_π, a value function V(s, ℓ_π) is trained to predict the probability that the skill will complete successfully from state s. The training uses sparse rewards: +1 if the skill succeeds, 0 otherwise. In the undiscounted, sparse reward setting, this value function directly equals the success probability:

p(c_π | s, ℓ_π) = V(s, ℓ_π) ≈ P(success | current state, skill description)
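
Why the equality holds can be spelled out in one line. Assuming the only nonzero reward is a single +1 received when the skill succeeds, and no discounting, the expected return from state s is just the success probability. The episode length T and the per-step reward r_t are notation introduced for this sketch.

```latex
\[
V(s, \ell_\pi)
  = \mathbb{E}\!\left[\sum_{t=0}^{T} r_t \,\middle|\, s, \ell_\pi\right]
  = 1 \cdot P(\text{success} \mid s, \ell_\pi) + 0 \cdot P(\text{failure} \mid s, \ell_\pi)
  = p(c_\pi \mid s, \ell_\pi)
\]
```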

What makes this work

The value function is state-conditioned. It takes in the robot’s current camera image and the skill description, and outputs a probability. This means:

"Pick up the apple" has a high value when an apple is visible and reachable, and a low value when the robot is in an empty hallway.
"Open the drawer" has a high value when the robot is near a closed drawer, and a low value near an already-open drawer.
"Navigate to the counter" has a high value from most states (the robot can usually drive somewhere), but a low value if the path is blocked.

The key realization: Value functions trained via RL on simple manipulation tasks naturally learn to be affordance detectors. A value function for "pick up the can" learns to recognize when a can is present, reachable, and graspable — exactly the information needed to ground LLM plans in physical reality. No separate perception module needed.

Multi-task training

Instead of training 551 separate value functions, SayCan trains a single multi-task model conditioned on the language description. The skill command is encoded via a pre-trained sentence encoder (Universal Sentence Encoder), and the resulting embedding is fed into the value network alongside the image observation. This amortizes training cost and enables generalization across similar skills.
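
To make the idea of language conditioning concrete, here is a deliberately tiny stand-in: one function, one set of weights, and the skill embedding as an extra input. This is not the paper's architecture. The real model is a deep network over camera images, whereas this sketch uses a single logistic layer, two-dimensional made-up feature vectors, and hand-picked weights.

```python
import math
from typing import Sequence

def value(image_features: Sequence[float], skill_embedding: Sequence[float],
          weights: Sequence[float], bias: float = 0.0) -> float:
    """One shared model: the skill embedding selects which skill is being scored."""
    x = list(image_features) + list(skill_embedding)   # concatenate both inputs
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))              # squash to (0, 1)

# Hypothetical inputs: the same observation scored under two skill embeddings.
image = [1.0, 2.0]
embed_pick_apple = [1.0, 0.0]
embed_open_drawer = [0.0, 1.0]
weights = [0.5, -0.25, 1.0, -1.0]

v_pick = value(image, embed_pick_apple, weights)
v_open = value(image, embed_open_drawer, weights)
print(round(v_pick, 3), round(v_open, 3))  # 0.731 0.269
```

The same weights produce different values for the same observation depending on the embedding, which is what lets a single model replace one value function per skill.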

Why do sparse-reward value functions (reward = 1 on success, 0 otherwise) naturally function as affordance detectors?

In the undiscounted sparse-reward setting, the value function directly estimates the probability that the skill will succeed from the current state, which is exactly what an affordance measures
Because sparse rewards are easier to train
Because the value function memorizes all possible states

Chapter 4: The SayCan Algorithm

Now let’s put it all together. The SayCan algorithm is a loop: at each step, score every skill, pick the best, execute it, update the context, and repeat.

Algorithm

Input

High-level instruction i, current state s₀, skill library Π with descriptions ℓ_Π

↓

Score (Say)

For each skill π: compute p^LLM_π = p(ℓ_π | i, history)

↓

Score (Can)

For each skill π: compute p^aff_π = p(c_π | s_n, ℓ_π) from value function

↓

Combine

p^combined_π = p^aff_π × p^LLM_π — select π_n = argmax

↓

Execute

Run skill π_n on robot, update state to s_{n+1}

↓

Append

Add ℓ_{π_n} to dialog history, n = n + 1

↓

Repeat

Loop until "done" token is selected as the best skill
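
The loop above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' implementation: `toy_say`, `toy_can`, and `toy_execute` are hypothetical stand-ins for the LLM, the value functions, and the robot; the state is a single string; every number is made up; and giving "done" a constant affordance of 1.0 is a choice of this sketch.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]

def saycan_plan(instruction: str, skills: List[str], state: str,
                say: Callable[[str, List[str]], Scores],
                can: Callable[[str], Scores],
                execute: Callable[[str, str], str],
                max_steps: int = 10) -> List[str]:
    """Score, combine, select, execute, append; stop when 'done' wins."""
    history: List[str] = []
    for _ in range(max_steps):
        say_scores = say(instruction, history)   # p(l_pi | i, history)
        can_scores = can(state)                  # p(c_pi | s_n, l_pi)
        best = max(skills, key=lambda k: say_scores[k] * can_scores[k])
        history.append(best)
        if best == "done":
            break
        state = execute(state, best)             # s_n -> s_{n+1}
    return history

SKILLS = ["find a sponge", "pick up the sponge", "bring it to you", "done"]

def toy_say(instruction: str, history: List[str]) -> Scores:
    """Ungrounded preferences: likes 'pick up' first, dislikes repeating steps."""
    scores = {"find a sponge": 0.3, "pick up the sponge": 0.4,
              "bring it to you": 0.2, "done": 0.1}
    for skill in history:
        scores[skill] = 0.01
    if "bring it to you" in history:
        scores["done"] = 0.9
    return scores

def toy_can(state: str) -> Scores:
    """Feasibility depends only on the current (toy) state."""
    return {"find a sponge": 0.9,
            "pick up the sponge": 0.9 if state == "found" else 0.05,
            "bring it to you": 0.9 if state == "holding" else 0.05,
            "done": 1.0}  # constant affordance for termination: an assumption

def toy_execute(state: str, skill: str) -> str:
    """Every skill succeeds deterministically in this toy world."""
    return {"find a sponge": "found", "pick up the sponge": "holding",
            "bring it to you": "delivered"}[skill]

plan = saycan_plan("I spilled my drink, can you help?", SKILLS, "start",
                   toy_say, toy_can, toy_execute)
print(plan)
# ['find a sponge', 'pick up the sponge', 'bring it to you', 'done']
```

At the first step the toy LLM prefers "pick up the sponge" (0.4), but its affordance from the start state is 0.05, so "find a sponge" wins with 0.3 × 0.9 = 0.27.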

The elegance: SayCan requires no fine-tuning of the LLM. The language model is used purely as a scoring function, frozen. The value functions are trained separately via RL. At inference time, the only computation is: score each skill with the LLM, score each skill with the value function, multiply, take argmax. The plan emerges from the interaction between these two scoring systems.

Figure: SayCan Decision Loop. At each step, the LLM scores skills (blue), value functions score feasibility (green), and the product (gold) selects the next action. Example instruction: "I spilled my drink, help?"

After executing a skill, what does SayCan do before selecting the next skill?

It appends the executed skill to the dialog history and re-queries both the LLM (with updated context) and the value functions (with the new state), so the next selection reflects what has been done and what is now feasible
It fine-tunes the LLM on the new experience
It retrains the value functions

Chapter 5: The Skill Library

SayCan’s robot has a repertoire of 551 skills spanning seven families, operating with 17 objects in a kitchen environment. Each skill is a short, language-described behavior with its own policy and value function.

Skill families

Pick up — "pick up the sponge," "pick up the coke can" (one per object)
Place — "put the [object] on the counter," "put the [object] in the trash"
Rearrange — "move the [object] near the [object]"
Open / Close — "open the drawer," "close the drawer"
Navigate — "go to the table," "go to the far counter"
Find — "find a sponge," "find a coke can"
Done — termination signal
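
One way to picture how a library of this shape arises is to expand a handful of language templates over objects and locations. The sketch below is an assumption about structure only: the object and location lists are small hypothetical samples, and it does not attempt to reproduce the actual 551 skills.

```python
from itertools import permutations
from typing import List

def article(noun: str) -> str:
    """Pick 'a' or 'an' so descriptions read naturally."""
    return "an" if noun[0] in "aeiou" else "a"

def build_skill_library(objects: List[str], locations: List[str]) -> List[str]:
    """Expand the skill-family templates into concrete language descriptions."""
    skills: List[str] = []
    for obj in objects:
        skills.append(f"pick up the {obj}")              # Pick up
        skills.append(f"find {article(obj)} {obj}")      # Find
        skills.append(f"put the {obj} on the counter")   # Place
        skills.append(f"put the {obj} in the trash")     # Place
    for a, b in permutations(objects, 2):                # Rearrange
        skills.append(f"move the {a} near the {b}")
    skills += ["open the drawer", "close the drawer"]    # Open / Close
    skills += [f"go to the {loc}" for loc in locations]  # Navigate
    skills.append("done")                                # Done
    return skills

library = build_skill_library(["sponge", "coke can", "apple"],
                              ["table", "far counter"])
print(len(library))  # 23
```

Because every entry is a plain sentence, the same strings can be scored by the LLM and passed to the language-conditioned value function.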

Training the policies

The skill policies are trained via two methods:

Behavior Cloning (BC-Z): Image-based, language-conditioned. Trained on human demonstrations. These achieve higher task success rates.
Reinforcement Learning (MT-Opt): Trained in simulation with sim-to-real transfer via RetinaGAN. These produce the value functions needed for affordance scoring.

A crucial decoupling: The BC policies execute the skills, but the RL-trained value functions score feasibility. You can use whatever policy training method works best for execution, as long as you have value functions from RL to provide affordances. The "brain" (LLM + value functions) and the "body" (execution policies) are separate systems.

Language conditioning

Both policies and value functions receive the skill description as input via a frozen Universal Sentence Encoder (USE). The USE embedding tells the model which skill to perform. Importantly, the sentence encoder used for low-level skill conditioning is different from the LLM used for high-level planning — each language model operates at the abstraction level it’s best suited for.

Why does SayCan use BC policies for execution but RL-trained value functions for affordance scoring?

BC policies achieve higher task success rates from demonstrations, while RL value functions provide calibrated success probabilities needed for the affordance scoring; each method contributes what it does best
BC is faster to train
RL policies don’t work on real robots

Chapter 6: Grounding Through Affordances

The affordance grounding is what makes SayCan work in the real world. Without it, the LLM generates plausible-sounding but physically impossible plans. Let’s see exactly how grounding fixes the failure modes.

Failure mode 1: Objects not present

Instruction: "Bring me a fruit." The LLM might score "pick up the banana" highest because bananas are common fruits. But there’s no banana in the kitchen — there’s only an apple. The value function for "pick up the banana" returns ≈0 (it sees no banana), while "pick up the apple" returns ≈0.8. After multiplication, the apple wins despite lower LLM score.
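
As a worked version of this case: the affordance values (about 0 and about 0.8) are the ones given above, while the LLM scores of 0.5 for the banana and 0.3 for the apple are made up for illustration.

```latex
\begin{align*}
\text{score}(\text{pick up the banana}) &= 0.5 \times 0 = 0 \\
\text{score}(\text{pick up the apple})  &= 0.3 \times 0.8 = 0.24
\end{align*}
```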

Failure mode 2: Wrong state for the skill

Instruction: "Put the can on the counter." The robot doesn’t have the can yet. The LLM might score "place the can on the counter" highly, but the value function knows the gripper is empty — that skill’s affordance is near zero. Instead, "find a coke can" has high affordance (the robot can navigate), so it gets selected first.

Failure mode 3: Hallucinated capabilities

Without the skill-constrained scoring approach, a generative LLM might suggest "use a vacuum cleaner" or "call for help" — actions completely outside the robot’s capability set. SayCan eliminates this by only scoring skills the robot actually has.

Figure: Grounding in Action. Ungrounded LLM planning (left) compared with SayCan with affordance grounding (right), showing how affordances redirect the plan to feasible actions.

Ablation result: Without value function grounding ("No VF"), planning success drops from 84% to 67%. Embodiment-specific tasks drop from 64% to just 18%. The affordance grounding is most critical when the task requires understanding the robot’s current physical state.

How does affordance grounding handle the case where the LLM’s top-scoring skill involves an object not present in the scene?

The value function returns near-zero for that skill because it can’t see the object, so the combined score drops below alternatives that involve objects actually present
The LLM is retrained to avoid mentioning missing objects
The robot searches the kitchen for the missing object

Chapter 7: Results

SayCan was evaluated on 101 real-world robotic tasks in two kitchen environments: a mock kitchen (training environment) and a real office kitchen (novel environment). The robot is a mobile manipulator from Everyday Robots with a 7-DOF arm and two-fingered gripper.

Headline numbers

Mock kitchen: 84% planning success, 74% execution success
Real kitchen: 81% planning success, 60% execution success

The gap between planning and execution success reflects that even correct plans can fail during physical execution (grasping failures, navigation errors, etc.).

Ablation results

The paper systematically ablates both the LLM and the affordance grounding:

No value function (LLM only): 67% planning, showing affordances add 17 percentage points
Generative LLM + projection: 74% planning, better than raw LLM but worse than scoring
BC with raw instruction (no LLM): 0% on all tasks — the policy cannot parse high-level instructions
BC with USE projection (no LLM): 60% on single primitives, 0% on multi-step
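
The planning numbers in this chapter can be compared two ways: as a gain in percentage points and as a relative reduction in planning errors. The short calculation below uses only the rates quoted in this chapter; the helper names are hypothetical.

```python
# Planning success rates (percent) quoted in this chapter.
planning_success = {
    "SayCan": 84,
    "No value function": 67,
    "Generative LLM + projection": 74,
}

def point_gain(method: str, baseline: str) -> int:
    """Absolute difference in planning success, in percentage points."""
    return planning_success[method] - planning_success[baseline]

def relative_error_reduction(method: str, baseline: str) -> float:
    """Fraction of the baseline's planning errors that the method removes."""
    baseline_error = 100 - planning_success[baseline]
    method_error = 100 - planning_success[method]
    return (baseline_error - method_error) / baseline_error

print(point_gain("SayCan", "No value function"))                          # 17
print(round(relative_error_reduction("SayCan", "No value function"), 3))  # 0.515
print(point_gain("SayCan", "Generative LLM + projection"))                # 10
```

Seen as errors rather than successes, adding affordance grounding cuts planning failures from 33% to 16%, roughly in half.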

Figure: Results Comparison. Planning success rates across methods. SayCan with full affordance grounding outperforms all ablations.

Error analysis

Of all failures, 65% were LLM errors (wrong skill selected) and 35% were affordance errors (value function misclassified feasibility). Common LLM failure modes include early termination (stopping before all sub-goals are met), negation errors ("not an apple" → selects apple), and ambiguous references.

Scaling with better LLMs: PaLM 540B achieves 84% planning success, while FLAN 137B achieves only 70%. This shows that a robot’s task performance can be improved simply by upgrading the underlying language model, with no retraining of any robotic component required.

What percentage of SayCan’s failures are due to LLM errors vs. affordance function errors?

65% LLM errors (wrong skill selected) and 35% affordance errors (feasibility misclassified); the language understanding is the bigger bottleneck
50% each
90% affordance errors

Chapter 8: Long-Horizon Tasks

SayCan’s most impressive demonstrations are on long-horizon tasks requiring 8–12 primitive skills in sequence. These require the LLM to understand abstract instructions, decompose them into ordered sub-goals, and maintain context across many steps.

Example: Workout recovery

"I just worked out, can you bring me a drink and a snack to recover?"

SayCan understands "recover from a workout" implies something healthy, and produces:

Find a water bottle

↓

Pick up the water bottle

↓

Bring it to you

↓

Find an apple

↓

Pick up the apple

↓

Bring it to you

↓

Done

Notice: the LLM chose water (not soda) and apple (not chips) because those are healthier — semantic knowledge from pre-training. The affordances confirmed both objects are present in the kitchen.

Example: Spill cleanup

"I left out a coke, apple, and water, can you throw them away and then bring me a sponge to wipe the table?"

This requires: (1) understanding that three objects need to be thrown away, (2) executing find → pick → dispose for each, (3) then finding and bringing a sponge. SayCan produces an 11-step plan and executes it successfully, interacting with a large portion of the kitchen.

State feedback matters: Because SayCan re-queries affordances at every step, it adapts to the evolving state. After picking up the coke, the value function for "pick up the coke" drops to zero (it’s already in hand), and "go to the trash" rises. This implicit state tracking lets SayCan chain skills without an explicit world model or state tracker.

Where long-horizon planning fails

Long-horizon tasks had the lowest execution success rate (73% planning, 47% execution). The main failure mode is early termination: the LLM selects "done" before all sub-goals are completed, typically after bringing the first item but forgetting the second. This is a known limitation of autoregressive planning: the model has no explicit mechanism to track which sub-goals have been satisfied.

What is the primary failure mode of SayCan on long-horizon tasks?

Early termination: the LLM selects "done" before all sub-goals are met, because autoregressive models don’t explicitly track which goals have been satisfied
The robot runs out of battery
The value functions become inaccurate over time

Chapter 9: Connections

What SayCan built on

PaLM (Chowdhery et al., 2022): The 540B-parameter LLM used for task scoring. SayCan showed that larger LLMs directly improve robot performance — 540B PaLM halved errors vs. FLAN 137B.

BC-Z (Jang et al., 2022): The behavior cloning framework used for training language-conditioned manipulation policies from demonstrations.

MT-Opt (Kalashnikov et al., 2021): Multi-task RL framework that provides the language-conditioned value functions used as affordance models.

What SayCan inspired

Inner Monologue (Huang et al., 2022): Extends SayCan by feeding environment feedback (success/failure signals, scene descriptions) back into the LLM context. This closes the loop — the LLM can re-plan when a skill fails.

Code as Policies (Liang et al., 2023): Instead of scoring skills, the LLM generates executable Python code that calls robot APIs. More flexible than SayCan’s fixed skill library, but requires the LLM to understand the API.

RT-1 (Brohan et al., 2023): A Transformer policy trained on 130k demonstrations across 700+ tasks. Unifies the "Say" and "Can" into a single model that directly maps language instructions to robot actions, scaling up the skill library approach.

RT-2 (Brohan et al., 2023): Combines a vision-language model (PaLI-X/PaLM-E) with robot action tokens. The VLM itself embeds both semantic knowledge and physical grounding — achieving SayCan’s goal of combining language and affordances, but within a single model.

PaLM-E (Driess et al., 2023): An embodied multimodal LLM that takes in images and text and outputs plans. Internalizes the affordance grounding that SayCan achieves through separate value functions.

VLAs (RT-2, pi0, etc.): Vision-Language-Action models represent the end state of the trajectory SayCan began. Instead of separate Say and Can modules, a single foundation model maps from vision + language to actions, with grounding learned implicitly from robot data.

SayCan’s legacy: SayCan was an early and influential demonstration that a frozen LLM could direct a real robot on long-horizon tasks. Its Say × Can factorization, with separate modules for "what should I do" and "what can I do", became a reference point for language-conditioned robot planning. The later systems listed above (RT-1, RT-2, PaLM-E, Code-as-Policies) can each be read as improving, unifying, or replacing one of SayCan’s two modules.

Cheat sheet

Core equation

π* = argmax_{π∈Π} p(ℓ_π | i) × p(c_π | s, ℓ_π)

Say

LLM scores skills by semantic relevance to the instruction

Can

Value functions score skills by physical feasibility in current state

Scale

551 skills, 17 objects, 101 tasks, 84% planning success

Impact

Blueprint for RT-1, RT-2, PaLM-E, Code-as-Policies, VLAs

How do later systems like RT-2 and PaLM-E differ from SayCan’s approach?

They unify SayCan’s separate "Say" (LLM) and "Can" (value function) modules into a single vision-language-action model that learns both semantic understanding and physical grounding end-to-end
They use a smaller language model
They remove the value functions entirely without replacement

Do As I Can, Not As I Say