Ahn, Brohan, Brown, Chebotar, Hausman, Levine et al. — Google, 2022

Do As I Can, Not As I Say

Grounding Language in Robotic Affordances — LLMs know what to do but not what a robot can do. SayCan bridges this gap by scoring LLM plans with learned value functions, producing action sequences that are both semantically correct and physically feasible.

Prerequisites: LLM basics + Value functions (RL) + Behavior cloning
10 chapters, 5+ simulations

Chapter 0: The Problem

You walk into your kitchen and say: "I just spilled my drink, can you help?" A large language model responds: "You could try using a vacuum cleaner." Reasonable advice for a human, perhaps. Completely useless for a robot that has no vacuum, cannot leave the kitchen, and can only pick, place, and navigate.

This is the grounding problem. LLMs absorb enormous knowledge about everyday tasks from web text — how to cook, clean, organize — but they have never done any of it. They have never opened a drawer, felt the weight of a sponge, or checked whether an apple is within arm’s reach. Their plans are written for an idealized agent with unlimited capabilities in an infinite environment.

Meanwhile, robots are the opposite. A mobile manipulator in a kitchen can pick up a sponge, navigate to the counter, and place objects in the trash with high reliability. It knows exactly what it can do. But if you tell it "I spilled my drink, can you help?" in raw natural language, it has no idea how to decompose that into its known skills.

The fundamental mismatch: LLMs know what to do but not what a robot can do. Robots know what they can do but not what to do. SayCan fuses both: the LLM provides task knowledge ("Say") and learned affordance functions provide feasibility knowledge ("Can").

Before SayCan, there were two dead ends. You could feed the raw instruction into a language-conditioned policy — but these policies only understand short, primitive commands like "pick up the can," not abstract requests like "help me clean up." Or you could use the LLM to generate a step-by-step plan — but without grounding, it might suggest skills the robot doesn’t have, objects that aren’t present, or actions that are physically impossible from the current state.

Why can't an LLM alone produce reliable plans for a specific robot in a specific environment?

Chapter 1: The Key Insight

SayCan’s insight is beautifully simple. Instead of asking the LLM to generate a plan (which might hallucinate impossible actions), ask it to score a fixed set of skills the robot already knows. Then multiply that score by a second score from the robot’s own value functions that captures "can I actually do this right now?"

score(π) = p(ℓπ | i) × p(cπ | s, ℓπ)

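Where ℓπ is the language description of skill π, i is the high-level instruction, s is the robot’s current state, and cπ is the event that skill π completes successfully. The first factor is the LLM’s score (Say); the second comes from the value function (Can).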

The robot selects the skill with the highest combined score, executes it, appends it to the LLM context, and repeats until a termination token is selected.

Why multiplication? Because we want both conditions to be true simultaneously. A skill that is useful but infeasible (high Say, low Can) gets suppressed. A skill that is feasible but irrelevant (low Say, high Can) also gets suppressed. Only skills that are both useful and doable rise to the top. The product acts as a probabilistic AND: it is high only when both probabilities are high.
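A minimal sketch of this scoring rule, assuming made-up numbers for the spilled-drink instruction. The skill names, the `say` and `can` dictionaries, and the helper functions are illustrative and not taken from the paper.

```python
# Hypothetical scores for "I spilled my drink, can you help?".
# All numbers are invented for illustration.
say = {  # LLM relevance, p(l_pi | i)
    "pick up the sponge": 0.50,
    "find a sponge": 0.30,
    "go to the table": 0.15,
    "find an apple": 0.05,
}
can = {  # value-function feasibility, p(c_pi | s, l_pi)
    "pick up the sponge": 0.05,  # no sponge in view yet
    "find a sponge": 0.90,
    "go to the table": 0.95,
    "find an apple": 0.90,
}


def combined_scores(say, can):
    """Multiply the Say and Can scores skill by skill."""
    return {skill: say[skill] * can[skill] for skill in say}


def select_skill(say, can):
    """Return the skill with the highest combined score."""
    scores = combined_scores(say, can)
    return max(scores, key=scores.get)


print(select_skill(say, can))
```

With these numbers the LLM alone would choose "pick up the sponge", but its low feasibility pulls the product down and "find a sponge" is selected instead.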

This factorization has a clean probabilistic interpretation. We want p(c_i | i, s, ℓπ): the probability that executing skill π from state s makes progress toward completing instruction i. Assuming a skill that fails makes zero progress and a skill that succeeds makes progress with probability p(ℓπ | i), we get the factorization above.
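The argument can be written out as a short derivation. This is a sketch under exactly the two assumptions stated above, plus the assumption that progress depends on the state only through whether the skill succeeds. Here c_i denotes the event that the step makes progress on instruction i.

```latex
\begin{align*}
p(c_i \mid i, s, \ell_\pi)
  &= p(c_i \mid c_\pi, i, \ell_\pi)\, p(c_\pi \mid s, \ell_\pi)
   + p(c_i \mid \neg c_\pi, i, \ell_\pi)\, p(\neg c_\pi \mid s, \ell_\pi) \\
  &= p(\ell_\pi \mid i)\, p(c_\pi \mid s, \ell_\pi)
   + 0 \cdot p(\neg c_\pi \mid s, \ell_\pi) \\
  &= p(\ell_\pi \mid i)\, p(c_\pi \mid s, \ell_\pi)
\end{align*}
```

The first line splits on whether the skill succeeds, and the second substitutes the two assumptions: a failed skill contributes nothing, and a successful one helps with probability p(ℓπ | i).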

[Interactive figure: Say × Can Scoring. Each bar shows the LLM score (Say), the value function score (Can), and their product; the highest combined score wins.]

In the SayCan scoring formula, what does multiplying the LLM probability by the value function probability achieve?

Chapter 2: LLMs for Planning

How do you get an LLM to break "I spilled my drink, can you help?" into a sequence of robot primitives? SayCan uses two techniques: prompt engineering and scoring mode.

Prompt engineering

The LLM is prompted with examples of a dialog between a user and a robot. The prompt shows the format: the user gives a high-level instruction, and the robot responds with "I would: 1. [skill], 2. [skill], ... done." This few-shot structure teaches the LLM the expected output format without any fine-tuning.
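One way such a prompt could be assembled is sketched below. The "User:" and "Robot:" speaker labels, the function names, and the example dialog are assumptions for illustration; only the "I would: 1. ..., 2. ..., done." pattern comes from the description above.

```python
def format_plan(skills):
    """Render skills as '1. a, 2. b, ...'."""
    return ", ".join(f"{n}. {skill}" for n, skill in enumerate(skills, start=1))


def build_prompt(examples, instruction, executed):
    """Build a few-shot dialog prompt that stops where the next skill belongs.

    examples: list of (user request, list of skills) pairs shown to the LLM.
    executed: skills already carried out for the current instruction.
    """
    lines = []
    for request, plan in examples:
        lines.append(f"User: {request}")
        lines.append(f"Robot: I would: {format_plan(plan + ['done'])}.")
    lines.append(f"User: {instruction}")
    opening = "Robot: I would: "
    if executed:
        opening += format_plan(executed) + ", "
    opening += f"{len(executed) + 1}."
    lines.append(opening)
    return "\n".join(lines)


examples = [
    ("Bring me an apple.",
     ["find an apple", "pick up the apple", "bring it to you"]),
]
prompt = build_prompt(examples, "I spilled my drink, can you help?", ["find a sponge"])
print(prompt)
```

The prompt ends with an open step number, so each candidate skill description can be appended and scored as the continuation.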

Scoring mode vs. generation mode

A critical design choice: instead of letting the LLM generate arbitrary text (which might produce skills the robot doesn’t have), SayCan uses the LLM in scoring mode. Given the instruction and dialog history so far, the LLM evaluates every skill description in the robot’s repertoire and assigns a probability to each one. This constrains the output to valid skills by construction.
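Scoring mode can be sketched as follows. The `toy_log_likelihood` function is a stand-in that only counts word overlap; a real system would query the LLM for the log-probability of each skill description as a continuation of the prompt. Normalizing over the skill set with a softmax is an assumption made here so the scores sum to one.

```python
import math


def toy_log_likelihood(prompt, completion):
    """Stand-in for an LLM call: rewards word overlap with the prompt.

    A real scorer would sum the model's token log-probabilities instead.
    """
    prompt_words = set(prompt.lower().split())
    overlap = sum(word in prompt_words for word in completion.lower().split())
    return float(overlap)


def score_skills(prompt, skills, log_likelihood):
    """Score every candidate skill and normalize over the fixed skill set."""
    logits = [log_likelihood(prompt, skill) for skill in skills]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return {skill: w / total for skill, w in zip(skills, weights)}


skills = ["find a sponge", "find an apple", "go to the table", "done"]
probs = score_skills("User: please bring me a sponge", skills, toy_log_likelihood)
print(max(probs, key=probs.get))
```

Because only entries of `skills` are ever scored, the selected output is a valid skill by construction, whatever the underlying model would have generated freely.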

Scoring vs. generative: The paper compares scoring mode against a generative baseline that freely generates text and then projects it to the nearest skill via cosine similarity with USE embeddings. The generative approach achieves 74% planning success vs. SayCan’s 84%. The gap comes from cases where the generated text doesn’t cleanly map to any available skill.
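The projection step of the generative baseline can be illustrated with a toy example. The bag-of-words `embed` function below is a deliberately crude substitute for USE embeddings, and the skill list is hypothetical; the point is only to show nearest-skill matching by cosine similarity.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (a stand-in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def project_to_skill(generated_text, skills):
    """Map free-form generated text to the most similar known skill."""
    query = embed(generated_text)
    return max(skills, key=lambda skill: cosine(query, embed(skill)))


skills = ["pick up the sponge", "find a sponge", "go to the trash"]
print(project_to_skill("grab the sponge from the counter", skills))
print(project_to_skill("use a vacuum cleaner", skills))
```

The first query maps sensibly to "pick up the sponge". The second has no real counterpart and lands on "find a sponge" only because both contain the word "a", which is the kind of unclean mapping the generative baseline suffers from.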

Iterative planning

SayCan doesn’t generate the full plan at once. It selects one skill, executes it, appends it to the dialog, and queries the LLM again. This means each step conditions on the current state of the world (via updated affordances) and the history of what’s been done (via the growing dialog). The LLM effectively maintains a chain of thought through the dialog context.

[Interactive figure: LLM Scoring Over Skills. The LLM scores every skill in the robot’s repertoire; each bar shows p(ℓπ | instruction), and the distribution shifts as skills are executed.]

Why does SayCan use the LLM in scoring mode rather than generation mode?

Chapter 3: Affordance Functions

The word "affordance" comes from ecological psychology — it means what the environment affords (allows) an agent to do. A cup affords grasping. An empty table affords placing. A closed drawer affords opening. SayCan operationalizes this concept through value functions.

Value functions as affordances

For each skill π with language description ℓπ, a value function V(s, ℓπ) is trained to predict the probability that the skill will complete successfully from state s. The training uses sparse rewards: +1 if the skill succeeds, 0 otherwise. In the undiscounted, sparse reward setting, this value function directly equals the success probability:

p(cπ | s, ℓπ) = V(s, ℓπ) ≈ P(success | current state, skill description)
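A small simulation makes the equality concrete. With reward 1 on success, 0 otherwise, and no discounting, every episode return is either 0 or 1, so the average return is the success rate. The function names and the fixed `success_prob` are assumptions for illustration; a learned value function would estimate this quantity from images rather than be told it.

```python
import random


def rollout(success_prob, rng):
    """One simulated episode: return 1.0 if the skill succeeds, else 0.0."""
    return 1.0 if rng.random() < success_prob else 0.0


def monte_carlo_value(success_prob, episodes=20000, seed=0):
    """Average undiscounted return over many episodes from the same state."""
    rng = random.Random(seed)
    returns = [rollout(success_prob, rng) for _ in range(episodes)]
    return sum(returns) / len(returns)


# A state where the skill truly succeeds 80% of the time.
print(monte_carlo_value(0.8))
```

The printed estimate should be close to 0.8, which is why the value can be read directly as p(cπ | s, ℓπ).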

What makes this work

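The value function is state-conditioned. It takes in the robot’s current camera image and the skill description, and outputs a probability. This means the same skill can score high in one scene and near zero in another, depending on whether the relevant object is present and reachable.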

The key realization: Value functions trained via RL on simple manipulation tasks naturally learn to be affordance detectors. A value function for "pick up the can" learns to recognize when a can is present, reachable, and graspable — exactly the information needed to ground LLM plans in physical reality. No separate perception module needed.

Multi-task training

Instead of training 551 separate value functions, SayCan trains a single multi-task model conditioned on the language description. The skill command is encoded via a pre-trained sentence encoder (Universal Sentence Encoder), and the resulting embedding is fed into the value network alongside the image observation. This amortizes training cost and enables generalization across similar skills.

Why do sparse-reward value functions (reward = 1 on success, 0 otherwise) naturally function as affordance detectors?

Chapter 4: The SayCan Algorithm

Now let’s put it all together. The SayCan algorithm is a loop: at each step, score every skill, pick the best, execute it, update the context, and repeat.

Algorithm

Input: high-level instruction i, initial state s_0, skill library Π with descriptions ℓΠ
Score (Say): for each skill π, compute p_LLM(π) = p(ℓπ | i, history)
Score (Can): for each skill π, compute p_aff(π) = p(cπ | s_n, ℓπ) from the value function
Combine: p_combined(π) = p_aff(π) × p_LLM(π); select π_n = argmax over π of p_combined(π)
Execute: run skill π_n on the robot and observe the new state s_{n+1}
Append: add the description of π_n to the dialog history and set n = n + 1
Repeat: loop until the "done" token is selected as the best skill

The elegance: SayCan requires no fine-tuning of the LLM. The language model is used purely as a scoring function, frozen. The value functions are trained separately via RL. At inference time, the only computation is: score each skill with the LLM, score each skill with the value function, multiply, take argmax. The plan emerges from the interaction between these two scoring systems.
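The loop can be sketched end to end with toy stand-ins. Everything below the `saycan_plan` function (the scores, the state dictionary, and the three `toy_*` helpers) is hypothetical, and treating "done" as always feasible is an assumption of this sketch rather than a detail taken from the paper.

```python
def saycan_plan(instruction, skills, say_fn, can_fn, execute_fn, state, max_steps=10):
    """Greedy loop: score every skill, take the argmax, execute, append, repeat."""
    history = []
    for _ in range(max_steps):
        scores = {}
        for skill in skills:
            say = say_fn(instruction, history, skill)
            # Assumption: "done" is always feasible, so only the LLM gates it.
            can = 1.0 if skill == "done" else can_fn(state, skill)
            scores[skill] = say * can
        best = max(scores, key=scores.get)
        if best == "done":
            break
        state = execute_fn(state, best)
        history.append(best)
    return history


# --- Toy stand-ins; all values are invented ---
BASE_SAY = {"find a sponge": 0.3, "pick up the sponge": 0.5, "bring it to you": 0.2}


def toy_say(instruction, history, skill):
    """Fake LLM: prefers unfinished steps, and 'done' once the item is delivered."""
    if skill == "done":
        return 0.9 if "bring it to you" in history else 0.05
    return 0.02 if skill in history else BASE_SAY[skill]


def toy_can(state, skill):
    """Fake value function: feasibility depends on the current state."""
    if skill == "find a sponge":
        return 0.1 if state["at_sponge"] else 0.9
    if skill == "pick up the sponge":
        return 0.9 if state["at_sponge"] and not state["holding"] else 0.05
    return 0.9 if state["holding"] else 0.05  # "bring it to you"


def toy_execute(state, skill):
    """Fake robot: every skill succeeds and updates the state."""
    new_state = dict(state)
    if skill == "find a sponge":
        new_state["at_sponge"] = True
    elif skill == "pick up the sponge":
        new_state["holding"] = True
    elif skill == "bring it to you":
        new_state["holding"] = False
        new_state["at_sponge"] = False
    return new_state


skills = ["find a sponge", "pick up the sponge", "bring it to you", "done"]
start = {"at_sponge": False, "holding": False}
plan = saycan_plan("bring me a sponge", skills, toy_say, toy_can, toy_execute, start)
print(plan)
```

The fake LLM ranks "pick up the sponge" highest at the first step, but its affordance is low until the sponge has been found, so the resulting plan is find, pick up, bring, and then "done" wins and the loop stops.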

[Interactive figure: SayCan Decision Loop. At each step, the LLM scores skills (blue), value functions score feasibility (green), and the product (gold) selects the next action. Example instruction: "I spilled my drink, help?"]

After executing a skill, what does SayCan do before selecting the next skill?

Chapter 5: The Skill Library

SayCan’s robot has a repertoire of 551 skills spanning seven families, operating with 17 objects in a kitchen environment. Each skill is a short, language-described behavior with its own policy and value function.

Skill families

Training the policies

The skill policies are trained via two methods: behavior cloning from demonstrations (BC-Z) and reinforcement learning (MT-Opt).

A crucial decoupling: The BC policies execute the skills, but the RL-trained value functions score feasibility. You can use whatever policy training method works best for execution, as long as you have value functions from RL to provide affordances. The "brain" (LLM + value functions) and the "body" (execution policies) are separate systems.

Language conditioning

Both policies and value functions receive the skill description as input via a frozen Universal Sentence Encoder (USE). The USE embedding tells the model which skill to perform. Importantly, the sentence encoder used for low-level skill conditioning is different from the LLM used for high-level planning — each language model operates at the abstraction level it’s best suited for.

Why does SayCan use BC policies for execution but RL-trained value functions for affordance scoring?

Chapter 6: Grounding Through Affordances

The affordance grounding is what makes SayCan work in the real world. Without it, the LLM generates plausible-sounding but physically impossible plans. Let’s see exactly how grounding fixes the failure modes.

Failure mode 1: Objects not present

Instruction: "Bring me a fruit." The LLM might score "pick up the banana" highest because bananas are common fruits. But there’s no banana in the kitchen — there’s only an apple. The value function for "pick up the banana" returns ≈0 (it sees no banana), while "pick up the apple" returns ≈0.8. After multiplication, the apple wins despite lower LLM score.
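The same example in a few lines of code, comparing selection with and without the Can term. The affordance values follow the text; the LLM scores of 0.6 and 0.4 are hypothetical, since the text only says the banana scores higher.

```python
# LLM scores are hypothetical; affordances follow the example (about 0 and 0.8).
say = {"pick up the banana": 0.6, "pick up the apple": 0.4}
can = {"pick up the banana": 0.0, "pick up the apple": 0.8}


def pick(skills, score):
    """Return the skill that maximizes the given scoring function."""
    return max(skills, key=score)


ungrounded = pick(say, lambda skill: say[skill])             # Say only
grounded = pick(say, lambda skill: say[skill] * can[skill])  # Say x Can

print(ungrounded)
print(grounded)
```

The Say-only choice is the banana, while the product (0.6 × 0.0 = 0 against 0.4 × 0.8 = 0.32) selects the apple.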

Failure mode 2: Wrong state for the skill

Instruction: "Put the can on the counter." The robot doesn’t have the can yet. The LLM might score "place the can on the counter" highly, but the value function knows the gripper is empty — that skill’s affordance is near zero. Instead, "find a coke can" has high affordance (the robot can navigate), so it gets selected first.

Failure mode 3: Hallucinated capabilities

Without the skill-constrained scoring approach, a generative LLM might suggest "use a vacuum cleaner" or "call for help" — actions completely outside the robot’s capability set. SayCan eliminates this by only scoring skills the robot actually has.

[Interactive figure: Grounding in Action. Ungrounded LLM planning (left) compared with SayCan with affordance grounding (right), showing how affordances redirect the plan to feasible actions.]

Ablation result: Without value function grounding ("No VF"), planning success drops from 84% to 67%. Embodiment-specific tasks drop from 64% to just 18%. The affordance grounding is most critical when the task requires understanding the robot’s current physical state.

How does affordance grounding handle the case where the LLM’s top-scoring skill involves an object not present in the scene?

Chapter 7: Results

SayCan was evaluated on 101 real-world robotic tasks in two kitchen environments: a mock kitchen (training environment) and a real office kitchen (novel environment). The robot is a mobile manipulator from Everyday Robots with a 7-DOF arm and two-fingered gripper.

Headline numbers

The gap between planning and execution success reflects that even correct plans can fail during physical execution (grasping failures, navigation errors, etc.).

Ablation results

The paper systematically ablates both the LLM and the affordance grounding:

[Interactive figure: Results Comparison. Planning success rates across methods; SayCan with full affordance grounding outperforms all ablations.]

Error analysis

Of all failures, 65% were LLM errors (wrong skill selected) and 35% were affordance errors (value function misclassified feasibility). Common LLM failure modes include early termination (stopping before all sub-goals are met), negation errors ("not an apple" → selects apple), and ambiguous references.

Scaling with better LLMs: PaLM 540B achieves 84% planning success, while FLAN 137B achieves only 72%. This shows that a robot’s task performance can be improved simply by upgrading the underlying language model, with no retraining of any robotic component required.

What percentage of SayCan’s failures are due to LLM errors vs. affordance function errors?

Chapter 8: Long-Horizon Tasks

SayCan’s most impressive demonstrations are on long-horizon tasks requiring 8–12 primitive skills in sequence. These require the LLM to understand abstract instructions, decompose them into ordered sub-goals, and maintain context across many steps.

Example: Workout recovery

"I just worked out, can you bring me a drink and a snack to recover?"

SayCan understands "recover from a workout" implies something healthy, and produces:

1. Find a water bottle
2. Pick up the water bottle
3. Bring it to you
4. Find an apple
5. Pick up the apple
6. Bring it to you
7. Done

Notice: the LLM chose water (not soda) and apple (not chips) because those are healthier — semantic knowledge from pre-training. The affordances confirmed both objects are present in the kitchen.

Example: Spill cleanup

"I left out a coke, apple, and water, can you throw them away and then bring me a sponge to wipe the table?"

This requires: (1) understanding that three objects need to be thrown away, (2) executing find → pick → dispose for each, (3) then finding and bringing a sponge. SayCan produces an 11-step plan and executes it successfully, interacting with a large portion of the kitchen.

State feedback matters: Because SayCan re-queries affordances at every step, it adapts to the evolving state. After picking up the coke, the value function for "pick up the coke" drops to zero (it’s already in hand), and "go to the trash" rises. This implicit state tracking lets SayCan chain skills without an explicit world model or state tracker.

Where long-horizon planning fails

Long-horizon tasks were among the hardest (73% planning, 47% execution). The main failure mode is early termination: the LLM selects "done" before all sub-goals are completed, typically after bringing the first item but forgetting the second. This is a known limitation of autoregressive planning: the model has no explicit mechanism to track which sub-goals have been satisfied.

What is the primary failure mode of SayCan on long-horizon tasks?

Chapter 9: Connections

What SayCan built on

PaLM (Chowdhery et al., 2022): The 540B-parameter LLM used for task scoring. SayCan showed that larger LLMs directly improve robot performance — 540B PaLM halved errors vs. FLAN 137B.

BC-Z (Jang et al., 2022): The behavior cloning framework used for training language-conditioned manipulation policies from demonstrations.

MT-Opt (Kalashnikov et al., 2021): Multi-task RL framework that provides the language-conditioned value functions used as affordance models.

What SayCan inspired

Inner Monologue (Huang et al., 2022): Extends SayCan by feeding environment feedback (success/failure signals, scene descriptions) back into the LLM context. This closes the loop — the LLM can re-plan when a skill fails.

Code as Policies (Liang et al., 2023): Instead of scoring skills, the LLM generates executable Python code that calls robot APIs. More flexible than SayCan’s fixed skill library, but requires the LLM to understand the API.

RT-1 (Brohan et al., 2023): A Transformer policy trained on 130k demonstrations across 700+ tasks. Unifies the "Say" and "Can" into a single model that directly maps language instructions to robot actions, scaling up the skill library approach.

RT-2 (Brohan et al., 2023): Combines a vision-language model (PaLI-X/PaLM-E) with robot action tokens. The VLM itself embeds both semantic knowledge and physical grounding — achieving SayCan’s goal of combining language and affordances, but within a single model.

PaLM-E (Driess et al., 2023): An embodied multimodal LLM that takes in images and text and outputs plans. Internalizes the affordance grounding that SayCan achieves through separate value functions.

VLAs (RT-2, pi0, etc.): Vision-Language-Action models represent the end state of the trajectory SayCan began. Instead of separate Say and Can modules, a single foundation model maps from vision + language to actions, with grounding learned implicitly from robot data.

SayCan’s legacy: SayCan was the first convincing demonstration that frozen LLMs could direct real robots on long-horizon tasks. Its Say × Can factorization, with separate modules for "what should I do" and "what can I do", became a widely used blueprint for language-conditioned robot planning. The later systems listed above (RT-1, RT-2, PaLM-E, Code-as-Policies) can be understood as either improving, unifying, or replacing one of SayCan’s two modules.

Cheat sheet

Core equation: π* = argmax over π ∈ Π of p(ℓπ | i) × p(cπ | s, ℓπ)
Say: LLM scores skills by semantic relevance to the instruction
Can: Value functions score skills by physical feasibility in the current state
Scale: 551 skills, 17 objects, 101 tasks, 84% planning success
Impact: Blueprint for RT-1, RT-2, PaLM-E, Code-as-Policies, VLAs

How do later systems like RT-2 and PaLM-E differ from SayCan’s approach?