Grounding Language in Robotic Affordances

LLMs know what to do but not what a robot can do. SayCan bridges this gap by scoring LLM plans with learned value functions, producing action sequences that are both semantically correct and physically feasible.
You walk into your kitchen and say: "I just spilled my drink, can you help?" A large language model responds: "You could try using a vacuum cleaner." Reasonable advice for a human, perhaps. Completely useless for a robot that has no vacuum, cannot leave the kitchen, and can only pick, place, and navigate.
This is the grounding problem. LLMs absorb enormous knowledge about everyday tasks from web text — how to cook, clean, organize — but they have never done any of it. They have never opened a drawer, felt the weight of a sponge, or checked whether an apple is within arm’s reach. Their plans are written for an idealized agent with unlimited capabilities in an infinite environment.
Robots have the opposite problem. A mobile manipulator in a kitchen can pick up a sponge, navigate to the counter, and place objects in the trash with high reliability. Through its learned skills, it knows what it can do. But if you tell it "I spilled my drink, can you help?" in raw natural language, it has no idea how to decompose that into its known skills.
Before SayCan, there were two dead ends. You could feed the raw instruction into a language-conditioned policy — but these policies only understand short, primitive commands like "pick up the can," not abstract requests like "help me clean up." Or you could use the LLM to generate a step-by-step plan — but without grounding, it might suggest skills the robot doesn’t have, objects that aren’t present, or actions that are physically impossible from the current state.
SayCan’s insight is beautifully simple. Instead of asking the LLM to generate a plan (which might hallucinate impossible actions), ask it to score a fixed set of skills the robot already knows. Then multiply that score by a second score from the robot’s own value functions that captures "can I actually do this right now?"
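In symbols, the chosen skill is π* = argmax over π of p(ℓπ | i) · p(cπ | s, ℓπ).
Where:
p(ℓπ | i) is the LLM score (Say): how likely the skill description ℓπ is to be a useful next step for instruction i.
p(cπ | s, ℓπ) is the value function score (Can): how likely skill π is to complete successfully from the current state s.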
The robot selects the skill with the highest combined score, executes it, appends it to the LLM context, and repeats until a termination token is selected.
This product has a clean probabilistic interpretation. We want p(ci | i, s, ℓπ), the probability that attempting skill π from state s makes progress toward completing instruction i. Assuming a skill that fails makes zero progress, and a skill that completes successfully (event cπ) makes progress with probability p(ℓπ | i), we get p(ci | i, s, ℓπ) ∝ p(cπ | s, ℓπ) · p(ℓπ | i): the affordance term times the LLM term.
[Figure: each bar shows the LLM score (Say), the value function score (Can), and their product. The highest combined score wins.]
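The combined scoring described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the skill names and all numbers are made up, and treating "done" as always feasible is an assumption.

```python
def combine_scores(say, can):
    """Multiply each skill's LLM score (Say) by its affordance score (Can)."""
    return {skill: say[skill] * can[skill] for skill in say}


def select_skill(say, can):
    """Return the skill with the highest combined score and all combined scores."""
    combined = combine_scores(say, can)
    best = max(combined, key=combined.get)
    return best, combined


# Illustrative numbers only (not taken from the paper).
say = {
    "find a sponge": 0.40,
    "pick up the sponge": 0.35,
    "go to the table": 0.20,
    "done": 0.05,
}
can = {
    "find a sponge": 0.90,       # navigation is usually feasible
    "pick up the sponge": 0.10,  # no sponge within reach yet
    "go to the table": 0.90,
    "done": 1.00,                # assumption: termination is always feasible
}

best, combined = select_skill(say, can)
print(best)
for skill, score in combined.items():
    print(f"{skill}: {score:.3f}")
```

Because the two scores are multiplied, a skill needs to be both relevant and feasible to win; a near-zero value on either side suppresses it.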
How do you get an LLM to break "I spilled my drink, can you help?" into a sequence of robot primitives? SayCan uses two techniques: prompt engineering and scoring mode.
The LLM is prompted with examples of a dialog between a user and a robot. The prompt shows the format: the user gives a high-level instruction, and the robot responds with "I would: 1. [skill], 2. [skill], ... done." This few-shot structure teaches the LLM the expected output format without any fine-tuning.
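A prompt in this format might be assembled as follows. This is a sketch under assumptions: the speaker labels "Human:" and "Robot:", the numbering of "done" as the final step, and the helper names `format_example` and `build_prompt` are all hypothetical choices for illustration.

```python
def format_example(instruction, skills):
    """Render one few-shot example: an instruction and its numbered plan."""
    steps = list(skills) + ["done"]
    numbered = ", ".join(f"{n}. {s}" for n, s in enumerate(steps, start=1))
    return f"Human: {instruction}\nRobot: I would: {numbered}."


def build_prompt(examples, instruction, history):
    """Few-shot examples, then the live dialog, left open at the next step."""
    shots = "\n\n".join(format_example(i, s) for i, s in examples)
    done_steps = "".join(f"{n}. {s}, " for n, s in enumerate(history, start=1))
    next_step = len(history) + 1
    query = f"Human: {instruction}\nRobot: I would: {done_steps}{next_step}."
    return f"{shots}\n\n{query}"


examples = [
    ("bring me a coke",
     ["find a coke can", "pick up the coke can", "bring it to you"]),
]
prompt = build_prompt(examples, "I spilled my drink, can you help?", ["find a sponge"])
print(prompt)
```

The prompt ends at the open step number, so each candidate skill description can be evaluated as the text that would come next.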
A critical design choice: instead of letting the LLM generate arbitrary text (which might produce skills the robot doesn’t have), SayCan uses the LLM in scoring mode. Given the instruction and dialog history so far, the LLM computes how likely each skill description in the robot’s repertoire is as the next line of the dialog. This constrains the output to valid skills by construction.
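Scoring mode can be sketched as summing per-token log-probabilities of each candidate skill given the context. The sketch below is an assumption-heavy illustration: `toy_token_logprob` is a hypothetical stand-in for a real LLM (it simply favors words already seen in the context), and normalizing over the repertoire with a softmax is one possible choice rather than something the text specifies.

```python
import math


def toy_token_logprob(context, token):
    """Hypothetical stand-in for an LLM: words already in the context are likelier."""
    seen = context.lower().split()
    return math.log(0.5) if token.lower() in seen else math.log(0.1)


def skill_logprob(context, skill, token_logprob=toy_token_logprob):
    """Log-likelihood of a skill description as the continuation of the context."""
    total = 0.0
    for token in skill.split():
        total += token_logprob(context, token)
        context = f"{context} {token}"  # later tokens condition on earlier ones
    return total


def score_skills(context, skills, token_logprob=toy_token_logprob):
    """Score every skill in the repertoire and normalize so the scores sum to 1."""
    logps = {s: skill_logprob(context, s, token_logprob) for s in skills}
    top = max(logps.values())  # subtract the max for numerical stability
    weights = {s: math.exp(lp - top) for s, lp in logps.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


context = "Human: please bring me the sponge\nRobot: I would: 1."
skills = ["pick up the sponge", "pick up the apple", "pick up the coke"]
scores = score_skills(context, skills)
for skill, p in scores.items():
    print(f"{skill}: {p:.3f}")
```

Since a description's likelihood is a product of token probabilities, longer descriptions tend to score lower under this scheme; the three candidates here have equal length so the comparison is fair.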
SayCan doesn’t generate the full plan at once. It selects one skill, executes it, appends it to the dialog, and queries the LLM again. This means each step conditions on the current state of the world (via updated affordances) and the history of what’s been done (via the growing dialog). The LLM effectively maintains a chain of thought through the dialog context.
[Figure: the LLM scores every skill in the robot’s repertoire. Each bar shows p(ℓπ | instruction), and the distribution shifts as skills are executed and appended to the dialog.]
The word "affordance" comes from ecological psychology — it means what the environment affords (allows) an agent to do. A cup affords grasping. An empty table affords placing. A closed drawer affords opening. SayCan operationalizes this concept through value functions.
For each skill π with language description ℓπ, a value function V(s, ℓπ) is trained to predict the probability that the skill will complete successfully from state s. The training uses sparse rewards: +1 if the skill succeeds, 0 otherwise. In the undiscounted, sparse reward setting, the expected return is the success probability itself, so the value function directly equals it: V(s, ℓπ) = p(cπ | s, ℓπ).
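The link between a 0/1 terminal reward and success probability can be checked numerically. The following is a minimal Monte Carlo sketch; the true success rate of 0.8 and the function names are illustrative assumptions, and a real value function would be learned from robot data rather than simulated.

```python
import random


def rollout_return(success_prob, rng):
    """One episode with a sparse reward: 1.0 on success, 0.0 otherwise."""
    return 1.0 if rng.random() < success_prob else 0.0


def monte_carlo_value(success_prob, episodes=20000, seed=0):
    """Average undiscounted return over many simulated episodes."""
    rng = random.Random(seed)
    returns = [rollout_return(success_prob, rng) for _ in range(episodes)]
    return sum(returns) / len(returns)


estimate = monte_carlo_value(0.8)
print(f"estimated value: {estimate:.3f} (true success probability: 0.8)")
```

The average return converges to the success rate because each episode contributes exactly 1 when the skill succeeds and 0 when it fails.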
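The value function is state-conditioned. It takes in the robot’s current camera image and the skill description, and outputs a probability. This means the same skill can score very differently from one state to the next: "pick up the apple" scores high when an apple is in view and within reach, and near zero when no apple is present.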
Instead of training 551 separate value functions, SayCan trains a single multi-task model conditioned on the language description. The skill command is encoded via a pre-trained sentence encoder (Universal Sentence Encoder), and the resulting embedding is fed into the value network alongside the image observation. This amortizes training cost and enables generalization across similar skills.
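The idea of one model shared across skills can be sketched as a small network that takes image features concatenated with a text embedding. Everything here is a hypothetical illustration: `embed_text` is a crude bag-of-words stand-in for a frozen sentence encoder, the weights are random and untrained, and the layer sizes are arbitrary.

```python
import math
import random


def embed_text(text, dim=8):
    """Hypothetical stand-in for a frozen sentence encoder (hashed bag of words)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class LanguageConditionedValue:
    """One value model for all skills; the skill text selects the behavior."""

    def __init__(self, image_dim, text_dim=8, hidden=16, seed=0):
        rng = random.Random(seed)
        in_dim = image_dim + text_dim
        self.text_dim = text_dim
        # Untrained random weights, for illustrating the data flow only.
        self.w1 = [[rng.gauss(0, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
        self.w2 = [rng.gauss(0, 0.5) for _ in range(hidden)]

    def __call__(self, image_features, skill_text):
        x = list(image_features) + embed_text(skill_text, self.text_dim)
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        logit = sum(w * h for w, h in zip(self.w2, hidden))
        return 1.0 / (1.0 + math.exp(-logit))  # squash to a probability


model = LanguageConditionedValue(image_dim=4)
image_features = [0.2, 0.5, 0.1, 0.9]  # placeholder for encoded camera input
value = model(image_features, "pick up the apple")
print(f"{value:.3f}")
```

Because the skill enters only through its embedding, descriptions with similar wording produce similar inputs, which is the mechanism by which a shared model can generalize across related skills.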
Now let’s put it all together. The SayCan algorithm is a loop: at each step, score every skill, pick the best, execute it, update the context, and repeat.
[Figure: SayCan planning step by step. At each step, the LLM scores skills (blue), value functions score feasibility (green), and the product (gold) selects the next action.]
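The loop can be sketched end to end with stubs in place of the real components. This is a toy under stated assumptions: `toy_say`, `toy_can`, and `toy_execute` are hypothetical hand-written stand-ins for the LLM, the value functions, and the robot, and the world state is a two-flag dictionary.

```python
def saycan_plan(instruction, skills, say_fn, can_fn, execute_fn, state, max_steps=10):
    """Score every skill, pick the best, execute it, update the context, repeat."""
    history = []
    for _ in range(max_steps):
        say = say_fn(instruction, history, skills)
        can = can_fn(state, skills)
        best = max(skills, key=lambda s: say[s] * can[s])
        if best == "done":
            break
        state = execute_fn(state, best)
        history.append(best)
    return history


def toy_say(instruction, history, skills):
    """Stand-in LLM scores (unnormalized) that depend only on the dialog so far."""
    if "bring it to you" in history:
        preferred = {"done": 0.8}
    elif "pick up the sponge" in history:
        preferred = {"bring it to you": 0.7}
    else:
        # On language alone, picking looks slightly better than finding.
        preferred = {"pick up the sponge": 0.5, "find a sponge": 0.4}
    return {s: preferred.get(s, 0.05) for s in skills}


def toy_can(state, skills):
    """Stand-in affordances that depend only on the world state."""
    can_pick = state["visible"] and not state["holding"]
    scores = {
        "find a sponge": 0.9,
        "pick up the sponge": 0.9 if can_pick else 0.05,
        "bring it to you": 0.9 if state["holding"] else 0.05,
        "done": 1.0,  # assumption: termination is always feasible
    }
    return {s: scores[s] for s in skills}


def toy_execute(state, skill):
    """Apply a skill's effect to a copy of the state."""
    state = dict(state)
    if skill == "find a sponge":
        state["visible"] = True
    elif skill == "pick up the sponge":
        state["holding"] = True
    elif skill == "bring it to you":
        state["holding"], state["visible"] = False, False
    return state


skills = ["find a sponge", "pick up the sponge", "bring it to you", "done"]
start = {"visible": False, "holding": False}
plan = saycan_plan("bring me a sponge", skills, toy_say, toy_can, toy_execute, start)
print(plan)
```

In the first step the stub LLM prefers picking, but the affordance for picking is low while no sponge is visible, so finding is selected first; the same LLM scores then lead to picking once the state has changed.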
SayCan’s robot has a repertoire of 551 skills spanning seven families, operating with 17 objects in a kitchen environment. Each skill is a short, language-described behavior with its own policy and value function.
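The skill policies are trained via two methods: behavior cloning from demonstrations, following BC-Z, and multi-task reinforcement learning, following MT-Opt.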
Both policies and value functions receive the skill description as input via a frozen Universal Sentence Encoder (USE). The USE embedding tells the model which skill to perform. Importantly, the sentence encoder used for low-level skill conditioning is different from the LLM used for high-level planning — each language model operates at the abstraction level it’s best suited for.
The affordance grounding is what makes SayCan work in the real world. Without it, the LLM generates plausible-sounding but physically impossible plans. Let’s see exactly how grounding fixes the failure modes.
Instruction: "Bring me a fruit." The LLM might score "pick up the banana" highest because bananas are common fruits. But there’s no banana in the kitchen — there’s only an apple. The value function for "pick up the banana" returns ≈0 (it sees no banana), while "pick up the apple" returns ≈0.8. After multiplication, the apple wins despite lower LLM score.
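The "Bring me a fruit" case can be worked through numerically. The affordance values for the banana and apple follow the approximate figures in the text; the LLM scores and the sponge entry are assumed for illustration.

```python
# LLM scores are assumed; they only need to rank the banana above the apple.
say = {
    "pick up the banana": 0.5,
    "pick up the apple": 0.3,
    "pick up the sponge": 0.2,
}
# Affordances: no banana in the scene, apple present (sponge value is assumed).
can = {
    "pick up the banana": 0.0,
    "pick up the apple": 0.8,
    "pick up the sponge": 0.8,
}

llm_only = max(say, key=say.get)
grounded = max(say, key=lambda skill: say[skill] * can[skill])

print("LLM alone:", llm_only)
print("LLM x affordance:", grounded)
```

The banana's product is zero regardless of how strongly the LLM prefers it, so the ranking flips to the apple, while the sponge stays below the apple because the LLM considers it less relevant to the instruction.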
Instruction: "Put the can on the counter." The robot doesn’t have the can yet. The LLM might score "place the can on the counter" highly, but the value function knows the gripper is empty — that skill’s affordance is near zero. Instead, "find a coke can" has high affordance (the robot can navigate), so it gets selected first.
Without the skill-constrained scoring approach, a generative LLM might suggest "use a vacuum cleaner" or "call for help" — actions completely outside the robot’s capability set. SayCan eliminates this by only scoring skills the robot actually has.
[Figure: ungrounded LLM planning (left) vs. SayCan with affordance grounding (right), showing how affordances redirect the plan to feasible actions.]
SayCan was evaluated on 101 real-world robotic tasks in two kitchen environments: a mock kitchen (training environment) and a real office kitchen (novel environment). The robot is a mobile manipulator from Everyday Robots with a 7-DOF arm and two-fingered gripper.
The gap between planning and execution success reflects that even correct plans can fail during physical execution (grasping failures, navigation errors, etc.).
The paper systematically ablates both the LLM and the affordance grounding:
Planning success rates across methods. SayCan with full affordance grounding outperforms all ablations.
Of all failures, 65% were LLM errors (wrong skill selected) and 35% were affordance errors (value function misclassified feasibility). Common LLM failure modes include early termination (stopping before all sub-goals are met), negation errors ("not an apple" → selects apple), and ambiguous references.
SayCan’s most impressive demonstrations are on long-horizon tasks requiring 8–12 primitive skills in sequence. These require the LLM to understand abstract instructions, decompose them into ordered sub-goals, and maintain context across many steps.
"I just worked out, can you bring me a drink and a snack to recover?"
SayCan infers that "recover from a workout" implies something healthy, and its plan fetches a water and an apple.
The LLM chose water (not soda) and an apple (not chips) because those are healthier, which is semantic knowledge from pre-training. The affordances confirmed that both objects are present in the kitchen.
"I left out a coke, apple, and water, can you throw them away and then bring me a sponge to wipe the table?"
This requires: (1) understanding that three objects need to be thrown away, (2) executing find → pick → dispose for each, (3) then finding and bringing a sponge. SayCan produces an 11-step plan and executes it successfully, interacting with a large portion of the kitchen.
Long-horizon tasks had the lowest success rate (73% planning, 47% execution). The main failure mode is early termination: the LLM selects "done" before all sub-goals are completed, typically after bringing the first item but forgetting the second. This is a known limitation of autoregressive planning — the model doesn’t have an explicit mechanism to track which sub-goals have been satisfied.
PaLM (Chowdhery et al., 2022): The 540B-parameter LLM used for task scoring. SayCan showed that larger LLMs directly improve robot performance — 540B PaLM halved errors vs. FLAN 137B.
BC-Z (Jang et al., 2022): The behavior cloning framework used for training language-conditioned manipulation policies from demonstrations.
MT-Opt (Kalashnikov et al., 2021): Multi-task RL framework that provides the language-conditioned value functions used as affordance models.
Inner Monologue (Huang et al., 2022): Extends SayCan by feeding environment feedback (success/failure signals, scene descriptions) back into the LLM context. This closes the loop — the LLM can re-plan when a skill fails.
Code as Policies (Liang et al., 2023): Instead of scoring skills, the LLM generates executable Python code that calls robot APIs. More flexible than SayCan’s fixed skill library, but requires the LLM to understand the API.
RT-1 (Brohan et al., 2023): A Transformer policy trained on 130k demonstrations across 700+ tasks. It maps language instructions and camera images directly to robot actions, scaling up the language-conditioned skill ("Can") side of the approach; for long-horizon tasks it still relies on a high-level planner such as SayCan.
RT-2 (Brohan et al., 2023): Combines a vision-language model (PaLI-X/PaLM-E) with robot action tokens. The VLM itself embeds both semantic knowledge and physical grounding — achieving SayCan’s goal of combining language and affordances, but within a single model.
PaLM-E (Driess et al., 2023): An embodied multimodal LLM that takes in images and text and outputs plans. Internalizes the affordance grounding that SayCan achieves through separate value functions.
VLAs (RT-2, pi0, etc.): Vision-Language-Action models represent the end state of the trajectory SayCan began. Instead of separate Say and Can modules, a single foundation model maps from vision + language to actions, with grounding learned implicitly from robot data.