Introduction
Imagine standing in a kitchen and asking someone to "move the sponge near the cloth." This instruction seems trivial to a human listener: identify the sponge, identify the cloth, plan a reach-grasp-move trajectory, and place the sponge in the vicinity of the cloth. But for a robot, this single sentence requires solving a cascade of hard problems — visual grounding (which object is the sponge?), spatial reasoning (what does "near" mean in metric coordinates?), and motor control (how do I move my arm there without collisions?).
Language-conditioned policies are neural network policies that take both sensory observations and natural language instructions as input and produce motor actions as output. They represent a fundamental shift from the traditional robotics paradigm where tasks were specified through goal positions, reward functions, or hard-coded task indices. Language provides a compositional, open-ended, and human-compatible interface for specifying robot behavior.
The key insight driving this field is that pretrained language models already encode rich semantic structure — "pick up" and "grasp" are nearby in embedding space, "red cup" and "blue cup" share most of their representation — and this structure can be exploited to build policies that generalize to novel instructions without additional training data.
This article covers: why language is the natural task interface for robots, how language embeddings serve as goal representations, three mechanisms for conditioning policies on language (FiLM, cross-attention, token concatenation), landmark multi-task systems (CLIPort, BC-Z, SayCan, Gato), the RT-1 and RT-2 architectures, semantic generalization to unseen instructions, and the fundamental limitations of language-conditioned control.
Language as Goal Specification
Task IDs vs. language
The simplest way to tell a multi-task policy which task to perform is a one-hot task identifier: task 0 = "pick up red block," task 1 = "open drawer," task 2 = "push button," and so on. The task ID is concatenated with the observation and fed to the policy network. This approach is used in many multi-task reinforcement learning papers and works well when the task set is small and fixed.
But one-hot IDs have three critical limitations:
- No generalization. Tasks 0 and 1 are orthogonal in the one-hot representation, regardless of their semantic similarity. The policy cannot transfer knowledge from "pick up red block" to "pick up blue block" because their representations share zero structure.
- Fixed task set. Adding a new task requires changing the input dimensionality of the network. You cannot specify a task at test time that wasn't in the training set.
- No compositionality. One-hot vectors cannot represent combinations of known concepts. There is no way to compose "pick up" with "green cylinder" if that specific combination was never trained.
Language embeddings solve all three problems. Instead of a one-hot vector, the task is specified as a natural language string (e.g., "pick up the green cylinder"), processed by a pretrained language encoder into a dense vector (typically 512 or 768 dimensions), and this embedding conditions the policy. Similar instructions produce similar embeddings, so the policy can interpolate between known tasks in embedding space.
Language-conditioned policy: π(a | o, f(l)), where f: Σ* → R^d is a pretrained language encoder mapping instruction strings to d-dimensional embeddings.
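The contrast between one-hot and language conditioning can be sketched in a few lines. This is a toy illustration (the MLP body, dimensions, and the random vector standing in for f(l) are all placeholders, not any system from this article): the policy body is identical, but the one-hot variant's input width is tied to a fixed task set, while the embedding variant accepts any instruction the encoder can embed.

```python
import torch
import torch.nn as nn

class ConditionedPolicy(nn.Module):
    """Same policy body, conditioned by concatenating a task representation."""

    def __init__(self, obs_dim: int, cond_dim: int, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, obs: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Conditioning by concatenation: [observation ; task representation]
        return self.net(torch.cat([obs, cond], dim=-1))

obs = torch.randn(2, 32)

# One-hot conditioning: adding an 11th task would require resizing cond_dim.
one_hot_policy = ConditionedPolicy(obs_dim=32, cond_dim=10)
task_ids = nn.functional.one_hot(torch.tensor([0, 3]), num_classes=10).float()
print(one_hot_policy(obs, task_ids).shape)  # torch.Size([2, 7])

# Language conditioning: any instruction maps to a fixed-size embedding,
# so the task set is open-ended. A random 512-d vector stands in for f(l).
lang_policy = ConditionedPolicy(obs_dim=32, cond_dim=512)
lang_emb = torch.randn(2, 512)  # placeholder for a pretrained encoder's output
print(lang_policy(obs, lang_emb).shape)  # torch.Size([2, 7])
```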
Generalization to unseen instructions
The promise of language conditioning is zero-shot generalization: if the policy has learned to execute "pick up the red block" and "push the blue cylinder," it may be able to execute "pick up the blue cylinder" without ever having seen that specific instruction during training. This works because the language encoder decomposes the instruction into semantic components — an action concept ("pick up") and an object concept ("blue cylinder") — and the policy learns to respond to these components independently.
Lynch and Sermanet (2021) demonstrated this with their Language-Conditioned Imitation Learning framework, showing that policies trained with language-labeled demonstrations could generalize to novel instruction phrasings (e.g., "grasp the can" when trained on "pick up the can") and novel attribute compositions (e.g., "pick up the green object" when only "pick up the red object" and "push the green object" were in training data).
Language-conditioned generalization rests on a strong assumption: that the policy's behavior decomposes along the same lines as the language. If the embedding for "pick up green cylinder" is roughly "pick up" + "green" + "cylinder," then a policy that has seen each component in different combinations should generalize to new combinations. In practice, this works for simple attribute-action compositions but breaks down for complex spatial or relational instructions like "place the cup to the left of the tallest block."
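The decomposition assumption above can be made concrete with a toy bag-of-components encoder. This is purely illustrative (the component vectors are random; a real encoder's geometry is only approximately additive): an unseen instruction shares structure with both training instructions, which is exactly what a one-hot representation cannot express.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical component vectors standing in for learned word semantics.
components = {w: rng.normal(size=64) for w in
              ["pick up", "push", "red", "green", "block", "cylinder"]}

def embed(instruction: str) -> np.ndarray:
    # Bag-of-components embedding: sum the vectors of known components,
    # then normalize to unit length.
    vec = sum(v for w, v in components.items() if w in instruction)
    return vec / np.linalg.norm(vec)

train = ["pick up the red block", "push the green cylinder"]
novel = "pick up the green cylinder"

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

# The novel instruction shares "pick up" with the first training instruction
# and "green" + "cylinder" with the second, so its embedding sits between them.
sims = {t: cos(embed(novel), embed(t)) for t in train}
print(sims)
```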
Language Embeddings for Robotics
The choice of language encoder is one of the most important design decisions in a language-conditioned policy. The encoder must produce embeddings where semantically similar instructions are close together, and where the structure of the embedding space aligns with the structure of the task space. Not all language encoders are equally suited to this role.
Sentence-level encoders
Early work in language-conditioned robotics used general-purpose sentence encoders:
- Universal Sentence Encoder (Cer et al., 2018): Produces 512-d embeddings using a transformer or deep averaging network trained on a variety of NLP tasks. Used in early language-conditioned manipulation work due to its simplicity and availability.
- BERT sentence embeddings (Devlin et al., 2019): BERT's [CLS] token or mean-pooled token embeddings provide 768-d sentence representations. However, raw BERT embeddings are not optimized for semantic similarity — two sentences with similar meaning may have distant embeddings unless the model is fine-tuned.
- Sentence-BERT (Reimers and Gurevych, 2019): Fine-tunes BERT with a siamese architecture on natural language inference (NLI) data, producing embeddings where cosine similarity correlates with semantic similarity. This is crucial for robotics: we need "pick up the cup" and "grasp the mug" to be close in embedding space.
The critical property for robotics is semantic smoothness: small changes in instruction meaning should produce small changes in embedding space. If "move left" and "move right" are distant, but "move left" and "translate leftward" are close, the policy can generalize across paraphrases while distinguishing genuinely different tasks.
CLIP text encoder
CLIP's text encoder (Radford et al., 2021) has become the dominant choice for VLA systems. Its key advantage is that it was trained jointly with a vision encoder, so text embeddings already live in the same space as visual features. The instruction "red cup" is close to the visual features of a red cup image, providing a natural bridge between language and perception.
CLIP uses a 12-layer, 512-d transformer (for ViT-B/32) or a larger 12-layer, 768-d transformer (for ViT-L/14) as its text encoder. Text is tokenized with byte-pair encoding (BPE), and the embedding of the end-of-sequence token is projected to the shared embedding space via a learned linear projection.
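The EOS-token pooling step can be sketched as follows. This is a simplified stand-in, not CLIP's actual implementation (the tensors are random and the projection is freshly initialized; in the real model the EOS hidden state passes through a learned `text_projection` into the shared image-text space):

```python
import torch
import torch.nn as nn

def pool_and_project(hidden: torch.Tensor,
                     token_ids: torch.Tensor,
                     eos_id: int,
                     proj: nn.Linear) -> torch.Tensor:
    """CLIP-style text pooling (sketch): take the hidden state at each
    sequence's end-of-text token, then linearly project it into the
    shared embedding space.

    hidden:    (B, T, D) transformer outputs
    token_ids: (B, T) input token ids, used to locate the EOS token
    """
    eos_pos = (token_ids == eos_id).int().argmax(dim=-1)    # (B,) first EOS
    pooled = hidden[torch.arange(hidden.size(0)), eos_pos]  # (B, D)
    return proj(pooled)                                      # (B, d_shared)

B, T, D = 2, 10, 512
hidden = torch.randn(B, T, D)
token_ids = torch.full((B, T), 5)
token_ids[0, 4] = 49407  # 49407 is CLIP's <|endoftext|> id;
token_ids[1, 7] = 49407  # EOS lands at a different position per sequence
proj = nn.Linear(D, 512, bias=False)
emb = pool_and_project(hidden, token_ids, eos_id=49407, proj=proj)
print(emb.shape)  # torch.Size([2, 512])
```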
For robotics, CLIP text embeddings have two useful properties:
- Object grounding. Because CLIP was trained on image-text pairs, the text encoder learns to associate words with their visual appearances. "Red cup" activates in a region of embedding space corresponding to images of red cups.
- Action abstraction. While CLIP was not trained on action language specifically, the text encoder still captures verb semantics. "Pick up," "lift," and "raise" are closer together than "pick up" and "push."
SigLIP (Zhai et al., 2023) offers an improved alternative with a sigmoid-based contrastive loss (rather than softmax), better calibrated embeddings, and slightly stronger performance. Both RT-2 and OpenVLA use pretrained vision-language model text encoders rather than standalone sentence encoders.
Embedding comparison
| Encoder | Dim | Vision-Aligned | Semantic Similarity | Used In |
|---|---|---|---|---|
| Universal Sentence Encoder | 512 | No | Moderate | Early LCP work |
| BERT [CLS] | 768 | No | Weak (not fine-tuned) | BC-Z baseline |
| Sentence-BERT | 768 | No | Strong | Language-conditioned IL |
| CLIP ViT-B/32 text | 512 | Yes | Moderate | CLIPort, SayCan |
| CLIP ViT-L/14 text | 768 | Yes | Moderate | RT-1 |
| SigLIP text | 768 | Yes | Strong | OpenVLA |
| T5 / PaLM / LLaMA | varies | No (bridged via VLM) | Very strong | RT-2, OpenVLA |
Grounding Language to Actions
Given a language embedding and a visual observation, how do we combine them to produce an action? This is the conditioning mechanism — the architectural component that injects language information into the policy network. Three approaches dominate the literature: FiLM conditioning, cross-attention, and token concatenation.
FiLM conditioning
Feature-wise Linear Modulation (Perez et al., 2018) is a general-purpose conditioning mechanism that modulates intermediate features using learned affine transformations derived from the conditioning input. Given a visual feature map h with channels c, FiLM applies:
FiLM(hc) = γc(l) · hc + βc(l), where γ(l) = Wγ f(l) + bγ and β(l) = Wβ f(l) + bβ
Here f(l) is the language embedding, and γ and β are channel-wise scale and shift parameters predicted by linear layers from the language embedding. Intuitively, FiLM allows the language instruction to "select" which visual features are relevant: when the instruction says "red cup," the FiLM parameters amplify channels that respond to red objects and suppress irrelevant channels.
FiLM conditioning is computationally cheap (just two linear projections and element-wise operations), does not increase sequence length, and can be inserted at multiple layers of a visual backbone. RT-1 (Brohan et al., 2022) uses FiLM conditioning at every block of its EfficientNet-B3 encoder, allowing language to modulate visual features at multiple scales.
FiLM is particularly well-suited to robotics because it preserves the spatial structure of visual features. Unlike concatenation (which changes the feature dimension) or cross-attention (which requires additional transformer layers), FiLM simply rescales existing spatial features based on the task instruction. This means the spatial information needed for precise manipulation is preserved while being modulated by task-relevant language context.
Cross-attention
Cross-attention treats language tokens as keys and values that visual tokens attend to. Given visual token representations V and language token representations L:
CrossAttn(V, L) = softmax(QV KLᵀ / √d) VL, where QV = WQ V, KL = WK L, VL = WV L
Cross-attention is more expressive than FiLM because it allows each visual token to attend to different parts of the language instruction. A visual token corresponding to a red region in the image might attend strongly to the word "red" in the instruction, while a token near a drawer handle might attend to "open." This fine-grained, position-dependent modulation enables more precise language grounding.
The cost is computational: cross-attention adds O(NV × NL) computation per layer, where NV is the number of visual tokens and NL the number of language tokens. For a ViT-L with 256 visual tokens and a 20-token instruction, this adds 256 × 20 = 5,120 attention computations per layer. Perceiver-based architectures (Jaegle et al., 2021) mitigate this by using a smaller set of latent tokens that cross-attend to both modalities.
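Cross-attention between modalities can be sketched with PyTorch's built-in attention module. The dimensions below are illustrative (196 patch tokens, a 20-token instruction), not taken from any particular system:

```python
import torch
import torch.nn as nn

d_model, n_visual, n_lang = 256, 196, 20
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

visual_tokens = torch.randn(2, n_visual, d_model)  # queries: image patches
language_tokens = torch.randn(2, n_lang, d_model)  # keys/values: instruction

# Each visual token attends over the instruction tokens independently,
# giving the position-dependent grounding described above.
grounded, attn_weights = attn(query=visual_tokens,
                              key=language_tokens,
                              value=language_tokens)
print(grounded.shape)      # torch.Size([2, 196, 256])
print(attn_weights.shape)  # torch.Size([2, 196, 20])
```

Each row of `attn_weights` is a distribution over the 20 instruction tokens for one visual token, which is where per-patch word attention (e.g. a red region attending to "red") would show up.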
Token concatenation
The simplest and most general approach: tokenize both the language instruction and the visual observation, concatenate them into a single sequence, and process everything with a standard transformer. The self-attention mechanism implicitly learns to cross-attend between modalities.
This is the approach used by RT-2 (Brohan et al., 2023) and essentially all VLM-based VLA models. The language instruction is tokenized as text tokens, visual observations are tokenized as patch embeddings, and both are fed to a large pretrained vision-language model. Actions are produced either as additional output tokens (RT-2) or via a separate action head.
Token concatenation has the advantage of leveraging the full capacity of pretrained VLMs without architectural modifications. The self-attention mechanism can learn arbitrary relationships between visual and language tokens. The disadvantage is computational cost: the full sequence (language + visual + action tokens) must be processed by every transformer layer.
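Token concatenation can be sketched with a tiny shared transformer. Everything here is a toy stand-in (sizes, the randomly initialized backbone, and the learned action-query tokens are assumptions; RT-2 instead fine-tunes a pretrained multi-billion-parameter VLM and emits action tokens autoregressively):

```python
import torch
import torch.nn as nn

d_model = 256
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)
action_head = nn.Linear(d_model, 256)  # logits over 256 action bins

lang_tokens = torch.randn(2, 20, d_model)    # tokenized instruction
patch_tokens = torch.randn(2, 196, d_model)  # ViT-style patch embeddings
action_queries = torch.randn(2, 7, d_model)  # one slot per action dimension

# One sequence, one model: self-attention mixes modalities implicitly.
sequence = torch.cat([lang_tokens, patch_tokens, action_queries], dim=1)
out = backbone(sequence)
action_logits = action_head(out[:, -7:])  # read actions off the last 7 slots
print(action_logits.shape)  # torch.Size([2, 7, 256])
```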
- FiLM Conditioning: language produces scale/shift parameters that modulate visual feature channels. Preserves spatial structure. Used in RT-1, BC-Z.
- Cross-Attention: visual tokens attend to language tokens with learned queries/keys/values. Position-dependent grounding. Used in CLIPort, Perceiver-based models.
- Token Concatenation: concatenate language and visual tokens into one sequence. Leverages pretrained VLM attention. Used in RT-2, OpenVLA, Gato.
- Goal-Conditioned + Language: use the language embedding as a goal vector in a goal-conditioned RL framework; the embedding replaces a goal image or state. Used in Lynch & Sermanet (2021).
Multi-Task Policies
Language conditioning enables a single policy to perform many tasks, specified through natural language at inference time. This section surveys the landmark systems that established the field of multi-task language-conditioned manipulation.
CLIPort (Shridhar et al., 2022)
CLIPort combines CLIP's semantic understanding with Transporter Networks' (Zeng et al., 2021) spatial precision for tabletop manipulation. It uses a two-stream architecture:
- Semantic stream: CLIP encodes both the language instruction and the visual observation into a shared embedding space. CLIP features identify what to manipulate.
- Spatial stream: A ResNet-based Transporter Network provides pixel-precise pick-and-place affordances. It identifies where to act.
The two streams are fused via element-wise multiplication — the semantic stream gates the spatial stream, so only semantically relevant locations receive high affordance scores. Given the instruction "put the red block in the green bowl," CLIP features activate on the red block and green bowl, while the Transporter provides precise pixel coordinates for the pick and place actions.
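The gating fusion can be sketched in a few lines. The maps below are random placeholders (real CLIPort produces them from its CLIP and Transporter streams); the point is only the element-wise product and the argmax readout of a pick location:

```python
import torch

H, W = 64, 64
semantic = torch.rand(1, H, W)  # "what": language-relevant regions (toy)
spatial = torch.rand(1, H, W)   # "where": pick/place affordances (toy)

# Element-wise gating: a location scores high only if it is both
# semantically relevant and physically actionable.
fused = semantic * spatial
flat_idx = int(fused.flatten().argmax())
pick_uv = (flat_idx // W, flat_idx % W)  # pixel coordinates of the pick
print(pick_uv)
```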
CLIPort demonstrated impressive generalization: trained on 1,000 demonstrations for 10 tasks, it achieved strong performance on unseen attribute-object combinations and novel spatial configurations. However, it is limited to top-down tabletop pick-and-place and does not handle more complex manipulation skills.
BC-Z (Jang et al., 2022)
BC-Z (Behavioral Cloning — Zero-shot) was one of the first systems to demonstrate zero-shot generalization to unseen tasks through language conditioning at scale. Trained on 100 tasks with 25,936 demonstrations collected on real robots, BC-Z uses a ResNet-18 visual encoder and a sentence encoder (with experiments comparing Universal Sentence Encoder, CLIP, and MUSE embeddings) to condition a policy via FiLM layers.
The key finding was that language conditioning outperformed one-hot task IDs when evaluated on held-out tasks, confirming the generalization hypothesis. BC-Z also showed that the choice of language encoder matters significantly — CLIP embeddings outperformed Universal Sentence Encoder, and task-specific fine-tuned embeddings outperformed generic ones.
SayCan (Ahn et al., 2022)
SayCan takes a different approach: rather than conditioning a single policy on arbitrary language, it uses a large language model (PaLM) to decompose high-level instructions into sequences of low-level skills, each of which has its own language-conditioned policy. The key innovation is the affordance function: each candidate skill has an associated value function that estimates the probability of successful execution given the current observation.
The LLM scores candidate skills by language likelihood (what should be done), and the value function scores them by affordance (what can be done). The product of these two scores selects the next skill to execute: skill* = argmax over skills π of p(π | instruction) · p(success | π, observation).
SayCan demonstrated that combining the world knowledge of LLMs with the physical grounding of learned affordances enables robots to execute complex, multi-step instructions like "I spilled my drink, can you help?" (which requires finding a sponge, picking it up, bringing it to the spill, etc.). It operated over a vocabulary of 551 skills, each conditioned on language.
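The scoring rule reduces to a few lines of Python. All numbers below are invented for illustration (SayCan's actual scores come from PaLM log-likelihoods and learned value functions over 551 skills):

```python
# LLM likelihood: which skill SHOULD come next for
# "I spilled my drink, can you help?" (made-up scores)
llm_score = {
    "find a sponge": 0.45,
    "pick up the sponge": 0.30,
    "go to the trash can": 0.15,
    "pick up the apple": 0.10,
}

# Affordance: which skill CAN succeed from the current state (made-up scores)
affordance = {
    "find a sponge": 0.90,
    "pick up the sponge": 0.20,  # no sponge visible yet
    "go to the trash can": 0.85,
    "pick up the apple": 0.95,
}

# SayCan selection: product of "should" and "can"
combined = {s: llm_score[s] * affordance[s] for s in llm_score}
next_skill = max(combined, key=combined.get)
print(next_skill)  # "find a sponge": both useful and currently feasible
```

Note how "pick up the sponge" is downweighted despite high language likelihood, because its affordance is low until a sponge has actually been found.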
Gato (Reed et al., 2022)
Gato from DeepMind is a generalist agent that uses a single transformer to perform over 600 tasks across different domains: Atari games, text generation, image captioning, and robotic manipulation. For robotics tasks, Gato tokenizes observations (images as ViT patch embeddings), language instructions (as text tokens), proprioceptive state, and actions into a single sequence and processes everything with a 1.2B parameter transformer.
Gato uses token concatenation exclusively — no FiLM, no cross-attention. The language instruction is simply prepended to the observation-action sequence. Despite its simplicity, Gato demonstrated that a single set of weights can perform both language and manipulation tasks, suggesting that the representations learned for language modeling are useful for action prediction.
Language-Conditioned Architectures
RT-1: FiLM-conditioned EfficientNet
Robotics Transformer 1 (Brohan et al., 2022) is the architecture that scaled language-conditioned manipulation to 130k real-world demonstrations across 700+ tasks. Its language conditioning uses a two-stage process:
- Language encoding: The instruction is encoded by a pretrained Universal Sentence Encoder (USE) into a 512-d embedding vector.
- FiLM conditioning: This embedding is projected through learned linear layers to produce scale (γ) and shift (β) parameters for each convolutional block in an EfficientNet-B3 backbone. The visual features are modulated at every stage, allowing the language instruction to influence feature extraction from early edges to high-level semantics.
After FiLM conditioning, TokenLearner (Ryoo et al., 2021) compresses the spatial feature map from 81 tokens (a 9×9 grid) down to 8 tokens via learned spatial attention, and a small transformer (8 layers, 8 heads, 512-d) processes the token sequence to predict discretized actions. The action space is discretized into 256 bins per dimension, with 11 action dimensions (7 for arm pose and gripper, 3 for base motion, and 1 for mode switching).
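The TokenLearner compression step can be sketched as follows. This is a deliberately simplified toy (a single linear layer produces the attention maps; the real module uses small convolutional blocks), intended only to show how N spatial tokens collapse to K learned tokens:

```python
import torch
import torch.nn as nn

class TokenLearnerSketch(nn.Module):
    """Simplified TokenLearner: compress N spatial tokens to K tokens
    via K learned spatial attention maps (illustrative, not RT-1's
    exact implementation)."""

    def __init__(self, dim: int, num_output_tokens: int = 8):
        super().__init__()
        self.attn = nn.Linear(dim, num_output_tokens)  # K attention maps

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) -> compressed: (B, K, D)
        weights = self.attn(tokens).softmax(dim=1)  # normalize over N
        return torch.einsum("bnk,bnd->bkd", weights, tokens)

tokens = torch.randn(2, 81, 512)  # 9x9 feature map flattened to 81 tokens
compressed = TokenLearnerSketch(dim=512)(tokens)
print(compressed.shape)  # torch.Size([2, 8, 512])
```

Each of the 8 output tokens is a spatially weighted average of the 81 inputs, so the downstream transformer attends over a 10x shorter sequence.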
RT-1 demonstrated that FiLM conditioning scales effectively: with 130k demonstrations, the same architecture handles 700+ tasks including picking, placing, opening, closing, and moving objects. Performance on seen tasks was 97% success rate, with 76% on unseen object-task combinations.
RT-2: language and actions as token sequences
RT-2 (Brohan et al., 2023) takes a fundamentally different approach: instead of using language to modulate a vision backbone, it treats language and actions as part of the same token sequence, processed by a pretrained vision-language model (PaLI-X at 55B parameters or PaLM-E at 12B).
The key insight is that actions can be represented as text tokens. RT-2 discretizes each action dimension into 256 bins and maps each bin to a token in the VLM's vocabulary (specifically, integer tokens 0–255). A 7-DoF action becomes a sequence of 7 tokens, naturally integrated with the text token vocabulary. The VLM is then fine-tuned on robot data where inputs are (image, instruction) pairs and outputs are action token sequences.
Output: [128] [64] [132] [96] [88] [200] [1], corresponding to action dimensions (x, y, z, rx, ry, rz, gripper).
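The discretization can be sketched as a round-trip. This is a hedged toy version: the uniform 256-bin grid over [-1, 1] is an assumption for illustration, and the real RT-2 additionally maps each bin index onto a token in the VLM's text vocabulary:

```python
import numpy as np

N_BINS = 256

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map each continuous dimension in [-1, 1] to an integer bin index."""
    clipped = np.clip(action, -1.0, 1.0)
    return ((clipped + 1.0) / 2.0 * (N_BINS - 1)).round().astype(int)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Decode bin indices back to continuous values."""
    return tokens / (N_BINS - 1) * 2.0 - 1.0

action = np.array([0.0, -0.5, 0.5, 0.1, -0.1, 0.9, 1.0])  # 7-DoF action
tokens = action_to_tokens(action)
decoded = tokens_to_action(tokens)
print(tokens)
print(np.abs(decoded - action).max())  # quantization error is at most 1/255
```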
This design eliminates the need for any custom conditioning mechanism. The VLM's self-attention naturally handles vision-language-action interactions through the same mechanism it uses for visual question answering. The pretrained weights provide strong language understanding and visual grounding, and fine-tuning on robot data teaches the model to generate appropriate action tokens.
RT-2's results showed dramatic improvements in semantic generalization. Compared to RT-1, RT-2 achieved 3x better performance on novel object-task combinations, and could even respond to instructions requiring chain-of-thought reasoning (e.g., "move the object that is not a fruit to the other side").
The evolution from RT-1 to RT-2 represents a paradigm shift in how language conditions robot behavior. RT-1 uses language as a conditioning signal that modulates a specialized robotics architecture. RT-2 uses language as a shared medium — instructions and actions are both token sequences processed by the same general-purpose model. This shift enables transfer from internet-scale language and vision data but sacrifices the spatial precision guarantees of architectures specifically designed for manipulation.
Semantic Generalization
The central promise of language conditioning is that the structure of language embedding space enables generalization beyond the training distribution. This section examines when and how this actually works.
Paraphrase generalization is the simplest form: executing "grasp the can" when trained on "pick up the can." This works reliably because pretrained language encoders place paraphrases close together in embedding space. BC-Z reported near-identical performance on paraphrased instructions, confirming that the policy responds to the embedding, not the specific word sequence.
Attribute-object generalization is more interesting: executing "pick up the green block" when trained on "pick up the red block" and "push the green block" separately. This requires the policy to decompose behavior along the same dimensions as language — separating the action ("pick up") from the object descriptor ("green block") and recombining them. CLIPort, BC-Z, and RT-1 all demonstrate this capability to varying degrees.
Semantic category generalization goes further: executing "pick up the fruit" when trained only on specific fruit names ("pick up the apple," "pick up the banana"). This requires the language encoder to understand hypernym-hyponym relationships. CLIP and LLM-based encoders handle this naturally because they encode category hierarchies.
Spatial and relational generalization is where current methods struggle. Instructions like "place the cup to the left of the tallest block" require compositional spatial reasoning that most language encoders do not explicitly represent. The embedding of this instruction may not decompose neatly into "left of" and "tallest," making it difficult for the policy to generalize from simpler spatial instructions.
Limitations
Language-conditioned policies are a powerful framework, but they face several fundamental challenges that remain active areas of research.
Ambiguity. Natural language is inherently ambiguous. "Pick up the cup" is unambiguous when there is one cup, but what if there are three? "Move it over there" requires resolving two pronouns and a spatial reference. Current language-conditioned policies typically assume unambiguous instructions and fail gracefully (by choosing arbitrarily among valid interpretations) or ungracefully (by producing incoherent behavior) when this assumption is violated.
Grounding failures. The language encoder may associate "cup" with visual concepts of cups, but grounding can fail in subtle ways. An instruction to "pick up the Nalgene" requires knowing what a Nalgene is and recognizing one visually — grounding that depends on the breadth of the training data for both the language encoder and the robot policy. Objects outside the distribution of either component will fail to ground.
Compositionality challenges. While language is compositional, learned policies often are not. A policy that can "pick up the red block" and "place on the green circle" individually may fail at "pick up the red block and place it on the green circle" because the conjunction of two skills requires temporal coordination that neither demonstration alone provides.
Underspecification. Language often leaves critical details unspecified. "Put the cup on the shelf" does not specify which shelf, where on the shelf, in what orientation, or how fast. The policy must fill in these details from learned priors, which may not match the user's intent. This is a fundamental mismatch between the low-dimensional information content of a short instruction and the high-dimensional action sequence needed to execute it.
Temporal abstraction. A single language instruction may correspond to hundreds or thousands of low-level actions. "Make a sandwich" is a temporally extended task involving dozens of subtasks. Current language-conditioned policies work best for short-horizon tasks (10–50 actions) and require hierarchical decomposition (as in SayCan) for longer tasks.
Active research is addressing these limitations through several directions: interactive language grounding (asking clarifying questions when instructions are ambiguous), program synthesis (converting language to executable code rather than direct motor commands), hierarchical language conditioning (different language embeddings at different temporal scales), and multimodal instructions (combining language with pointing gestures, sketches, or goal images to resolve ambiguity).
Code Examples
These code examples illustrate the core mechanisms discussed in this article: FiLM conditioning, a minimal language-conditioned policy, and multi-task data loading with language labels.
FiLM conditioning layer
import torch
import torch.nn as nn
class FiLMLayer(nn.Module):
"""Feature-wise Linear Modulation (Perez et al., 2018).
Modulates visual feature channels using scale (gamma) and shift (beta)
parameters predicted from a language embedding.
"""
def __init__(self, num_channels: int, language_dim: int):
super().__init__()
# Project language embedding to per-channel scale and shift
self.gamma_proj = nn.Linear(language_dim, num_channels)
self.beta_proj = nn.Linear(language_dim, num_channels)
# Initialize to identity transform (gamma=1, beta=0)
nn.init.ones_(self.gamma_proj.bias)
nn.init.zeros_(self.gamma_proj.weight)
nn.init.zeros_(self.beta_proj.bias)
nn.init.zeros_(self.beta_proj.weight)
def forward(self, visual_features: torch.Tensor, language_emb: torch.Tensor):
"""
Args:
visual_features: (B, C, H, W) convolutional feature map
language_emb: (B, D) language embedding vector
Returns:
Modulated features: (B, C, H, W)
"""
gamma = self.gamma_proj(language_emb).unsqueeze(-1).unsqueeze(-1) # (B, C, 1, 1)
beta = self.beta_proj(language_emb).unsqueeze(-1).unsqueeze(-1) # (B, C, 1, 1)
return gamma * visual_features + beta
class FiLMConditionedEncoder(nn.Module):
"""Vision encoder with FiLM conditioning at each block (simplified RT-1 style)."""
def __init__(self, language_dim: int = 512):
super().__init__()
# Simplified 3-block CNN (real RT-1 uses EfficientNet-B3)
self.block1 = nn.Sequential(
nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
nn.MaxPool2d(2)
)
self.film1 = FiLMLayer(64, language_dim)
self.block2 = nn.Sequential(
nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
nn.MaxPool2d(2)
)
self.film2 = FiLMLayer(128, language_dim)
self.block3 = nn.Sequential(
nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
nn.Conv2d(256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
nn.MaxPool2d(2)
)
self.film3 = FiLMLayer(256, language_dim)
def forward(self, image: torch.Tensor, language_emb: torch.Tensor):
"""Returns FiLM-conditioned visual features."""
h = self.block1(image)
h = self.film1(h, language_emb)
h = self.block2(h)
h = self.film2(h, language_emb)
h = self.block3(h)
h = self.film3(h, language_emb)
return h # (B, 256, H/8, W/8)
Language-conditioned policy
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPTokenizer
class LanguageConditionedPolicy(nn.Module):
"""Minimal language-conditioned policy.
Encodes language instructions with CLIP, conditions a vision encoder
via FiLM, and predicts continuous actions.
"""
def __init__(self, action_dim: int = 7, freeze_clip: bool = True):
super().__init__()
# Language encoder (frozen CLIP text encoder)
self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
self.tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
if freeze_clip:
for param in self.clip.parameters():
param.requires_grad = False
language_dim = 512 # CLIP ViT-B/32 text embedding dimension
# FiLM-conditioned vision encoder
self.encoder = FiLMConditionedEncoder(language_dim=language_dim)
# Action prediction head
self.action_head = nn.Sequential(
nn.AdaptiveAvgPool2d(1),
nn.Flatten(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, action_dim),
nn.Tanh() # Actions in [-1, 1], rescaled to actual limits
)
def encode_language(self, instructions: list[str]) -> torch.Tensor:
"""Encode a batch of language instructions to CLIP embeddings."""
tokens = self.tokenizer(
instructions, padding=True, truncation=True,
max_length=77, return_tensors="pt"
).to(next(self.parameters()).device)
with torch.no_grad():
text_features = self.clip.get_text_features(**tokens)
return text_features / text_features.norm(dim=-1, keepdim=True)
def forward(self, image: torch.Tensor, instructions: list[str]):
"""
Args:
image: (B, 3, 224, 224) RGB image
instructions: list of B language instruction strings
Returns:
actions: (B, action_dim) predicted actions
"""
language_emb = self.encode_language(instructions)
visual_features = self.encoder(image, language_emb)
actions = self.action_head(visual_features)
return actions
# Usage example
policy = LanguageConditionedPolicy(action_dim=7)
image = torch.randn(2, 3, 224, 224)
instructions = ["pick up the red cup", "push the blue box to the left"]
actions = policy(image, instructions)
print(f"Predicted actions shape: {actions.shape}") # (2, 7)
Multi-task data loading with language labels
import torch
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
import numpy as np
from pathlib import Path
import json
class MultiTaskRobotDataset(Dataset):
"""Dataset for multi-task language-conditioned imitation learning.
Each demonstration is stored as a directory containing:
- metadata.json: {"task": "pick up red cup", "num_steps": 50}
- observations/: images as .npy files
- actions/: action vectors as .npy files
"""
def __init__(self, data_root: str, task_augmentation: bool = True):
self.data_root = Path(data_root)
self.task_augmentation = task_augmentation
# Index all demonstrations
self.episodes = []
for demo_dir in sorted(self.data_root.iterdir()):
if not demo_dir.is_dir():
continue
meta = json.loads((demo_dir / "metadata.json").read_text())
for step in range(meta["num_steps"]):
self.episodes.append({
"dir": demo_dir,
"step": step,
"task": meta["task"]
})
# Build task vocabulary for balanced sampling
self.tasks = list(set(ep["task"] for ep in self.episodes))
self.task_counts = {t: sum(1 for ep in self.episodes if ep["task"] == t)
for t in self.tasks}
# Paraphrase templates for augmentation
self.paraphrase_templates = {
"pick up": ["grasp", "grab", "lift", "take"],
"push": ["slide", "move", "shove"],
"place": ["put", "set down", "position"],
}
def augment_instruction(self, instruction: str) -> str:
"""Replace verbs with synonyms for language augmentation."""
if not self.task_augmentation:
return instruction
for verb, synonyms in self.paraphrase_templates.items():
if verb in instruction:
replacement = np.random.choice([verb] + synonyms)
return instruction.replace(verb, replacement)
return instruction
def get_balanced_sampler(self) -> WeightedRandomSampler:
"""Create a sampler that balances across tasks."""
weights = [1.0 / self.task_counts[ep["task"]] for ep in self.episodes]
return WeightedRandomSampler(weights, len(weights))
def __len__(self):
return len(self.episodes)
def __getitem__(self, idx):
ep = self.episodes[idx]
obs = np.load(ep["dir"] / f"observations/{ep['step']:04d}.npy")
action = np.load(ep["dir"] / f"actions/{ep['step']:04d}.npy")
instruction = self.augment_instruction(ep["task"])
return {
"observation": torch.from_numpy(obs).float(),
"action": torch.from_numpy(action).float(),
"instruction": instruction
}
def collate_fn(batch):
"""Custom collate that handles string instructions."""
return {
"observation": torch.stack([b["observation"] for b in batch]),
"action": torch.stack([b["action"] for b in batch]),
"instruction": [b["instruction"] for b in batch]
}
The paraphrase augmentation in the data loader above is a simplified version of a technique used in BC-Z and RT-1. By training the policy on multiple phrasings of the same instruction ("pick up the cup," "grasp the cup," "grab the cup"), we encourage the policy to rely on the semantic content of the embedding rather than overfitting to specific word sequences. In practice, RT-1 used template-based augmentation to expand each instruction into 10+ paraphrases, significantly improving generalization.
References
Seminal papers and key works referenced in this article.
- Shridhar et al. "CLIPort: What and Where Pathways for Robotic Manipulation." CoRL, 2022. arXiv
- Jang et al. "BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning." CoRL, 2022. arXiv
- Ahn et al. "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." CoRL, 2022. arXiv
- Reed et al. "A Generalist Agent." TMLR, 2022. arXiv
- Lynch & Sermanet. "Language Conditioned Imitation Learning over Unstructured Data." RSS, 2021. arXiv