120 kitchen scenes, 2500+ objects, 100 tasks, and a pipeline that turns 1250 human demos into 100K+ trajectories — making data the cheap part of robot learning.
In 2017, AlphaGo Zero trained by playing millions of games against itself. In 2020, GPT-3 trained on hundreds of billions of tokens of text, most of it scraped from the internet. Both fields had one thing in common: when you need more data, you just get more data.
Robotics doesn't have that luxury.
Getting a robot to pick up a cup requires a human to physically operate the robot, place the cup in front of it, run the demo, reset, and repeat. One trajectory takes 2–5 minutes of human time. At that rate, collecting 100,000 trajectories would take roughly 3,300 to 8,300 hours of teleoperation, which is years of full-time work for a single operator. And unlike internet text, robotic data can't be scraped: it has to be physically enacted in the real world.
The scale difference between robotics and other ML fields isn't marginal; it's catastrophic. Language models train on trillions of tokens. Vision models train on billions of images. The largest publicly available robot datasets have tens of thousands of trajectories. Against vision datasets that is a gap of four to five orders of magnitude, and against language corpora it is wider still.
This gap matters because scaling laws are real. More data reliably produces better policies — if you can get the data. The field needs a way to generate robot data cheaply, at scale, without sacrificing the physics realism that makes the data useful.
Simulation-based robot learning has been tried before. The failures are consistent: simulated environments are too simple (a single tabletop, five objects), tasks are too scripted (stack block A on block B), and the resulting policies fail in real kitchens because they've never seen a cabinet door, a coffee mug among 50 similar objects, or a floor plan wider than two meters.
Domain gap — the mismatch between sim and reality — is the central enemy. The solution isn't to cheat (add sim-to-real tricks to a bad simulator). The solution is to build a simulator realistic and diverse enough that the gap shrinks to begin with.
The kitchen is not an arbitrary choice. It is the hardest domestic environment and the most universal one. Every home has a kitchen. A robot that can navigate and manipulate in a kitchen — opening drawers, placing groceries, brewing coffee — can handle most household tasks.
RoboCasa builds its simulation on top of RoboSuite, a MuJoCo-based robot simulator. What it adds is scale: 10 distinct floor plans crossed with 12 architectural styles produce 120 unique base scenes, before any randomization.
The 10 layouts range from a one-wall kitchen (single strip of cabinets and appliances along one wall — simplest for navigation) to a G-shaped kitchen (counters on three sides plus an island — requires precise spatial reasoning to navigate without collision). Between them are L-shaped, U-shaped, galley, and peninsula configurations.
This variety is deliberate. A policy trained only on a one-wall kitchen learns to always move in a straight line. When placed in a U-shaped kitchen, it collides with the counter on the opposite wall. Floor plan diversity forces policies to learn generalizable navigation.
The same L-shaped floor plan appears in 12 distinct styles: Scandinavian (clean white surfaces, light wood), Mediterranean (terracotta tiles, warm tones), Industrial (exposed metal, dark surfaces), Farmhouse (reclaimed wood, apron sink), and eight more. Each style changes wall color, cabinet finish, counter material, and appliance appearance.
Beyond the 12 base styles, RoboCasa adds domain randomization via 400 AI-generated textures — 100 each for walls, floors, counters, and cabinet panels. During training, each episode samples different textures, preventing policies from relying on color as a shortcut.
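The scene arithmetic and per-episode texture sampling can be sketched in a few lines. This is a minimal illustration, not RoboCasa's API: the layout and style names are placeholders, and it assumes each of the four surface types draws its texture independently.

```python
import random
from itertools import product

# Placeholder identifiers; the real layout and style names are not used here.
LAYOUTS = [f"layout_{i}" for i in range(10)]
STYLES = [f"style_{i}" for i in range(12)]
SURFACES = ("wall", "floor", "counter", "cabinet")
TEXTURES_PER_SURFACE = 100

# Every layout crossed with every style gives the base scenes.
BASE_SCENES = list(product(LAYOUTS, STYLES))

def sample_scene(rng):
    """Pick a base scene, then an independent texture index per surface."""
    layout, style = rng.choice(BASE_SCENES)
    textures = {s: rng.randrange(TEXTURES_PER_SURFACE) for s in SURFACES}
    return {"layout": layout, "style": style, "textures": textures}

rng = random.Random(0)
scene = sample_scene(rng)

# Under the independence assumption, texture combinations per base scene:
n_texture_combos = TEXTURES_PER_SURFACE ** len(SURFACES)
print(len(BASE_SCENES), n_texture_combos)
```

Under that independence assumption, each of the 120 base scenes can take 100⁴ texture combinations, which is why a policy has little chance of seeing the same visual kitchen twice.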
Every appliance in every kitchen is fully interactive with MuJoCo's physics engine. Cabinet doors swing on hinges. Drawer slides have friction. Microwave doors open and close. Stove knobs rotate — and turning one past its threshold triggers the corresponding burner. Coffee machine buttons depress and trigger brewing animations. None of this is scripted: the robot's gripper must physically interact with each surface correctly.
A kitchen without objects is not a kitchen. RoboCasa populates its scenes with 2509 unique 3D objects across 153 categories. That's enough variety that the policy can't memorize appearances — it must generalize.
The objects come from two sources, combined deliberately. Objaverse contributes 917 objects — a curated set from the community 3D object dataset, manually selected for kitchen relevance and physical plausibility. Luma.ai text-to-3D generation contributes 1592 objects — the majority of the dataset.
The Luma.ai pipeline takes a text description ("red apple, slightly bruised") and generates a textured 3D mesh. RoboCasa's team writes category-level prompts ("generate 20 distinct coffee mugs with varied colors, handles, and sizes"), runs generation, then manually filters out objects with broken geometry, incorrect scale, or implausible physics properties.
The filtering step matters. Text-to-3D models are imperfect: they produce floating geometry, inside-out normals, and physically impossible concavities. Every generated object passes a human review before entering the dataset. The result is a large, diverse, physically usable object library that would have taken years to model by hand.
The categories span everything a kitchen actually contains. Produce: apples, bananas, bell peppers, 12 more vegetables. Dairy: milk cartons, cheese wedges, yogurt containers. Proteins: chicken breasts, fish fillets, eggs. Drinks: soda cans, juice bottles, wine glasses. Receptacles: bowls, plates, pots, pans, Tupperware. Tools: spatulas, ladles, measuring cups. Packaged goods: cereal boxes, canned goods, condiment bottles.
Having 153 categories forces generalization across shape and size. A pick-and-place policy can't learn "pick up the object shaped like a cylinder" — it must learn something more general about grasping affordances.
Beyond objects, RoboCasa generates 400 surface textures using MidJourney image synthesis: 100 wall paint colors/patterns, 100 floor tile/hardwood designs, 100 counter materials (marble, granite, laminate, butcher block), and 100 cabinet panel finishes. These are applied during episode initialization, so every rollout sees a different visual kitchen configuration.
A kitchen scene with 2500 objects and beautiful physics is useless without tasks. What should the robot actually do? RoboCasa answers this with a two-level task hierarchy: 25 atomic tasks covering fundamental manipulation skills, and 75 composite tasks that chain atomic skills into realistic kitchen workflows.
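The atomic tasks are the building blocks. Each tests exactly one manipulation primitive, such as opening or closing a door or drawer, pressing a button, twisting a knob, or a single pick-and-place.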
Each atomic task is fully defined by a reward function, a success condition, and an object placement randomization range. The robot succeeds when the success condition is met — and never before. No partial credit.
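One way to picture such a task definition is a small record holding the success condition and the placement range, with a sparse reward derived from the success check. The class, the task name `OpenDrawer`, the state key, and the thresholds below are all hypothetical; they are not taken from the RoboCasa codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = Dict[str, float]

@dataclass(frozen=True)
class AtomicTask:
    name: str
    # Returns True once the task is complete.
    success: Callable[[State], bool]
    # (min, max) offset per axis for randomising the initial placement, in metres.
    placement_range: Dict[str, Tuple[float, float]]

    def reward(self, state: State) -> float:
        """Sparse reward: 1.0 only when the success condition holds, else 0.0."""
        return 1.0 if self.success(state) else 0.0

# Hypothetical example: a drawer counts as open at 90% extension.
open_drawer = AtomicTask(
    name="OpenDrawer",
    success=lambda s: s["drawer_extension"] >= 0.9,
    placement_range={"x": (-0.1, 0.1), "y": (-0.05, 0.05)},
)

print(open_drawer.reward({"drawer_extension": 0.5}))   # half open: no credit
print(open_drawer.reward({"drawer_extension": 0.95}))  # open: full credit
```

Tying the reward to the boolean success check is what "no partial credit" means in practice: a drawer pulled halfway scores the same as one never touched.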
The composite tasks are where things get interesting. Rather than hand-scripting every kitchen workflow, RoboCasa uses GPT-4 and Gemini 1.5 as brainstorming partners in a four-step pipeline. Three of the resulting tasks show the range:
PrepareCoffee: open the cabinet → pick up a mug → place mug under coffee machine → press the brew button. Four atomic skills chained in sequence. Failure at any step cascades — a mug not placed correctly means the coffee brews on the counter.
ArrangeVegetables: pick vegetables from the sink basin → place each on the cutting board in the correct region. Requires discriminating between objects (only the vegetables, not the dish soap), precise placement, and multiple sequential pick-and-place cycles.
StorePantryItem: open the pantry door → pick a canned good from the counter → place it on the correct shelf → close the door. Involves navigation, door manipulation, and placement with depth reasoning.
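The three workflows above can be written down as ordered lists of atomic skills, which makes the reuse of skills across tasks easy to count. The skill labels here are illustrative shorthand, not RoboCasa's task identifiers, and ArrangeVegetables is shown with a single pick-and-place cycle even though the task repeats it per vegetable.

```python
from collections import Counter

# Each step is (skill, target). Labels are hypothetical shorthand.
COMPOSITE_TASKS = {
    "PrepareCoffee": [
        ("open_door", "cabinet"),
        ("pick", "mug"),
        ("place", "coffee_machine"),
        ("press_button", "brew"),
    ],
    "ArrangeVegetables": [
        ("pick", "vegetable"),
        ("place", "cutting_board"),
    ],
    "StorePantryItem": [
        ("open_door", "pantry"),
        ("pick", "canned_good"),
        ("place", "shelf"),
        ("close_door", "pantry"),
    ],
}

# How often each skill appears across the composite tasks.
skill_usage = Counter(
    skill for steps in COMPOSITE_TASKS.values() for skill, _ in steps
)
print(skill_usage.most_common())
```

Even in this three-task sample, picking and placing appear in every workflow, which is the reason atomic-skill quality matters so much for composite performance.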
You have 25 atomic tasks and a kitchen simulator. Now you need data. RoboCasa collects 1250 human demonstrations — 50 per task, performed by 4 operators using a SpaceMouse 6-DOF teleoperation device. That's a decent start. It's not enough to train a strong policy.
The key insight: human demos are seeds, not the final dataset. MimicGen is the automated pipeline that takes those 1250 seeds and transforms them into 100,000+ trajectories — an 80× amplification — without a single additional human in the loop.
A human demo is recorded as a sequence of end-effector poses: where the gripper was at each timestep. The naive approach would replay this pose sequence in a new scene configuration. The problem: if the apple moved 10cm to the left, the replay trajectory reaches into empty air.
MimicGen's solution is object-centric decomposition. It breaks the demo into segments, where each segment is defined relative to the object being manipulated during that segment — not relative to the world frame.
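Take a pick-and-place demo: the gripper approaches the apple, grasps it, lifts it, moves to the bowl, and releases. MimicGen identifies two interaction points: the grasp event and the release event. These divide the trajectory into object-centric segments: the approach and grasp, stored relative to the apple, and the transport and release, stored relative to the bowl.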
Storing segments in object-local frames is the key. The "approach the apple" motion is the same regardless of where the apple is in the kitchen — you just need to know where the apple is, then apply the stored local-frame trajectory relative to that location.
To generate a new trajectory: place the objects at new random positions (within the task's valid region), look up the stored local-frame segments, transform each segment back to world coordinates using the new object pose, and stitch the segments together into a complete trajectory candidate.
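The frame change at the heart of this step is two matrix products. The sketch below uses NumPy and 4×4 homogeneous transforms; the helper names and the example coordinates are illustrative assumptions, not MimicGen's actual functions.

```python
import numpy as np

def make_pose(x, y, z, yaw=0.0):
    """4x4 homogeneous transform: rotation about z by yaw, plus translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = [x, y, z]
    return T

def to_object_frame(ee_poses_world, T_obj_world):
    """Express world-frame end-effector poses relative to the object."""
    T_world_to_obj = np.linalg.inv(T_obj_world)
    return [T_world_to_obj @ T_ee for T_ee in ee_poses_world]

def to_world_frame(ee_poses_local, T_obj_world_new):
    """Re-anchor object-relative poses on the object's new world pose."""
    return [T_obj_world_new @ T_local for T_local in ee_poses_local]

# Demo: the gripper descends onto an apple at (0.5, 0.0, 0.9).
apple_demo = make_pose(0.5, 0.0, 0.9)
demo_segment = [make_pose(0.5, 0.0, 0.9 + dz) for dz in (0.20, 0.10, 0.02)]

# New scene: the apple is shifted 10 cm along y and rotated 90 degrees.
apple_new = make_pose(0.5, 0.1, 0.9, yaw=np.pi / 2)

local_segment = to_object_frame(demo_segment, apple_demo)
new_segment = to_world_frame(local_segment, apple_new)
final_position = new_segment[-1][:3, 3]
print(final_position)
```

The final waypoint lands 2 cm above the apple's new position, just as it did above the old one, because the stored segment never referred to world coordinates in the first place.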
The stitching adds a short interpolation between segments to ensure the end-effector makes a smooth transition. The candidate trajectory is then executed in the simulator. If it succeeds — the task completion condition is met — it's added to the dataset. If it fails (collision, dropped object, timeout), it's discarded. This rejection sampling ensures only valid trajectories enter the training set.
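A minimal version of the stitching step, assuming plain linear interpolation on gripper positions only. Orientation blending (for example, spherical interpolation of rotations) is omitted, and the function names and waypoints are hypothetical.

```python
def interpolate_positions(start, end, n_steps):
    """Return n_steps evenly spaced points strictly between start and end."""
    points = []
    for i in range(1, n_steps + 1):
        alpha = i / (n_steps + 1)
        points.append(tuple(s + alpha * (e - s) for s, e in zip(start, end)))
    return points

def stitch(segments, n_steps=3):
    """Concatenate segments, bridging each gap with interpolated waypoints."""
    stitched = list(segments[0])
    for seg in segments[1:]:
        stitched.extend(interpolate_positions(stitched[-1], seg[0], n_steps))
        stitched.extend(seg)
    return stitched

# Two world-frame segments whose endpoints do not line up.
grasp_segment = [(0.5, 0.1, 1.10), (0.5, 0.1, 0.92)]
place_segment = [(0.9, -0.2, 1.10), (0.9, -0.2, 0.95)]

trajectory = stitch([grasp_segment, place_segment], n_steps=3)
print(len(trajectory))
```

The bridge only guarantees a continuous path, not a collision-free one, which is why every candidate still has to pass the simulator rollout before it is kept.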
| Stage | Count | Human time |
|---|---|---|
| Task design (25 atomic) | 25 tasks | ~200 engineer-hours |
| Human demonstrations | 1,250 demos (50/task) | ~60 operator-hours |
| MimicGen generation | 100,000+ trajectories | ~0 (automated) |
| Amplification factor | 80× | — |
MimicGen generation loop (simplified):

```python
# Simplified sketch: decompose_demo, transform_segment, and
# interpolate_transitions stand in for the pipeline stages described above
# and are not defined here.
def generate_trajectory(human_demo, new_object_poses, env):
    # 1. Decompose human demo into object-centric segments
    segments = decompose_demo(human_demo)
    # segments = [{frame: 'apple_local', traj: [...]}, {frame: 'bowl_local', traj: [...]}]

    # 2. Transform each segment to world frame using new object poses
    world_traj = []
    for seg in segments:
        T_obj = new_object_poses[seg['frame']]  # 4×4 transform matrix
        world_seg = transform_segment(seg['traj'], T_obj)
        world_traj.extend(world_seg)

    # 3. Stitch segments with smooth interpolation
    stitched = interpolate_transitions(world_traj)

    # 4. Execute in simulator (rejection sampling)
    result = env.rollout(stitched)
    if result['success']:
        return stitched  # keep this trajectory
    return None  # discard and try again with new poses
```
One of robotics' most inconvenient truths: a policy trained for a Franka Panda arm is useless on a Boston Dynamics Spot arm. Different joint counts, different kinematics, different sensor setups. Most simulators are built around one specific robot. Policies stay locked to that hardware.
RoboCasa is designed from the start to support multiple embodiments — different robot platforms — all operating in the same kitchen environments, on the same tasks, interacting with the same objects.
The primary platform is Omni-Frankie: a Franka Panda 7-DOF arm mounted on an Omron mobile base. This is a mobile manipulator — the base can drive around the kitchen while the arm performs manipulation. It's the configuration closest to real-world household robot deployments.
The second platform is a humanoid robot — a bipedal form factor with two arms. Humanoids are increasingly commercial (Figure, Agility Robotics, Boston Dynamics Atlas), and their form factor lets them interact with kitchen environments designed for humans: same counter heights, same drawer handles, same appliance layouts.
The third platform is a quadruped with an arm — a legged robot (like Spot) with a manipulator. Quadrupeds have the most stable locomotion in unstructured environments but face additional challenges: the arm must compensate for the body's motion during locomotion.
All three robots operate in the same kitchen, on the same tasks. The table is at the same height. The apple is in the same place. The success condition is identical. What differs is the action space: the humanoid controls 28 joints; Omni-Frankie controls 8 (7 arm + 1 base); the quadruped controls 16 (12 leg + 4 arm).
This forces policies to be embodiment-aware. A policy conditioned on the robot's joint configuration and the task description can — in principle — generalize across embodiments. RoboCasa's shared environment is the testbed where this hypothesis can be empirically evaluated.
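To make the action-space mismatch concrete, here is one common way to let a single policy head serve several robots: pad every action to the largest dimensionality and carry a mask marking the valid entries. Padding with a mask is an assumption made for illustration; the text above does not say how RoboCasa or any particular policy handles this.

```python
# Action dimensions as quoted above, broken down by body part.
ACTION_DIMS = {
    "omni_frankie": {"arm": 7, "base": 1},
    "humanoid": {"joints": 28},
    "quadruped": {"legs": 12, "arm": 4},
}

def total_dim(robot):
    """Total number of controlled dimensions for a robot."""
    return sum(ACTION_DIMS[robot].values())

MAX_DIM = max(total_dim(r) for r in ACTION_DIMS)

def pad_action(robot, action):
    """Zero-pad an action to MAX_DIM and return (padded_action, validity_mask)."""
    n = total_dim(robot)
    if len(action) != n:
        raise ValueError(f"{robot} expects {n} values, got {len(action)}")
    padded = list(action) + [0.0] * (MAX_DIM - n)
    mask = [1] * n + [0] * (MAX_DIM - n)
    return padded, mask

padded, mask = pad_action("omni_frankie", [0.1] * 8)
print({r: total_dim(r) for r in ACTION_DIMS}, len(padded), sum(mask))
```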
The central empirical claim of RoboCasa: more generated data produces better policies, and the trend is consistent across dataset sizes. This isn't assumed — it's measured directly by training the same policy architecture on datasets of increasing size and evaluating on the same 25 atomic tasks.
All experiments use BC-Transformer from RoboMimic — a standard behavior cloning architecture with a transformer backbone. Inputs: camera images (agentview + wrist) + robot proprioception. Output: end-effector delta poses. Multi-task learning: one model trained jointly on all 25 atomic tasks simultaneously. Evaluation: 50 rollouts per task, averaged across all tasks.
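The evaluation protocol amounts to a macro-average: compute a success rate per task, then average those rates so each task counts equally. A small sketch with made-up outcomes for two hypothetical tasks:

```python
from statistics import mean

def task_success_rate(outcomes):
    """Fraction of successful rollouts for one task (outcomes are booleans)."""
    return sum(outcomes) / len(outcomes)

def average_success(results):
    """Mean of per-task success rates, with every task weighted equally."""
    return mean(task_success_rate(o) for o in results.values())

# Toy data: 50 rollouts each for two hypothetical tasks.
results = {
    "OpenDrawer": [True] * 35 + [False] * 15,  # 35 of 50 succeed
    "PickPlace": [True] * 10 + [False] * 40,   # 10 of 50 succeed
}

overall = average_success(results)
print(f"{overall:.1%}")
```

Because every task has the same number of rollouts, the macro-average equals the pooled success rate here; the two would differ only if rollout counts varied across tasks.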
| Dataset | Demos/task | Total demos | Avg success |
|---|---|---|---|
| Human-50 (baseline) | 50 | 1,250 | 28.8% |
| Generated-100 | 100 | 2,400 | 34.2% |
| Generated-300 | 300 | 7,200 | 39.8% |
| Generated-3000 | 3,000 | 72,000 | 47.6% |
Each step up adds roughly 5 to 8 percentage points. From 28.8% to 47.6% is an 18.8 percentage point improvement, a 65% relative increase, from data that costs human time only to set up the pipeline, not to collect.
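The step-by-step gains can be read straight off the table. This snippet only redoes that arithmetic on the four reported averages:

```python
# (dataset, average success in percent), copied from the table above.
RESULTS = [
    ("Human-50", 28.8),
    ("Generated-100", 34.2),
    ("Generated-300", 39.8),
    ("Generated-3000", 47.6),
]

# Percentage-point gain from each dataset size to the next.
gains = [round(b[1] - a[1], 1) for a, b in zip(RESULTS, RESULTS[1:])]

total_gain = round(RESULTS[-1][1] - RESULTS[0][1], 1)
relative_gain = total_gain / RESULTS[0][1]

print(gains, total_gain, f"{relative_gain:.0%}")
```

The individual steps come to 5.4, 5.6, and 7.8 points, even though the last step is a 10× jump in data while the earlier ones are 2× and 3×.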
Averaging across all 25 tasks obscures important structure. Some skills are nearly solved at 72K demos; others are barely begun. The hierarchy of difficulty roughly follows the precision required:
Easiest (~60–70%): open/close doors and drawers. The handle is easy to localize visually; the motion is smooth and repeatable. MimicGen transfers well because the door's local frame fully captures the motion.
Medium (~40–50%): press buttons, twist knobs. Require precise endpoint positioning but the motion is short. Visual localization of buttons is harder when multiple buttons are nearby.
Hardest (~20–35%): pick-and-place, especially with small or round objects. Grasping requires aligning the gripper precisely with an object whose orientation may vary. The success condition (object in target zone) is strict.
Very hard (~10–20%): insertion tasks. Fitting an object into a slot requires sub-centimeter precision. Even 72K demos leave significant room for improvement.
Atomic tasks are hard. Composite tasks are brutal. Where an atomic task requires one skill executed correctly, a composite task requires 3–6 skills executed correctly in sequence. Any failure cascades: a mug not placed under the coffee machine means pressing the brew button succeeds in simulation but fails in purpose.
Training BC-Transformer from scratch on 50 human demonstrations per composite task (the same setup that gives 28.8% on atomic tasks) produces near-zero performance on most composite tasks. On 4 out of 5 evaluated tasks, the success rate is literally 0%. On the fifth (a shorter, simpler chain), it's 2%.
This isn't surprising — it follows directly from error compounding. If each of 4 sub-skills has a 70% success rate (optimistic for a 50-demo policy), the probability of completing all 4 in sequence is 0.7⁴ = 24%. If each sub-skill is at 50%, the chain success is 0.5⁴ = 6.25%. Composite tasks amplify every individual skill weakness.
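The compounding argument is a single exponent, and inverting it shows how good each sub-skill must be to hit a target chain success rate. The sketch assumes sub-skill outcomes are independent and equally reliable, which real policies will not satisfy exactly.

```python
def chain_success(p_step, n_steps):
    """Probability that n_steps independent sub-skills all succeed."""
    return p_step ** n_steps

def required_step_success(p_chain, n_steps):
    """Per-step success rate needed to reach p_chain over n_steps."""
    return p_chain ** (1 / n_steps)

print(f"{chain_success(0.7, 4):.4f}")          # four steps at 70% each
print(f"{chain_success(0.5, 4):.4f}")          # four steps at 50% each
print(f"{required_step_success(0.9, 4):.3f}")  # per-step rate for a 90% chain
```

Under these assumptions, a four-step task that should succeed 90% of the time needs each sub-skill at about 97.4%, far above the atomic success rates reported earlier.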
The key finding: pre-training a policy on all 25 atomic tasks (using the 72K generated dataset) and then fine-tuning on composite task demos substantially improves performance. ArrangeVegetables goes from 2% (scratch) to 12% (fine-tuned). PrepareCoffee goes from 0% to 6%.
Why does this work? The atomic pre-training gives the policy strong primitives — reliable grasping, precise placement, correct button pressing. The fine-tuning on composite demos teaches the policy how to chain these primitives and recover from partial failures. The policy isn't learning grasping from scratch during composite training — it already knows how to grasp.
Even 12% success on ArrangeVegetables means failure 88% of the time. The authors are explicit: composite task performance is a major open problem, not a solved one. RoboCasa's contribution is not solving composite tasks — it's providing the benchmark and the data infrastructure for the field to make progress on them.
All the work so far — 120 kitchen scenes, 2509 objects, 100 tasks, 100K trajectories — is worthless if the policies don't transfer to reality. A robot that works in simulation but fails on a real kitchen counter is not a useful robot.
RoboCasa's sim-to-real experiment is direct: does co-training with sim data improve policies trained on real-world demonstrations?
The real-world setup is a physical kitchen equipped with a Franka Panda arm on DROID hardware — the same physical robot platform used by several major research labs. Three pick-and-place tasks are evaluated: pick tomato from bowl, pick can from cabinet, pick bread from plate.
For each task, 50 real-world demonstrations are collected by human operators. This is the real-only baseline: a policy trained purely on 50 real demos, evaluated on the same task with both seen objects (training objects) and unseen objects (held-out objects the policy never saw during training).
The real+sim condition adds RoboCasa simulation data to the training mix. The policy is co-trained on the combined dataset — real demos for their physical realism, sim demos for diversity and volume.
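Co-training is often implemented by fixing the share of each batch drawn from simulation. The sketch below does that with the standard library; the 50/50 split, the batch size, and the sim dataset size are placeholder assumptions, since the mixing ratio is not stated here.

```python
import random

def cotraining_batch(real_demos, sim_demos, batch_size, sim_fraction, rng):
    """Sample a batch with a fixed fraction of simulated demonstrations."""
    n_sim = round(batch_size * sim_fraction)
    n_real = batch_size - n_sim
    # Sample with replacement so the small real set can be revisited often.
    batch = rng.choices(real_demos, k=n_real) + rng.choices(sim_demos, k=n_sim)
    rng.shuffle(batch)
    return batch

# Stand-in datasets: 50 real demos and a much larger simulated pool.
real = [("real", i) for i in range(50)]
sim = [("sim", i) for i in range(3000)]

batch = cotraining_batch(real, sim, batch_size=32, sim_fraction=0.5,
                         rng=random.Random(0))
n_sim_in_batch = sum(1 for source, _ in batch if source == "sim")
print(len(batch), n_sim_in_batch)
```

Fixing the fraction per batch, rather than simply concatenating the datasets, keeps the 50 real demos from being drowned out by the far larger simulated pool.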
| Condition | Seen objects | Unseen objects |
|---|---|---|
| Real-only (50 demos) | 13.6% | 2.6% |
| Real + Sim (co-trained) | 24.4% | 9.3% |
| Relative improvement | +79% | +258% |
The seen-object improvement (13.6% → 24.4%) is significant but the unseen-object improvement is striking: 2.6% → 9.3%, a 258% relative increase. This is the clearest signal that RoboCasa's diversity is doing real work. The sim data exposes the policy to 2509 different objects; the policy learns more general grasping representations rather than memorizing the appearance of the 3 training objects.
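The relative improvements in the table follow from the usual formula, (after − before) / before:

```python
def relative_improvement(before, after):
    """Fractional change from before to after."""
    return (after - before) / before

seen = relative_improvement(13.6, 24.4)
unseen = relative_improvement(2.6, 9.3)

print(f"seen: +{seen:.0%}, unseen: +{unseen:.0%}")
```

The unseen-object figure is large partly because its baseline is so small: 6.7 percentage points on top of 2.6% is a bigger relative change than 10.8 points on top of 13.6%.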
How well does simulation data actually transfer? The honest answer is: imperfectly, and for specific reasons. RoboCasa's physics (MuJoCo contact dynamics) is not identical to real-world physics. Colors and textures in sim differ from reality. But enough features transfer: relative object positions, approach angles, gripper closure timing.
Domain randomization is the key mechanism. Because training sees 400 different textures and 2509 different objects, the policy cannot rely on any specific visual feature. It must learn invariances — features that predict success across all these variations. Those invariances are more likely to transfer to the real world than features specific to one simulated kitchen.
RoboCasa sits at the intersection of several active research streams. Understanding where it fits helps you know which papers to read next and which limitations to watch for.
RoboCasa is built directly on RoboSuite (Zhu et al., 2020) — the MuJoCo-based robot simulation framework from NVIDIA Research and UT Austin that most of the same authors developed. RoboSuite provides the physics engine, the robot models, and the task infrastructure. RoboCasa adds scale: larger environments, more objects, more tasks, the generative AI pipeline.
MimicGen (Mandlekar et al., 2023) is the trajectory augmentation backbone. RoboCasa uses MimicGen as a component — the contribution is not MimicGen itself but combining MimicGen with a large enough environment that the generated data is meaningfully diverse.
| Simulator | Scale | Physics | Dataset | Key limitation |
|---|---|---|---|---|
| RoboCasa | 120 scenes, 2509 obj | MuJoCo (high) | 100K+ demos | Kitchen only |
| MetaWorld | 50 tasks, minimal scenes | MuJoCo | Small | Tabletop only, no real objects |
| LIBERO | 4 scene suites | MuJoCo | ~500 demos | Small dataset, limited object diversity |
| AI2-THOR | 120 scenes | Unity (lower fidelity) | None | No large-scale dataset, lower physics fidelity |
| Habitat 3.0 | Large-scale | Bullet | Navigation focus | Limited manipulation tasks |
Vision-Language-Action models like pi-0 and OpenVLA are hungry for diverse robot data. Their performance scales with dataset diversity and volume — exactly what RoboCasa provides. The practical path forward: use RoboCasa's 100K sim trajectories as a pre-training corpus, then fine-tune on real-world task demonstrations. This is the co-training recipe from Chapter 8, applied at VLA scale.
The kitchen is also an ideal domain for language grounding. "Put the apple in the bowl" is a natural-language instruction that maps directly to a pick-and-place task. "Brew me a coffee" maps to PrepareCoffee. RoboCasa's tasks are natural candidates for language-conditioned policy training.
Composite task performance. 12% on ArrangeVegetables is not deployment-ready. Long-horizon task completion remains an open problem even with unlimited simulation data.
Domain gap persistence. 24.4% real-world success on pick-and-place is still failure roughly three times out of four. For household deployment, we need 95%+. Better sim-to-real methods (adaptive domain randomization, learned perception modules) are needed.
Kitchen-only scope. The real world has living rooms, bathrooms, garages. A generalist household robot needs simulation environments beyond the kitchen — RoboCasa's methodology will need to be applied to these spaces.
"What I cannot create, I do not understand." — Richard Feynman. In robotics: what a robot cannot simulate, it cannot learn. RoboCasa is an attempt to close that gap.