RoboCasa: Large-Scale Simulation for Generalist Robots

Chapter 0: The Problem

In 2017, AlphaGo Zero trained by playing millions of games against itself. In 2020, GPT-3 trained on hundreds of billions of tokens of text, most of it scraped from the internet. Both fields had one thing in common: when you need more data, you just get more data.

Robotics doesn't have that luxury.

Getting a robot to pick up a cup requires a human to physically operate the robot, place the cup in front of it, run the demo, reset, and repeat. One trajectory takes 2–5 minutes of human time. At that rate, collecting 100,000 trajectories takes roughly 3,300 to 8,300 operator-hours, which is years of full-time work for a single operator. And unlike internet text, robotic data can't be scraped: it has to be physically enacted in the real world.
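A quick back-of-the-envelope check of the collection cost, using the 2–5 minutes per trajectory quoted above. The 8-hour working day and 250 working days per year are assumptions added for illustration; they do not come from the source.

```python
# Back-of-the-envelope cost of real-world data collection.
# Assumptions (not from the source): 8-hour days, 250 working days per year.
TRAJECTORIES = 100_000
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 250


def operator_years(minutes_per_trajectory: float) -> float:
    """Full-time operator-years needed to collect TRAJECTORIES demos."""
    total_hours = TRAJECTORIES * minutes_per_trajectory / 60
    return total_hours / (HOURS_PER_DAY * DAYS_PER_YEAR)


for minutes in (2, 3, 5):
    hours = TRAJECTORIES * minutes / 60
    print(f"{minutes} min/demo: {hours:,.0f} operator-hours, "
          f"{operator_years(minutes):.1f} operator-years")
```

Under these assumptions the total lands between roughly 1.7 and 4.2 operator-years, before counting robot downtime or failed demos.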

The data gap is enormous

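The scale difference between robotics and other ML fields isn't marginal; it's catastrophic. Language models train on trillions of tokens. Vision models train on billions of images. The largest publicly available robot datasets have tens of thousands of trajectories. Against vision datasets that's a gap of four to five orders of magnitude, and against language corpora it is wider still, even allowing for the fact that a token, an image, and a trajectory are not equivalent units.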

This gap matters because scaling laws are real. More data reliably produces better policies — if you can get the data. The field needs a way to generate robot data cheaply, at scale, without sacrificing the physics realism that makes the data useful.

The hypothesis: If simulation is realistic enough (physics + rendering + object diversity), policies trained in sim will transfer to the real world. And once built, a simulator generates unlimited data for free. RoboCasa is a bet on this hypothesis.

Why simulation is hard to get right

Simulation-based robot learning has been tried before. The failures are consistent: simulated environments are too simple (a single tabletop, five objects), tasks are too scripted (stack block A on block B), and the resulting policies fail in real kitchens because they've never seen a cabinet door, a coffee mug among 50 similar objects, or a floor plan wider than two meters.

Domain gap — the mismatch between sim and reality — is the central enemy. The solution isn't to cheat (add sim-to-real tricks to a bad simulator). The solution is to build a simulator realistic and diverse enough that the gap shrinks to begin with.

Interactive figure: The Data Scaling Gap: Robotics vs. Other Fields. A slider for cost per real trajectory (default 3 min) shows how quickly real-world collection becomes prohibitive compared to simulation.

Why can't robotics simply "scrape more data" the way NLP does?

Robots generate too much data; the bottleneck is storage
Neural networks can't process physical sensor data efficiently
Robot trajectories must be physically enacted in the real world: you can't mine them from the internet, so each demo costs minutes of human operator time

Chapter 1: The Kitchen Simulation

The kitchen is not an arbitrary choice. It is arguably the most demanding domestic environment and the most universal one. Every home has a kitchen. A robot that can navigate and manipulate in a kitchen (opening drawers, placing groceries, brewing coffee) has most of the skills that other household tasks call for.

RoboCasa builds its simulation on top of RoboSuite, a MuJoCo-based robot simulator. What it adds is scale: 10 distinct floor plans crossed with 12 architectural styles yield 120 unique base scenes, before any randomization.

10 floor plans: geometry matters

The 10 layouts range from a one-wall kitchen (single strip of cabinets and appliances along one wall — simplest for navigation) to a G-shaped kitchen (counters on three sides plus an island — requires precise spatial reasoning to navigate without collision). Between them are L-shaped, U-shaped, galley, and peninsula configurations.

This variety is deliberate. A policy trained only on a one-wall kitchen learns to always move in a straight line. When placed in a U-shaped kitchen, it collides with the counter on the opposite wall. Floor plan diversity forces policies to learn generalizable navigation.

12 architectural styles: visual diversity

The same L-shaped floor plan appears in 12 distinct styles: Scandinavian (clean white surfaces, light wood), Mediterranean (terracotta tiles, warm tones), Industrial (exposed metal, dark surfaces), Farmhouse (reclaimed wood, apron sink), and eight more. Each style changes wall color, cabinet finish, counter material, and appliance appearance.

Beyond the 12 base styles, RoboCasa adds domain randomization via 400 AI-generated textures — 100 each for walls, floors, counters, and cabinet panels. During training, each episode samples different textures, preventing policies from relying on color as a shortcut.
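A minimal sketch of how per-episode scene randomization could be organized, using only the counts given in the text (10 layouts, 12 styles, 100 textures per surface type). The index-based identifiers and the `sample_episode_config` helper are placeholders for illustration, not RoboCasa's actual API.

```python
import itertools
import random

# Counts from the text; index-based names are placeholders, not RoboCasa identifiers.
NUM_LAYOUTS = 10
NUM_STYLES = 12
TEXTURES_PER_SURFACE = 100
SURFACES = ("wall", "floor", "counter", "cabinet")

# Every (layout, style) pair is one base scene.
BASE_SCENES = list(itertools.product(range(NUM_LAYOUTS), range(NUM_STYLES)))


def sample_episode_config(rng: random.Random) -> dict:
    """Draw one base scene plus one texture index per surface type."""
    layout, style = rng.choice(BASE_SCENES)
    textures = {s: rng.randrange(TEXTURES_PER_SURFACE) for s in SURFACES}
    return {"layout": layout, "style": style, "textures": textures}


rng = random.Random(0)
print(len(BASE_SCENES))            # 120 base scenes
print(sample_episode_config(rng))  # one randomized episode configuration
```

Because textures are drawn independently per surface, the number of distinct visual configurations is far larger than the 120 base scenes.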

Interactable appliances with realistic physics

Every appliance in every kitchen is fully interactive with MuJoCo's physics engine. Cabinet doors swing on hinges. Drawer slides have friction. Microwave doors open and close. Stove knobs rotate — and turning one past its threshold triggers the corresponding burner. Coffee machine buttons depress and trigger brewing animations. None of this is scripted: the robot's gripper must physically interact with each surface correctly.

Why physics fidelity matters: If a drawer opens the instant the robot touches it (scripted), the policy learns "touch the drawer handle." If the drawer has real friction and mass, the policy must learn to pull with the right force and trajectory. Real kitchens have friction. The policy needs to learn from real physics.

Interactive figure: Kitchen Layout Gallery. Shows the floor plan shape of each layout; each can appear in 12 different architectural styles.

RoboCasa generates 120 base kitchen scenes. How?

By manually designing 120 unique kitchens
By crossing 10 floor plans with 12 architectural styles (10 × 12 = 120)
By using a generative model to hallucinate new kitchen layouts on the fly

Chapter 2: AI-Generated Assets

A kitchen without objects is not a kitchen. RoboCasa populates its scenes with 2509 unique 3D objects across 153 categories. That's enough variety that the policy can't memorize appearances — it must generalize.

The objects come from two sources, combined deliberately. Objaverse contributes 917 objects — a curated set from the community 3D object dataset, manually selected for kitchen relevance and physical plausibility. Luma.ai text-to-3D generation contributes 1592 objects — the majority of the dataset.

Text-to-3D pipeline: how it works

The Luma.ai pipeline takes a text description ("red apple, slightly bruised") and generates a textured 3D mesh. RoboCasa's team writes category-level prompts ("generate 20 distinct coffee mugs with varied colors, handles, and sizes"), runs generation, then manually filters out objects with broken geometry, incorrect scale, or implausible physics properties.

The filtering step matters. Text-to-3D models are imperfect: they produce floating geometry, inside-out normals, and physically impossible concavities. Every generated object passes a human review before entering the dataset. The result is a large, diverse, physically usable object library that would have taken years to model by hand.

153 categories: what's in the kitchen

The categories span everything a kitchen actually contains. Produce: apples, bananas, bell peppers, 12 more vegetables. Dairy: milk cartons, cheese wedges, yogurt containers. Proteins: chicken breasts, fish fillets, eggs. Drinks: soda cans, juice bottles, wine glasses. Receptacles: bowls, plates, pots, pans, Tupperware. Tools: spatulas, ladles, measuring cups. Packaged goods: cereal boxes, canned goods, condiment bottles.

Having 153 categories forces generalization across shape and size. A pick-and-place policy can't learn "pick up the object shaped like a cylinder" — it must learn something more general about grasping affordances.

400 textures via MidJourney

Beyond objects, RoboCasa generates 400 surface textures using MidJourney image synthesis: 100 wall paint colors/patterns, 100 floor tile/hardwood designs, 100 counter materials (marble, granite, laminate, butcher block), and 100 cabinet panel finishes. These are applied during episode initialization, so every rollout sees a different visual kitchen configuration.

Why generative AI for assets? Traditional 3D modeling takes 4–8 hours per object from a skilled artist. At $50/hr, 2509 objects would cost roughly $500K–$1.0M in modeling fees alone. Text-to-3D generation with human filtering achieves comparable quality in a fraction of the time and cost, making large-scale diversity economically feasible.
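The hand-modeling estimate can be worked through directly from the quoted rates (4–8 hours per object, $50/hr, 2509 objects). This is only the arithmetic; it says nothing about the actual cost of the text-to-3D pipeline, which the source does not report.

```python
# Hand-modeling cost estimate from the figures in the text.
NUM_OBJECTS = 2509
HOURLY_RATE_USD = 50
HOURS_PER_OBJECT = (4, 8)  # low and high estimates

low, high = (NUM_OBJECTS * h * HOURLY_RATE_USD for h in HOURS_PER_OBJECT)
print(f"${low:,} to ${high:,}")  # $501,800 to $1,003,600
```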

Interactive figure: Object Category Browser. Shows example objects and their source for each category group, including the count difference between Objaverse and AI-generated assets.

What is the primary source of RoboCasa's 2509 objects, and why is human filtering still necessary?

Luma.ai text-to-3D (majority source); text-to-3D models produce broken geometry and physically implausible objects that must be manually rejected
Objaverse (majority source); community-contributed models have inconsistent licensing
Both sources contribute equally; filtering removes duplicate objects

Chapter 3: 100 Tasks — Atomic and Composite

A kitchen scene with 2500 objects and beautiful physics is useless without tasks. What should the robot actually do? RoboCasa answers this with a two-level task hierarchy: 25 atomic tasks covering fundamental manipulation skills, and 75 composite tasks that chain atomic skills into realistic kitchen workflows.

25 atomic tasks across 8 core skills

The atomic tasks are the building blocks. Each tests exactly one manipulation primitive:

Pick & place — grab an object and put it somewhere else (6 variants)
Open/close doors — swing a cabinet or refrigerator door (4 variants)
Open/close drawers — slide a drawer open or shut (2 variants)
Twist knobs — rotate a stove knob to a target angle (3 variants)
Turn levers — actuate a lever-style handle (2 variants)
Press buttons — depress a microwave or coffee machine button (3 variants)
Insertion — fit an object into a specific slot or container (3 variants)
Navigation — move the mobile base to a target region (2 variants)

Each atomic task is fully defined by a reward function, a success condition, and an object placement randomization range. The robot succeeds when the success condition is met — and never before. No partial credit.
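One way to picture an atomic task definition is as a small record holding a success condition and a placement range, with a sparse reward derived from the success check. The `AtomicTask` class, its field names, and the 0.9 drawer threshold below are all hypothetical and chosen for this sketch; they are not the RoboCasa or RoboSuite API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Variant counts per skill, as listed above.
ATOMIC_VARIANTS: Dict[str, int] = {
    "pick_and_place": 6,
    "open_close_doors": 4,
    "open_close_drawers": 2,
    "twist_knobs": 3,
    "turn_levers": 2,
    "press_buttons": 3,
    "insertion": 3,
    "navigation": 2,
}
assert sum(ATOMIC_VARIANTS.values()) == 25


@dataclass
class AtomicTask:
    """Hypothetical container; field names are illustrative only."""
    name: str
    success: Callable[[dict], bool]        # success condition on the sim state
    placement_range: Tuple[float, float]   # object placement randomization (m)

    def reward(self, state: dict) -> float:
        # Sparse reward: 1.0 only when the success condition holds.
        return 1.0 if self.success(state) else 0.0


open_drawer = AtomicTask(
    name="OpenDrawer",
    success=lambda s: s["drawer_opening"] >= 0.9,  # threshold is an assumption
    placement_range=(-0.1, 0.1),
)
print(open_drawer.reward({"drawer_opening": 0.5}))   # 0.0
print(open_drawer.reward({"drawer_opening": 0.95}))  # 1.0
```

Tying the reward to the success check is what "no partial credit" looks like in code: a half-open drawer earns nothing.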

75 composite tasks via LLM guidance

The composite tasks are where things get interesting. Rather than hand-scripting every kitchen workflow, RoboCasa uses GPT-4 and Gemini 1.5 as brainstorming partners. The pipeline has four steps:

Step 1: Activity prompting

Prompt GPT-4: "List 20 realistic kitchen activities a household robot might perform." Output: brewing coffee, washing dishes, preparing a salad, storing groceries, setting the table...

↓

Step 2: Task proposals

For each activity, prompt: "Decompose this into specific robot manipulation steps, each achievable as a sequence of atomic skills." Output: specific object names, skill sequences, success conditions.

↓

Step 3: Human filtering

Robotics engineers review each proposal for logical consistency, physical plausibility, and testable success conditions. ~40% of LLM proposals pass this filter.

↓

Step 4: Code implementation

Engineers implement each task as a RoboSuite task class — reward function, object placement, success checker. Average: 2–3 hours per composite task.

Example composite tasks

PrepareCoffee: open the cabinet → pick up a mug → place mug under coffee machine → press the brew button. Four atomic skills chained in sequence. Failure at any step cascades — a mug not placed correctly means the coffee brews on the counter.

ArrangeVegetables: pick vegetables from the sink basin → place each on the cutting board in the correct region. Requires discriminating between objects (only the vegetables, not the dish soap), precise placement, and multiple sequential pick-and-place cycles.

StorePantryItem: open the pantry door → pick a canned good from the counter → place it on the correct shelf → close the door. Involves navigation, door manipulation, and placement with depth reasoning.
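The examples above can be written down as ordered lists of (skill, target) steps. The sketch below transcribes two of them; the skill labels and the `run_chain` helper are informal names chosen here, not RoboCasa identifiers, and the executor is a stand-in that only illustrates how one failed step ends the whole task.

```python
from collections import Counter

COMPOSITE_TASKS = {
    "PrepareCoffee": [
        ("open_door", "cabinet"),
        ("pick", "mug"),
        ("place", "coffee machine"),
        ("press_button", "brew button"),
    ],
    "StorePantryItem": [
        ("open_door", "pantry"),
        ("pick", "canned good"),
        ("place", "shelf"),
        ("close_door", "pantry"),
    ],
}


def run_chain(steps, execute_skill) -> bool:
    """Run steps in order and stop at the first failure."""
    for skill, target in steps:
        if not execute_skill(skill, target):
            return False
    return True


# How often each skill is reused across composite tasks.
skill_usage = Counter(
    skill for steps in COMPOSITE_TASKS.values() for skill, _ in steps
)
print(skill_usage["pick"])  # 2


# Stand-in executor that fails every button press, to show failures cascading.
def no_buttons(skill, target):
    return skill != "press_button"


print(run_chain(COMPOSITE_TASKS["PrepareCoffee"], no_buttons))    # False
print(run_chain(COMPOSITE_TASKS["StorePantryItem"], no_buttons))  # True
```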

Why LLM guidance? A human engineer could design 75 composite tasks from scratch, but would be limited by their imagination of what "kitchen tasks" means. LLMs trained on recipe blogs, cooking shows, and household manuals have absorbed a much broader concept of kitchen activity. They propose tasks the engineer wouldn't think of — and those edge cases are exactly what makes the benchmark challenging.

Interactive figure: Task Hierarchy Visualizer. Shows which atomic skills each composite task chains together; the same skill appears in many composite tasks.

Why does RoboCasa use LLM guidance (GPT-4, Gemini 1.5) to design composite tasks rather than having engineers design them entirely by hand?

LLMs can automatically write the code for each task, saving implementation time
LLMs are required to validate that tasks are physically feasible in simulation
LLMs trained on cooking and household content propose a broader, more diverse range of realistic kitchen activities than a small team of engineers would generate alone

Chapter 4: MimicGen — Data Generation at Scale

You have 25 atomic tasks and a kitchen simulator. Now you need data. RoboCasa collects 1250 human demonstrations — 50 per task, performed by 4 operators using a SpaceMouse 6-DOF teleoperation device. That's a decent start. It's not enough to train a strong policy.

The key insight: human demos are seeds, not the final dataset. MimicGen is the automated pipeline that takes those 1250 seeds and transforms them into 100,000+ trajectories — an 80× amplification — without a single additional human in the loop.

The core idea: object-centric decomposition

A human demo is recorded as a sequence of end-effector poses: where the gripper was at each timestep. The naive approach would replay this pose sequence in a new scene configuration. The problem: if the apple moved 10cm to the left, the replay trajectory reaches into empty air.

MimicGen's solution is object-centric decomposition. It breaks the demo into segments, where each segment is defined relative to the object being manipulated during that segment — not relative to the world frame.

Segment decomposition step-by-step

Take a pick-and-place demo: the gripper approaches the apple, grasps it, lifts it, moves to the bowl, and releases. MimicGen identifies two interaction points: the grasp event and the release event. These divide the trajectory into segments:

Approach segment: everything from start until just before grasp — expressed in the apple's local frame
Transfer segment: from grasp to just before release — expressed in the bowl's local frame
Release segment: the release and retract motion

Storing segments in object-local frames is the key. The "approach the apple" motion is the same regardless of where the apple is in the kitchen — you just need to know where the apple is, then apply the stored local-frame trajectory relative to that location.
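The local-frame idea comes down to two matrix products on 4×4 homogeneous transforms: store the gripper pose relative to the object, then re-anchor it at the object's new pose. The sketch below uses NumPy; the helper names (`make_pose`, `to_object_frame`, `to_world_frame`) and the example coordinates are made up for illustration and are not taken from MimicGen's code.

```python
import numpy as np


def make_pose(x: float, y: float, z: float, yaw: float = 0.0) -> np.ndarray:
    """4x4 homogeneous transform: rotation about z by yaw, plus translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = [x, y, z]
    return T


def to_object_frame(T_world_ee: np.ndarray, T_world_obj: np.ndarray) -> np.ndarray:
    """Express an end-effector pose relative to the object."""
    return np.linalg.inv(T_world_obj) @ T_world_ee


def to_world_frame(T_obj_ee: np.ndarray, T_world_obj: np.ndarray) -> np.ndarray:
    """Re-anchor a stored object-relative pose at a (possibly new) object pose."""
    return T_world_obj @ T_obj_ee


# Demo: the gripper hovers 10 cm above an apple at (0.5, 0.2, 0.9).
apple_demo = make_pose(0.5, 0.2, 0.9)
gripper_demo = make_pose(0.5, 0.2, 1.0)
stored = to_object_frame(gripper_demo, apple_demo)

# New scene: the apple has moved 10 cm along -y.
apple_new = make_pose(0.5, 0.1, 0.9)
gripper_new = to_world_frame(stored, apple_new)
print(gripper_new[:3, 3])  # approximately [0.5, 0.1, 1.0]: still above the apple
```

Replaying `gripper_demo` unchanged would leave the gripper 10 cm away from the moved apple; replaying `stored` through the new object pose does not.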

Generation: transforming to new object poses

To generate a new trajectory: place the objects at new random positions (within the task's valid region), look up the stored local-frame segments, transform each segment back to world coordinates using the new object pose, and stitch the segments together into a complete trajectory candidate.

The stitching adds a short interpolation between segments to ensure the end-effector makes a smooth transition. The candidate trajectory is then executed in the simulator. If it succeeds — the task completion condition is met — it's added to the dataset. If it fails (collision, dropped object, timeout), it's discarded. This rejection sampling ensures only valid trajectories enter the training set.

Why rejection sampling works: Most generated trajectories succeed. The object-centric approach means the motion is geometrically correct by construction. Failure cases are edge cases: extreme object placements near scene boundaries, or rare collision configurations. Keeping only successes means the training data is clean — no learning from failed demonstrations.
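The keep-only-successes loop can be sketched independently of any simulator. In the toy version below, a "candidate" is just an object x-position and the "rollout" fails near the workspace edges; both stand-ins, along with the `generate_dataset` name and the attempt cap, are assumptions for illustration rather than part of MimicGen.

```python
import random


def generate_dataset(propose, rollout_succeeds, target_size, max_attempts, rng):
    """Keep proposing candidates; retain only those that pass the check."""
    kept, attempts = [], 0
    while len(kept) < target_size and attempts < max_attempts:
        attempts += 1
        candidate = propose(rng)
        if rollout_succeeds(candidate):
            kept.append(candidate)
    return kept, attempts


# Toy stand-ins: a candidate is an object x-position in [0, 1], and the
# rollout fails when the object sits near the edge of the workspace.
def propose(rng):
    return rng.uniform(0.0, 1.0)


def rollout_succeeds(x):
    return 0.1 <= x <= 0.9


rng = random.Random(0)
dataset, attempts = generate_dataset(
    propose, rollout_succeeds, target_size=100, max_attempts=10_000, rng=rng
)
print(len(dataset), attempts)  # 100 kept; attempts also counts the rejects
```

The ratio of kept trajectories to attempts is the generation success rate, which is worth logging per task because it shows where the object-centric replay breaks down.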

Interactive figure: MimicGen Pipeline. Steps through a human demo being decomposed into segments, transformed to new object positions, and validated.

The numbers

Stage	Count	Human time
Task design (25 atomic)	25 tasks	~200 engineer-hours
Human demonstrations	1,250 demos (50/task)	~60 operator-hours
MimicGen generation	100,000+ trajectories	~0 (automated)
Amplification factor	80×	—

python — MimicGen generation loop (simplified)
def generate_trajectory(human_demo, new_object_poses, env):
    """Return one new world-frame trajectory, or None if the rollout fails.

    decompose_demo, transform_segment, interpolate_transitions, and
    env.rollout are placeholder helpers that are not defined here.
    """
    # 1. Decompose the human demo into object-centric segments, e.g.
    #    [{'frame': 'apple_local', 'traj': [...]}, {'frame': 'bowl_local', 'traj': [...]}]
    segments = decompose_demo(human_demo)

    # 2. Transform each segment to the world frame using the new object poses
    world_segments = []
    for seg in segments:
        T_obj = new_object_poses[seg['frame']]  # 4x4 world-from-object transform
        world_segments.append(transform_segment(seg['traj'], T_obj))

    # 3. Stitch segments, interpolating across each segment boundary
    #    (segments stay separate until here so the boundaries are known)
    stitched = interpolate_transitions(world_segments)

    # 4. Execute in the simulator: rejection sampling
    result = env.rollout(stitched)
    if result['success']:
        return stitched  # keep this trajectory
    return None          # discard; the caller retries with new poses

MimicGen expresses trajectory segments in object-local frames rather than world frames. What problem does this solve?

It reduces the file size of stored demonstrations
It makes the motion transferable to new object positions: the same local-frame approach trajectory works regardless of where the object is placed in the scene
It prevents the robot from colliding with walls during execution

Chapter 5: Cross-Embodiment

One of robotics' most inconvenient truths: a policy trained for a Franka Panda arm is useless on a Boston Dynamics Spot arm. Different joint counts, different kinematics, different sensor setups. Most simulators are built around one specific robot. Policies stay locked to that hardware.

RoboCasa is designed from the start to support multiple embodiments — different robot platforms — all operating in the same kitchen environments, on the same tasks, interacting with the same objects.

Three supported platforms

The primary platform is Omni-Frankie: a Franka Panda 7-DOF arm mounted on an Omron mobile base. This is a mobile manipulator — the base can drive around the kitchen while the arm performs manipulation. It's the configuration closest to real-world household robot deployments.

The second platform is a humanoid robot — a bipedal form factor with two arms. Humanoids are increasingly commercial (Figure, Agility Robotics, Boston Dynamics Atlas), and their form factor lets them interact with kitchen environments designed for humans: same counter heights, same drawer handles, same appliance layouts.

The third platform is a quadruped with an arm — a legged robot (like Spot) with a manipulator. Quadrupeds have the most stable locomotion in unstructured environments but face additional challenges: the arm must compensate for the body's motion during locomotion.

What cross-embodiment means for learning

All three robots operate in the same kitchen, on the same tasks. The table is at the same height. The apple is in the same place. The success condition is identical. What differs is the action space: the humanoid controls 28 joints; Omni-Frankie controls 8 (7 arm + 1 base); the quadruped controls 16 (12 leg + 4 arm).
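The action-space differences can be made concrete with the joint counts quoted above. Zero-padding every action to a shared width with a validity mask is one common way to feed several embodiments to a single policy; that padding scheme, and the `pad_action` helper, are assumptions of this sketch and are not described in the source.

```python
import numpy as np

# Joint counts from the text; the humanoid's component split is not given.
ACTION_DIMS = {
    "omni_frankie": {"arm": 7, "base": 1},
    "quadruped": {"legs": 12, "arm": 4},
    "humanoid": {"all_joints": 28},
}

totals = {robot: sum(parts.values()) for robot, parts in ACTION_DIMS.items()}
MAX_DIM = max(totals.values())


def pad_action(action: np.ndarray, robot: str) -> tuple:
    """Zero-pad a robot-specific action to MAX_DIM and return a validity mask.

    Padding to a shared width is an illustrative choice, not the source's method.
    """
    dim = totals[robot]
    assert action.shape == (dim,)
    padded = np.zeros(MAX_DIM)
    padded[:dim] = action
    mask = np.arange(MAX_DIM) < dim
    return padded, mask


padded, mask = pad_action(np.ones(8), "omni_frankie")
print(totals)           # {'omni_frankie': 8, 'quadruped': 16, 'humanoid': 28}
print(int(mask.sum()))  # 8
```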

This forces policies to be embodiment-aware. A policy conditioned on the robot's joint configuration and the task description can — in principle — generalize across embodiments. RoboCasa's shared environment is the testbed where this hypothesis can be empirically evaluated.

The bigger vision: If the same kitchen task dataset trains policies for three different robot types, what's stopping a single policy from controlling all three? Cross-embodiment generalization is one of the field's open problems — and having shared, realistic task environments is the prerequisite for studying it seriously.

Interactive figure: Embodiment Comparison. Compares each robot type's action space, mobility, and key capabilities in the RoboCasa kitchen.

When three different robot embodiments operate in the same RoboCasa kitchen on the same task, what is the key difference between them?

The success condition: each robot has a different definition of task completion
The kitchen layout: each robot gets a different floor plan to match its navigation style
The action space: each robot controls a different number and type of joints, while the environment, objects, and success condition remain identical

Chapter 6: Scaling Results

The central empirical claim of RoboCasa: more generated data produces better policies, and the trend is consistent across dataset sizes. This isn't assumed — it's measured directly by training the same policy architecture on datasets of increasing size and evaluating on the same 25 atomic tasks.

Training setup

All experiments use BC-Transformer from RoboMimic — a standard behavior cloning architecture with a transformer backbone. Inputs: camera images (agentview + wrist) + robot proprioception. Output: end-effector delta poses. Multi-task learning: one model trained jointly on all 25 atomic tasks simultaneously. Evaluation: 50 rollouts per task, averaged across all tasks.

The scaling numbers

Dataset	Demos/task	Total demos	Avg success
Human-50 (baseline)	50	1,250	28.8%
Generated-100	100	2,400	34.2%
Generated-300	300	7,200	39.8%
Generated-3000	3,000	72,000	47.6%

Note: the generated-row totals equal 24 × demos/task rather than 25 × demos/task; the table does not explain the difference.

Each step up in dataset size adds roughly 5 to 8 percentage points of success. From 28.8% to 47.6% is an 18.8 percentage point improvement, a 65% relative increase, from data that costs human time only to set up the pipeline, not to collect.
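The per-step and end-to-end gains can be recomputed from the table. The snippet only restates the reported success rates; it does not fit a scaling law.

```python
# Numbers copied from the table above: (dataset, demos per task, avg success %).
RESULTS = [
    ("Human-50", 50, 28.8),
    ("Generated-100", 100, 34.2),
    ("Generated-300", 300, 39.8),
    ("Generated-3000", 3000, 47.6),
]

# Gain in percentage points at each step up in dataset size.
step_gains = [round(b[2] - a[2], 1) for a, b in zip(RESULTS, RESULTS[1:])]
print(step_gains)  # [5.4, 5.6, 7.8]

first, last = RESULTS[0][2], RESULTS[-1][2]
absolute_gain = round(last - first, 1)
relative_gain = round(100 * (last - first) / first, 1)
print(absolute_gain, relative_gain)  # 18.8 65.3
```

The last step takes ten times more data than the one before it for a gain of 7.8 points, so returns per added demo fall steeply even though the trend keeps rising.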

Skill-level variation

Averaging across all 25 tasks obscures important structure. Some skills are nearly solved at 72K demos; others are barely begun. The hierarchy of difficulty roughly follows the precision required:

Easiest (~60–70%): open/close doors and drawers. The handle is easy to localize visually; the motion is smooth and repeatable. MimicGen transfers well because the door's local frame fully captures the motion.

Medium (~40–50%): press buttons, twist knobs. Require precise endpoint positioning but the motion is short. Visual localization of buttons is harder when multiple buttons are nearby.

Hard (~20–35%): pick-and-place, especially with small or round objects. Grasping requires aligning the gripper precisely with an object whose orientation may vary. The success condition (object in target zone) is strict.

Very hard (~10–20%): insertion tasks. Fitting an object into a slot requires sub-centimeter precision. Even 72K demos leave significant room for improvement.

Interactive figure: Scaling Curve, Dataset Size vs. Policy Success. Plots success against dataset size (demos/task, up to 3000), for the overall average and for individual skill categories.

Going from 50 human demos/task to 3000 generated demos/task improves average success from 28.8% to 47.6%. Why do insertion tasks remain the hardest even at 72,000 total demos?

Insertion tasks aren't included in the 25 atomic tasks
Insertion requires sub-centimeter precision: the BC-Transformer policy doesn't have enough spatial resolution in its action representation to reliably achieve this even with abundant data
MimicGen cannot generate valid insertion trajectories, so only the 50 human demos are used

Chapter 7: Composite Tasks & Fine-Tuning

Atomic tasks are hard. Composite tasks are brutal. Where an atomic task requires one skill executed correctly, a composite task requires 3–6 skills executed correctly in sequence. Any failure cascades: a mug not placed under the coffee machine means pressing the brew button succeeds in simulation but fails in purpose.

How bad is the baseline?

Training BC-Transformer from scratch on 50 human demonstrations per composite task (the same setup that gives 28.8% on atomic tasks) produces near-zero performance on most composite tasks. On 4 out of 5 evaluated tasks, the success rate is literally 0%. On the fifth (a shorter, simpler chain), it's 2%.

This isn't surprising: it follows directly from error compounding. If each of 4 sub-skills has a 70% success rate (optimistic for a 50-demo policy), the probability of completing all 4 in sequence is 0.7⁴ ≈ 24%. If each sub-skill is at 50%, the chain success is 0.5⁴ = 6.25%. Composite tasks amplify every individual skill weakness.

$$P(\text{composite success}) = \prod_{i=1}^{K} P(\text{skill}_i \text{ succeeds} \mid \text{skills } 1, \dots, i-1 \text{ succeeded})$$
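The compounding effect is easy to check numerically. The sketch below multiplies per-step success probabilities and also inverts the relation to show what per-step reliability a target chain success would require; the 90% target is an arbitrary example, not a figure from the source.

```python
from math import prod


def chain_success(step_probs):
    """Probability that every step in a sequential chain succeeds.

    Each entry is the success probability of a step given that all
    earlier steps succeeded.
    """
    return prod(step_probs)


print(round(chain_success([0.7] * 4), 4))  # 0.2401
print(round(chain_success([0.5] * 4), 4))  # 0.0625


def required_per_step(target, num_steps):
    """Per-step success rate needed for a target chain success (equal steps)."""
    return target ** (1 / num_steps)


print(round(required_per_step(0.9, 4), 3))  # 0.974
```

A four-step task at 90% overall success needs each step above 97%, far beyond the atomic success rates reported earlier.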

Fine-tuning on atomic tasks helps

The key finding: pre-training a policy on all 25 atomic tasks (using the 72K generated dataset) and then fine-tuning on composite task demos substantially improves performance. ArrangeVegetables goes from 2% (scratch) to 12% (fine-tuned). PrepareCoffee goes from 0% to 6%.

Why does this work? The atomic pre-training gives the policy strong primitives — reliable grasping, precise placement, correct button pressing. The fine-tuning on composite demos teaches the policy how to chain these primitives and recover from partial failures. The policy isn't learning grasping from scratch during composite training — it already knows how to grasp.

The transfer learning analogy: Pre-training on ImageNet and fine-tuning on your specific dataset often outperforms training from scratch on your dataset alone, especially when that dataset is small. The same logic applies here: atomic task pre-training is the "ImageNet" of robot manipulation. Fine-tuning provides task-specific chaining. The pre-trained features are already good; you're just teaching the policy when and how to use them.

Still far from solved

Even 12% success on ArrangeVegetables means failure 88% of the time. The authors are explicit: composite task performance is a major open problem, not a solved one. RoboCasa's contribution is not solving composite tasks — it's providing the benchmark and the data infrastructure for the field to make progress on them.

Interactive figure: Composite Task Results, Scratch vs. Fine-Tuned. Compares training from scratch with fine-tuning from the atomic-task pre-trained model across 5 composite tasks.

Why does pre-training on 25 atomic tasks (with generated data) and then fine-tuning on composite task demos outperform training on composite tasks from scratch?

Atomic pre-training gives the policy reliable individual manipulation skills; fine-tuning then teaches only the chaining logic, rather than learning both skills and chaining simultaneously from limited composite demos
MimicGen can generate composite task trajectories more efficiently after atomic pre-training
The composite tasks are a strict subset of the atomic tasks, so transfer is guaranteed

Chapter 8: Sim-to-Real Transfer

All the work so far — 120 kitchen scenes, 2509 objects, 100 tasks, 100K trajectories — is worthless if the policies don't transfer to reality. A robot that works in simulation but fails on a real kitchen counter is not a useful robot.

RoboCasa's sim-to-real experiment is direct: does co-training with sim data improve policies trained on real-world demonstrations?

Experimental setup

The real-world setup is a physical kitchen equipped with a Franka Panda arm on DROID hardware — the same physical robot platform used by several major research labs. Three pick-and-place tasks are evaluated: pick tomato from bowl, pick can from cabinet, pick bread from plate.

For each task, 50 real-world demonstrations are collected by human operators. This is the real-only baseline: a policy trained purely on 50 real demos, evaluated on the same task with both seen objects (training objects) and unseen objects (held-out objects the policy never saw during training).

The real+sim condition adds RoboCasa simulation data to the training mix. The policy is co-trained on the combined dataset — real demos for their physical realism, sim demos for diversity and volume.
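Co-training on a small real set and a large sim set usually needs some rule for mixing the two within a batch. The sketch below fixes the share of real demos per batch and samples with replacement; the `real_fraction` knob, the 50/50 split, and the sim set size are hypothetical, since the source does not state the mixing ratio used.

```python
import random


def sample_cotraining_batch(real_demos, sim_demos, batch_size, real_fraction, rng):
    """Draw a batch with a fixed share of real demos; the rest comes from sim.

    real_fraction is a hypothetical knob; the source gives no mixing ratio.
    """
    num_real = round(batch_size * real_fraction)
    num_sim = batch_size - num_real
    # Sample with replacement so the small real set can be oversampled.
    batch = rng.choices(real_demos, k=num_real) + rng.choices(sim_demos, k=num_sim)
    rng.shuffle(batch)
    return batch


real = [("real", i) for i in range(50)]   # 50 real demos per task, as in the text
sim = [("sim", i) for i in range(3000)]   # stand-in for the much larger sim set
batch = sample_cotraining_batch(
    real, sim, batch_size=32, real_fraction=0.5, rng=random.Random(0)
)
print(sum(1 for source, _ in batch if source == "real"))  # 16
```

Without a fixed share, uniform sampling over the combined pool would put fewer than one real demo in a typical batch at these sizes.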

The numbers: co-training pays off

Condition	Seen objects	Unseen objects
Real-only (50 demos)	13.6%	2.6%
Real + Sim (co-trained)	24.4%	9.3%
Relative improvement	+79%	+258%

The seen-object improvement (13.6% → 24.4%) is significant but the unseen-object improvement is striking: 2.6% → 9.3%, a 258% relative increase. This is the clearest signal that RoboCasa's diversity is doing real work. The sim data exposes the policy to 2509 different objects; the policy learns more general grasping representations rather than memorizing the appearance of the 3 training objects.
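The relative improvements follow directly from the table entries; the snippet below recomputes them along with the absolute gains in percentage points.

```python
def relative_improvement(before: float, after: float) -> float:
    """Percent change from before to after."""
    return 100 * (after - before) / before


seen = relative_improvement(13.6, 24.4)
unseen = relative_improvement(2.6, 9.3)
print(round(seen), round(unseen))  # 79 258

# Absolute gains in percentage points, for comparison.
print(round(24.4 - 13.6, 1), round(9.3 - 2.6, 1))  # 10.8 6.7
```

In absolute terms the seen-object gain is the larger one (10.8 vs. 6.7 points); the unseen-object figure looks dramatic partly because its baseline is so low.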

Why does sim-to-real work at all?

The honest answer is: imperfectly, and for specific reasons. RoboCasa's physics (MuJoCo contact dynamics) is not identical to real-world physics. Colors and textures in sim differ from reality. But enough features transfer: relative object positions, approach angles, gripper closure timing.

Domain randomization is the key mechanism. Because training sees 400 different textures and 2509 different objects, the policy cannot rely on any specific visual feature. It must learn invariances — features that predict success across all these variations. Those invariances are more likely to transfer to the real world than features specific to one simulated kitchen.

The practical implication: A lab that can't afford thousands of real demonstrations can collect 50 real demos, download RoboCasa's 100K sim trajectories, co-train, and expect substantially better real-world performance than 50 demos alone would give. The sim data is free. The improvement is real. This is the paper's most actionable result.

Interactive figure: Sim-to-Real Results Comparison. Compares the real-only and co-trained policies on seen and unseen objects.

The co-trained (real+sim) policy improves more on unseen objects (+258%) than seen objects (+79%). What does this tell us about what the sim data is providing?

The sim data makes the policy faster at inference time, which helps more with novel objects
The sim data contains the exact unseen test objects, so the policy has seen them before
The sim data's object diversity trains more general visual representations: the policy learns grasping features that generalize across object appearances, not just the specific training objects

Chapter 9: Connections & What's Next

RoboCasa sits at the intersection of several active research streams. Understanding where it fits helps you know which papers to read next and which limitations to watch for.

RoboCasa's lineage

RoboCasa is built directly on RoboSuite (Zhu et al., 2020), the MuJoCo-based robot simulation framework from NVIDIA Research and UT Austin, developed by an overlapping group of authors. RoboSuite provides the interface to the MuJoCo physics engine, the robot models, and the task infrastructure. RoboCasa adds scale: larger environments, more objects, more tasks, and the generative AI pipeline.

MimicGen (Mandlekar et al., 2023) is the trajectory augmentation backbone. RoboCasa uses MimicGen as a component — the contribution is not MimicGen itself but combining MimicGen with a large enough environment that the generated data is meaningfully diverse.

How it compares to related work

Simulator	Scale	Physics	Dataset	Key limitation
RoboCasa	120 scenes, 2509 obj	MuJoCo (high)	100K+ demos	Kitchen only
MetaWorld	50 tasks, minimal scenes	MuJoCo	Small	Tabletop only, no real objects
LIBERO	4 scene suites	MuJoCo	~500 demos	Small dataset, limited object diversity
AI2-THOR	120 scenes	Unity (lower fidelity)	None	No large-scale dataset, lower physics fidelity
Habitat 3.0	Large-scale	Bullet	Navigation focus	Limited manipulation tasks

Implications for VLA training

Vision-Language-Action models like pi-0 and OpenVLA are hungry for diverse robot data. Their performance scales with dataset diversity and volume — exactly what RoboCasa provides. The practical path forward: use RoboCasa's 100K sim trajectories as a pre-training corpus, then fine-tune on real-world task demonstrations. This is the co-training recipe from Chapter 8, applied at VLA scale.

The kitchen is also an ideal domain for language grounding. "Put the apple in the bowl" is a natural-language instruction that maps directly to a pick-and-place task. "Brew me a coffee" maps to PrepareCoffee. RoboCasa's tasks are natural candidates for language-conditioned policy training.

Open challenges RoboCasa does not solve

Composite task performance. 12% on ArrangeVegetables is not deployment-ready. Long-horizon task completion remains an open problem even with unlimited simulation data.

Domain gap persistence. 24.4% real-world success on pick-and-place is still failure about 76% of the time. For household deployment, we need 95%+. Better sim-to-real methods (adaptive domain randomization, learned perception modules) are needed.

Kitchen-only scope. The real world has living rooms, bathrooms, garages. A generalist household robot needs simulation environments beyond the kitchen — RoboCasa's methodology will need to be applied to these spaces.

The deeper contribution: RoboCasa's most durable contribution may not be any specific result — it may be the methodology. Text-to-3D for object generation. LLMs for task design. MimicGen for data amplification. Domain randomization at scale. These techniques compose. The next simulation framework will be built on these tools, not starting from scratch.

Where to go next

MimicGen paper

Mandlekar et al. (2023) — The trajectory augmentation method at RoboCasa's core. Understand it deeply to understand RoboCasa's data pipeline. arXiv: 2310.17596

↓

RoboSuite

Zhu et al. (2020) — The physics simulation foundation. Understanding RoboSuite's task API explains how RoboCasa's 100 tasks are implemented. Website: robosuite.ai

↓

pi-0 / OpenVLA

VLA models that RoboCasa-style sim data can augment. See the pi-0 and OpenVLA Veanors lessons for how these models ingest robot data.

↓

DROID dataset

Khazatsky et al. (2024) — The real-world dataset used in RoboCasa's sim-to-real experiments. 76K+ real trajectories across 564 scenes. arXiv: 2403.12945

"What I cannot create, I do not understand." — Richard Feynman. In robotics: what a robot cannot simulate, it cannot learn. RoboCasa is an attempt to close that gap.
