The first open benchmark that evaluates generative world models by what actually matters — whether they help embodied agents succeed at real tasks, not how pretty their videos look.
You have a state-of-the-art video generation model: Sora, Wan, Cosmos, Runway Gen4. It can synthesize photorealistic, temporally coherent videos of indoor scenes, outdoor environments, even robotic manipulation sequences. The visual quality is stunning: low FID (lower is better), strong aesthetic scores, and it passes VBench with flying colors.
Now you give this model to an embodied agent — a robot navigating a house, an arm sliding a block onto a target. The agent uses the model to imagine "what happens if I turn left?" before committing to an action. This is model-based planning: simulate possible futures, pick the best one, act.
Here is the uncomfortable question nobody was answering: do these beautiful imagined futures actually help the agent succeed?
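Before World-In-World, every benchmark for world models worked the same way: the model generates a video once from a fixed input, and that output is scored on visual quality.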
This is open-loop evaluation. The agent never acts on its predictions. There is no environment feedback. The model generates one video and you judge how pretty it looks.
Consider a concrete example. An agent is in Habitat-Sim, tasked with identifying a heavily occluded object. It needs to decide: turn left or turn right? It uses a world model to imagine both futures. The model that generates controllable, physically consistent videos (even if slightly blurry) helps the agent see around the occlusion. The model that generates beautiful but uncontrollable videos (ignoring the intended action, reverting to its web-video training priors) leaves the agent blind.
Which model wins on VBench? The beautiful one. Which model actually helps the agent? The controllable one.
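Closed-loop evaluation works differently: the agent imagines candidate futures with the world model, acts on the most promising one, observes what really happens, and replans.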
The metric is not "how realistic was the imagined future?" It is "did the agent succeed at its task?" Success rate, path efficiency, answer accuracy. The things that matter in the real world.
Before World-In-World, these benchmarks existed:
None of them asked the fundamental question: do world models help agents succeed at embodied tasks in a closed-loop setting?
Interactive figure: open-loop versus closed-loop evaluation. In open-loop, the model generates once and is scored on visual quality. In closed-loop, the agent acts, observes, and replans, and success is measured by task completion.
World-In-World's central finding can be stated in one sentence: visual quality and task success are poorly correlated — controllability is what actually predicts embodied performance.
This is surprising. You would expect that the best-looking world model produces the most useful imagined futures. After all, if the predicted frame perfectly matches reality, the agent can plan on it as if it were ground truth. But "perfectly matches reality" requires two things: (1) the image looks realistic, and (2) the image accurately reflects the intended action. Most current video generators nail (1) and fail at (2).
Figure 2 of the paper is the money shot. The authors plot task success rate (SR) against generation quality (averaged aesthetic + image quality scores from VBench) for every model they benchmark. If visual quality predicted task success, you would see a tight positive correlation — points clustering along a diagonal from bottom-left to top-right.
Instead, the points form a diffuse cloud. Runway Gen4, a proprietary model with the highest visual quality, also achieves the highest success rate (~65%). But Cosmos-P2 has lower visual quality than several models that outperform it on tasks. Post-trained models (marked with †) consistently outperform their zero-shot base versions despite often having comparable or slightly lower visual quality scores.
The paper defines controllability as the alignment between intended actions and the motions actually depicted in the model's predictions. It is quantified as 1 − LPIPS between ground-truth and predicted observations, so higher is better. LPIPS (Learned Perceptual Image Patch Similarity) measures perceptual distance: a lower LPIPS means the predicted frame looks more like what should have happened given the action.
When you plot SR against controllability instead of visual quality, the correlation is dramatically tighter (Figure 5b in the paper). Models that faithfully translate actions into visual predictions consistently achieve higher task success. This makes perfect sense: if the agent asks "what happens when I turn left?" and the model shows what happens when you turn right, the generated frame can be stunning — but the plan is garbage.
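To make the 1 − LPIPS definition concrete, here is a minimal sketch of how a controllability score could be computed over a rollout. It is an illustration under stated assumptions, not the paper's evaluation code: frames are flattened lists of values in [0, 1], the function names are hypothetical, and `mean_abs_distance` is a simple stand-in that is not LPIPS. A real evaluation would pass a learned LPIPS model as the `distance` argument.

```python
from typing import Callable, Sequence

Frame = Sequence[float]  # assumption: a flattened image with values in [0, 1]


def mean_abs_distance(a: Frame, b: Frame) -> float:
    """Stand-in distance in [0, 1]. This is NOT LPIPS; swap in a real LPIPS model."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def controllability(
    predicted: Sequence[Frame],
    ground_truth: Sequence[Frame],
    distance: Callable[[Frame, Frame], float] = mean_abs_distance,
) -> float:
    """Average of (1 - distance) over aligned predicted and ground-truth frames."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty rollouts of equal length")
    scores = [1.0 - distance(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(scores) / len(scores)
```

With this stand-in, a prediction that tracks the ground truth closely scores near 1, while a prediction that shows the opposite motion scores near 0, regardless of how sharp either frame looks.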
World-In-World is built on three principles:
Interactive figure: two scatter plots, SR vs. visual quality (weak correlation) and SR vs. controllability (strong correlation). Each dot is a world model.
World-In-World's planning framework has three phases that cycle continuously: propose, simulate, revise. Think of it as the agent asking three questions in a loop:
After execution, the agent receives a new real observation and the cycle repeats. This is the closed loop: imagination informs action, action produces reality, reality updates imagination.
At time step t, the agent has observation o_t (an egocentric RGB or RGB-D image) and task goal g (e.g., "identify the object in the red bounding box" or a target image for navigation). The proposal policy samples M candidate action sequences:
Each Â_t^(m) = [â_{t+1}, â_{t+2}, ..., â_{t+L}] is a sequence of L elementary actions. Each â is drawn from the agent's action vocabulary V (e.g., {move_forward, turn_left, turn_right, stop} for navigation, or continuous 7-DoF gripper commands for manipulation).
Each candidate plan must be translated into the format the world model expects. This is where the unified action API C enters (Chapter 3). The transformed control input is:
The world model then generates a counterfactual rollout — a sequence of predicted future observations:
Here g_θ can be any visual world model: SVD (Stable Video Diffusion), Wan2.1, Cosmos-Predict2, LTX-Video, etc. The model takes the current real observation o_t and the control input I_t^(m), and outputs L predicted frames. These frames represent "what the agent would see" if it executed plan m.
The revision policy scores all M simulated rollouts and picks the best one:
The simplest instantiation is score-and-select: compute a task-specific score S for each rollout and pick the argmax:
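It may help to see the notation of the three phases collected in one place. The following is a reconstruction from the symbols used in the surrounding text, not a quotation from the paper. The names \pi_{\text{proposal}} and \hat{O}_t^{(m)} (the predicted observation sequence for plan m) are assumed here, as is the choice to pass the goal g to the score S.

```latex
\begin{align}
\hat{A}_t^{(m)} &\sim \pi_{\text{proposal}}(\,\cdot \mid o_t, g), \qquad m = 1, \dots, M
  && \text{(propose)} \\
I_t^{(m)} &= C\big(\hat{A}_t^{(m)}\big)
  && \text{(translate via the action API)} \\
\hat{O}_t^{(m)} &= g_\theta\big(o_t, I_t^{(m)}\big)
  = [\hat{o}_{t+1}, \dots, \hat{o}_{t+L}]
  && \text{(simulate)} \\
m^\star &= \arg\max_{m \in \{1, \dots, M\}} S\big(\hat{O}_t^{(m)}, g\big)
  && \text{(score and select)}
\end{align}
```

The agent then executes \hat{A}_t^{(m^\star)}, receives a new real observation, and the cycle starts again.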
What is S? It depends on the task:
Let's trace one complete cycle for Active Recognition:
Without the world model, the base VLM policy just greedily picks the action with highest immediate expected reward at each step. It might wander aimlessly — taking 8 steps instead of 3, or going the wrong direction entirely.
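The propose, simulate, revise cycle can be sketched in a few lines. This is a toy illustration under loud assumptions, not the benchmark's implementation: the proposal policy is random sampling rather than a VLM, the "world" is a one-dimensional line where only `move_forward` changes the position, and the toy world model happens to be perfectly accurate. The choice to execute only the first action of the selected plan before replanning is also an assumption made for this sketch.

```python
import random
from typing import Callable, List, Optional

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]
Plan = List[str]


def propose(num_plans: int, horizon: int, rng: random.Random) -> List[Plan]:
    """Proposal stand-in: sample M random action sequences of length L."""
    return [[rng.choice(ACTIONS) for _ in range(horizon)] for _ in range(num_plans)]


def plan_one_step(
    observation: int,
    goal: int,
    world_model: Callable[[int, Plan], List[int]],
    score: Callable[[List[int], int], float],
    num_plans: int = 4,
    horizon: int = 3,
    rng: Optional[random.Random] = None,
) -> Plan:
    """One propose -> simulate -> revise pass; returns the selected plan."""
    if rng is None:
        rng = random.Random(0)
    candidates = propose(num_plans, horizon, rng)                    # propose
    rollouts = [world_model(observation, p) for p in candidates]     # simulate
    scores = [score(r, goal) for r in rollouts]                      # revise
    best = max(range(len(candidates)), key=lambda m: scores[m])
    return candidates[best]


def toy_world_model(position: int, plan: Plan) -> List[int]:
    """Toy dynamics: only move_forward changes the 1-D position."""
    frames = []
    for action in plan:
        if action == "move_forward":
            position += 1
        frames.append(position)
    return frames


def toy_score(rollout: List[int], goal: int) -> float:
    """Higher is better: negative distance between the last imagined state and the goal."""
    return -abs(rollout[-1] - goal)


def run_episode(start: int, goal: int, budget: int = 10, seed: int = 0) -> int:
    """Closed loop: plan, execute the first action, observe, replan."""
    rng = random.Random(seed)
    position = start
    for _ in range(budget):
        if position == goal:
            break
        plan = plan_one_step(position, goal, toy_world_model, toy_score, rng=rng)
        # The "real" environment reuses the toy dynamics here for simplicity.
        position = toy_world_model(position, plan[:1])[-1]
    return position
```

In a real setup the world model would be imperfect, which is exactly why the loop re-observes after every executed action instead of trusting a long imagined rollout.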
Here is the problem the action API solves: the agent speaks one language ("turn left, move forward 0.6m"), but each world model speaks a different language. Some want text prompts. Some want camera trajectories as (x, y, φ) tuples. Some want raw low-level action codes. Without a translator, you would need to rewrite the entire planning pipeline for every world model you want to benchmark.
The unified action API C is that translator. It maps the agent's action sequence A into the control inputs I that the specific world model expects:
This single interface lets the same agent, same proposal policy, and same revision policy work with any world model. Swap in SVD, Wan2.1, Cosmos-P2, Runway Gen4, or any future model — only C changes, not the planning logic.
C supports three output formats, matched to the three types of conditioning that current world models accept:
For models like Wan2.1, LTX-Video, and Hunyuan, the controller converts each primitive action into a descriptive phrase using a predefined template, then concatenates them:
The precision problem is immediately visible: "moves forward by 0.2m" is semantically clear to a human, but the video model has no grounding for what 0.2m looks like in this specific scene. It may generate a plausible-looking forward motion that is actually 0.5m or 0.1m. This is why text-conditioned models tend to have lower controllability.
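A template-based text controller along these lines might look as follows. This is a hedged sketch: the template wording, the "The camera" prefix, and the default step size and turn angle are assumptions chosen for illustration, not the paper's actual templates.

```python
from typing import Sequence

# Hypothetical phrase templates, one per primitive action.
TEMPLATES = {
    "move_forward": "moves forward by {step:.1f}m",
    "turn_left": "turns left by {angle:.0f} degrees",
    "turn_right": "turns right by {angle:.0f} degrees",
    "stop": "stops",
}


def actions_to_prompt(
    actions: Sequence[str], step: float = 0.2, angle: float = 30.0
) -> str:
    """Turn each primitive action into a phrase, then concatenate into one prompt."""
    phrases = [TEMPLATES[a].format(step=step, angle=angle) for a in actions]
    return "The camera " + ", then ".join(phrases) + "."
```

For example, `actions_to_prompt(["move_forward", "turn_left"])` produces "The camera moves forward by 0.2m, then turns left by 30 degrees." Nothing in that string ties "0.2m" to the scale of the scene, which is the precision problem described above.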
For models like SE3DS, PathDreamer, and NWM (Navigation World Model) that explicitly consume camera poses, the controller translates each action into a geometric transformation:
Camera-conditioned models have an inherent advantage for controllability: the geometric transformation is exact. There is no ambiguity about what "0.2m forward" means — it is a precise translation vector. The model may still hallucinate scene content, but the camera motion itself is accurately specified.
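For comparison, a camera-trajectory controller can integrate the same primitives into exact (x, y, φ) poses. The sketch below assumes a planar pose with the heading measured counter-clockwise from the +x axis, a starting pose at the origin, and illustrative step and turn sizes; the function name and these conventions are hypothetical and each real model defines its own.

```python
import math
from typing import List, Sequence, Tuple

Pose = Tuple[float, float, float]  # (x, y, phi), with phi in radians


def actions_to_poses(
    actions: Sequence[str],
    step_m: float = 0.2,
    turn_deg: float = 30.0,
    start: Pose = (0.0, 0.0, 0.0),
) -> List[Pose]:
    """Integrate primitive actions into a camera pose after each action."""
    x, y, phi = start
    turn = math.radians(turn_deg)
    poses: List[Pose] = []
    for action in actions:
        if action == "move_forward":
            x += step_m * math.cos(phi)
            y += step_m * math.sin(phi)
        elif action == "turn_left":
            phi += turn
        elif action == "turn_right":
            phi -= turn
        elif action != "stop":
            raise ValueError(f"unknown action: {action}")
        poses.append((x, y, phi))
    return poses
```

Unlike the text prompt, every entry here is a number the model can consume directly, so "0.2m forward" is a fixed translation and leaves nothing to interpret.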
For models like SVD† (post-trained) that directly consume discrete or continuous actions, the controller maps the agent's action vocabulary to the world model's action vocabulary:
This mapping handles vocabulary mismatches: the agent might use {forward, left, right, stop} while the model was trained with {action_0, action_1, action_2, action_3}. The API maintains a bijection between the two vocabularies.
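The vocabulary mapping is simple enough to sketch directly. The particular pairing below (for instance, `forward` to `action_0`) is an assumption for illustration; the real pairing depends on how each model was trained.

```python
from typing import Dict, List, Sequence

# Hypothetical pairing between the agent's vocabulary and the model's action codes.
AGENT_TO_MODEL: Dict[str, str] = {
    "forward": "action_0",
    "left": "action_1",
    "right": "action_2",
    "stop": "action_3",
}


def invert_bijection(mapping: Dict[str, str]) -> Dict[str, str]:
    """Return the inverse mapping, failing loudly if the mapping is not one-to-one."""
    inverse = {code: name for name, code in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapping is not a bijection: two actions share a code")
    return inverse


def translate(actions: Sequence[str], mapping: Dict[str, str]) -> List[str]:
    """Map a sequence from one vocabulary to the other."""
    return [mapping[a] for a in actions]
```

Checking invertibility up front catches the case where two agent actions would silently collapse onto the same model code.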
For robotic manipulation, the mapping is more complex: the agent proposes a 7-DoF gripper trajectory [(x, y, z, roll, pitch, yaw, grip)] over steps 1 to L, and the API may need to discretize or re-parameterize this to match the world model's expected format.
World-In-World tests world models on four tasks, each stressing a different capability. Together they cover the full spectrum of embodied intelligence: perception, navigation, language reasoning, and manipulation.
Task 1: Active Recognition (AR). Environment: Habitat-Sim on Matterport3D scenes (29 scenes, 551 episodes).
Goal: Identify a designated target object that is heavily occluded or viewed from an extreme angle. The agent can move to get a better view.
Action space: Navigation primitives: {move_forward, turn_left, turn_right, stop}.
Budget: K = 10 decision steps maximum.
Observation: RGB image (panoramic + front view) with the target marked by a red bounding box.
How the world model helps (two ways):
Metrics: Success Rate (SR) = fraction of episodes where the final predicted label matches the ground-truth label. Mean Trajectory Length = average number of steps before the agent makes its final prediction or exhausts the budget K.
Task 2: Image-Goal Navigation (ImageNav). Environment: Habitat-Sim on HM3D scenes (87 scenes, 144 episodes).
Goal: Navigate to the location from which a given goal image was captured.
Action space: Same navigation primitives.
Observation: RGB image (current view).
Goal input: A single reference RGB image showing the target viewpoint.
How the world model helps: The agent simulates candidate navigation plans and compares each predicted final frame against the goal image using LPIPS. The plan whose predicted outcome looks most like the goal image is selected. This is pure perceptual planning — the world model acts as a visual lookahead.
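The selection rule for this task can be sketched as "pick the plan whose last imagined frame is closest to the goal image". As before, this is an illustration under assumptions: frames are flattened lists of values in [0, 1], `select_plan` is a hypothetical name, and `mean_abs_distance` is a placeholder that is not LPIPS.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # assumption: a flattened image with values in [0, 1]


def mean_abs_distance(a: Frame, b: Frame) -> float:
    """Placeholder distance. This is NOT LPIPS; pass a real LPIPS model instead."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_plan(
    rollouts: Sequence[Sequence[Frame]],
    goal_image: Frame,
    distance: Callable[[Frame, Frame], float] = mean_abs_distance,
) -> int:
    """Return the index of the rollout whose final frame is closest to the goal."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    final_distances: List[float] = [distance(r[-1], goal_image) for r in rollouts]
    return min(range(len(rollouts)), key=lambda m: final_distances[m])
```

Because only the final predicted frame is compared, a world model that ignores the commanded motion gives every plan roughly the same score and the selection becomes arbitrary.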
Metrics: SR = does the agent reach within a threshold distance of the goal? SPL = Success weighted by Path Length (penalizes inefficient paths). Mean Trajectory Length.
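For reference, SR and SPL can be computed as below. The SPL formula is the standard one from the embodied navigation literature: each successful episode contributes the ratio of the shortest-path length to the larger of the shortest-path and actual path lengths, and failed episodes contribute zero. The function and argument names are illustrative, and the values here are fractions in [0, 1] rather than percentages.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes that ended in success."""
    if not successes:
        raise ValueError("need at least one episode")
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    n = len(successes)
    if n == 0 or n != len(shortest_lengths) or n != len(path_lengths):
        raise ValueError("need equally sized, non-empty episode lists")
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        if ok and denom > 0:
            total += shortest / denom
    return total / n
```

An agent that succeeds but takes a path twice as long as necessary earns only half credit for that episode, which is how SPL penalizes wandering.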
Task 3: Active Embodied Question Answering (A-EQA). Environment: Habitat-Sim on HM3D scenes (54 scenes, 184 questions from OpenEQA).
Goal: Answer open-ended natural language questions (e.g., "How many cushions are on the red sofa?") by actively exploring a 3D environment.
Action space: Navigation primitives.
Observation: RGB + panoramic view.
How the world model helps: Both perception and planning, as in AR. For answering: simulated views help the VLM see objects from complementary angles to resolve references to occluded or distant objects. For navigation: rollouts guide the agent to explore views likely to reveal question-relevant information.
Metrics: Answer Score (GPT-judged, 0-1 scale measuring answer quality), Mean Trajectory Length, SPL.
Task 4: Robotic Manipulation. Environment: CoppeliaSim on RLBench tasks (4 tasks, 50 episodes each).
Goal: Control a 7-DoF robotic arm to complete manipulation tasks like "slide the red block onto the blue target."
Action space: Continuous 7-DoF gripper commands [(x, y, z, roll, pitch, yaw, grip)].
Observation: Third-person RGB images of the workspace.
How the world model helps: The agent generates candidate gripper trajectories, the world model predicts the visual outcome of each trajectory (how objects will move), and the agent selects the trajectory most likely to achieve the specified objective. This requires the world model to understand physical dynamics — contact, friction, sliding.
Metrics: SR = task completion rate. Mean Trajectory Length.
Interactive figure: per-task setup (environment, action space, how the world model helps, key results), with bars showing success rates with and without the world model.
Off-the-shelf video generators like Wan2.1 produce visually appealing clips, but they are driven by text prompts and have limited fine-grained low-level control. Without adaptation, they yield only small gains on downstream embodied tasks. The post-training recipe fixes this by fine-tuning pretrained video generators on action-observation data from the target environment.
The core idea: take a powerful visual prior (from web-scale video pretraining) and teach it to respond precisely to embodied actions. The generator keeps its ability to produce realistic scenes but gains the controllability needed for closed-loop planning.
The data is collected from the same simulators used for evaluation, but from disjoint scenes — the training and evaluation environments never overlap. This ensures the world model learns generalizable action-conditioned representations, not memorized scene layouts.
The recipe is straightforward LoRA fine-tuning of the video diffusion model:
This is the architecture of a post-trained world model:
The elegant part: the frozen backbone provides rich visual understanding, while the tiny LoRA adaptor provides action controllability. This separation means post-training is cheap — 1 epoch on 40K examples is sufficient — and the visual quality from pretraining is largely preserved.
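The "frozen backbone plus small adapter" idea is easiest to see on a single linear layer. The sketch below shows the generic LoRA update, where a frozen weight W is augmented by a scaled low-rank product B·A. It is not the paper's training code: the rank, the scaling factor `alpha`, and which layers receive adapters are not specified here and are assumptions.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable.

    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    rank = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Adapter parameter count, compared with d_out * d_in for full fine-tuning."""
    return rank * (d_in + d_out)
```

With B initialized to zero, the adapted layer starts out identical to the pretrained one, which is one reason the visual prior survives. For a 1024 by 1024 layer at rank 8, `trainable_params` gives 16,384 adapter weights against 1,048,576 in the frozen matrix.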
The paper also tests whether post-training transfers across scene distributions. Models post-trained on the synthetic Habitat Synthetic Scenes Dataset (HSSD) and tested on real-scanned HM3D/MP3D scenes still show clear gains:
The synthetic-to-real gap exists but is smaller than you might expect. Post-training learns action-conditioned visual representations that transfer — consistent with prior work on adaptable world models (Gao et al., 2025).
Interactive figure: controllability (1 − LPIPS) and task success rate as the amount of post-training data varies.
The paper evaluates 6+ video generators, 3+ task-focused world models, and their post-trained variants across all four tasks. Let's walk through the results systematically.
Here are the key numbers from Tables 1-3, grouped by insight:
Across AR, A-EQA, ImageNav, and Manipulation, adding a visual world model consistently improves the performance of the base proposal policy. This holds for every world model tested.
The paper benchmarks these models (grouped by type):
Image generators (single-frame prediction):
Video generators (multi-frame prediction, text-conditioned):
Navigation world models (viewpoint/trajectory-conditioned):
Wan2.2 A14B (14B active parameters, mixture-of-experts) achieves 59.53% AR SR in zero-shot mode. Wan2.1† (1.3B parameters, post-trained) achieves 62.98%. The smaller post-trained model beats the larger zero-shot model by 3.45 percentage points. This is the paper's most striking result about post-training: adaptation beats scale in this regime.
Table 4 compares models post-trained on front-view vs. panoramic input. Panoramic input provides a 360° field of view, giving richer spatial context. Surprisingly, panoramic input does not consistently improve performance across all settings:
The likely explanation: converting panoramic equirectangular images to the perspective field-of-view used during evaluation introduces a resolution loss. The resolution degradation sometimes outweighs the benefit of the richer spatial context.
Interactive table: one row per world model, with bars showing success rate on each task. Post-trained models are highlighted.
World-In-World presents the first scaling laws for world models in embodied settings. Two axes of scaling are studied: training-time data scaling (how much post-training data?) and inference-time compute scaling (how many rollouts per decision step?).
The paper post-trains three models (Wan2.1†, Wan2.2†, SVD†) on datasets of varying size: 400, 2K, 4K, 20K, 40K, and 80K action-observation examples. Each model is post-trained for exactly 1 epoch, so total training compute scales linearly with dataset size.
The results (Figure 6) show a clear, consistent upward trend:
The second scaling axis is the number of world-model inferences (rollouts) per decision step. In the proposal-simulate-revise framework, the agent can generate M candidate plans and simulate L steps for each. More rollouts = more candidate futures to choose from = better decisions.
The paper varies the average number of world-model inferences per episode and plots SR (Figure 7):
The two scaling axes are complementary:
The optimal strategy combines both: post-train with as much action-observation data as available, then allocate as much inference-time compute as the time budget allows. Neither axis alone is sufficient — a poorly trained model benefits less from more rollouts (garbage in, garbage out), and a well-trained model with too few rollouts may miss the best plan.
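The intuition behind inference-time scaling is the same as best-of-M sampling: the more candidates you score, the better the best one tends to be, with diminishing returns. The toy simulation below illustrates only that statistical effect. It assumes each candidate plan's quality is an independent uniform random number, which says nothing about the paper's actual curves, and all names are hypothetical.

```python
import random
from typing import List


def average_best_of_m(num_trials: int, max_m: int, seed: int = 0) -> List[float]:
    """Average quality of the best plan among the first m candidates, for m = 1..max_m.

    Toy assumption: each candidate's quality is i.i.d. uniform on [0, 1).
    """
    rng = random.Random(seed)
    totals = [0.0] * max_m
    for _ in range(num_trials):
        running_best = float("-inf")
        for m in range(max_m):
            running_best = max(running_best, rng.random())
            totals[m] += running_best
    return [t / num_trials for t in totals]
```

For uniform scores the expected best of m candidates is m / (m + 1), so going from one rollout to two helps far more than going from seven to eight. A world model with poor controllability corresponds to a scorer that cannot tell good plans from bad ones, in which case adding candidates buys much less.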
Interactive figure. Left: data scaling (SR vs. post-training examples). Right: inference-time scaling (SR vs. rollouts per step).
The paper includes several important ablation studies that reveal which design decisions matter most. Let's walk through each one.
This is the flagship finding. The paper plots SR against two metrics:
Figure 5(a) shows SR vs. generation quality: weak, noisy correlation. R² is low. Models with similar quality scores have widely varying success rates.
Figure 5(b) shows SR vs. controllability: much tighter correlation. Models that faithfully translate actions into visual predictions consistently achieve higher SR. Post-trained models cluster in the high-controllability, high-SR region.
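If you want to run the same comparison on your own models, the R² of a simple linear fit is just the squared Pearson correlation between the metric and SR. The helper below is a minimal sketch with illustrative names; it deliberately contains no numbers from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally sized samples."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R^2 of a one-variable linear fit, equal to the squared Pearson correlation."""
    return pearson_r(xs, ys) ** 2
```

Feed it one (metric, SR) pair per model: once with visual quality on the x-axis and once with controllability. The claim in this section is that the second R² comes out noticeably higher than the first.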
The revision policy π_revision scores candidate plans. The paper compares two revision strategies for ImageNav:
Results (Table 5): Combining an action-conditioned world model with a simple LPIPS-based revision policy yields the best results. SVD† with LPIPS revision achieves 47.92% SR and 39.82 SPL, compared with 43.05% SR and 30.96 SPL for VLM-only revision with no world model. Even SVD† with VLM revision achieves 45.14% SR.
A natural question: would it be better to skip the world model entirely and invest in a better base policy? Table 5 addresses this for ImageNav:
Adding a world model gives a 10-12 pp boost. To get the same improvement from a better base policy alone (without imagined rollouts), you would need a dramatically more capable VLM — and it's not clear such a model exists.
Models post-trained on synthetic Habitat Synthetic Scenes Dataset (HSSD) and evaluated on real-world scanned HM3D/MP3D scenes:
The synthetic-to-real gap is 1-4 pp for AR and 3-5 pp for ImageNav. Remarkably small. The action-conditioned representations learned during post-training generalize across scene distributions, even from synthetic to real scanned environments.
Every paper has blind spots. Here are the ones worth noting:
Interactive figure: each dot represents a model configuration. Left: SR vs. visual quality (weak R²). Right: SR vs. controllability (strong R²).
World-In-World forces a paradigm shift in how we evaluate world models. Before this paper, the field optimized for FID and aesthetic scores — making predictions look like real videos. After this paper, the question becomes: do these predictions help agents act?
The answers are nuanced: world models help substantially for perception and navigation, modestly for question answering, and barely for manipulation. Controllability matters more than visual quality. Post-training on action-observation data is cheap and effective. More inference-time compute reliably buys more success.
The analogy that sticks: before World-In-World, we were judging chess engines by how pretty their board renderings look. Now we are judging them by whether they win the game.