Zhang, Jiang, Dai, Lu et al. — JHU / PKU / Princeton / MIT / Harvard, ICLR 2026 (Oral)

World-In-World: World Models in a Closed-Loop World

The first open benchmark that evaluates generative world models by what actually matters — whether they help embodied agents succeed at real tasks, not how pretty their videos look.

Prerequisites: Video diffusion models + Model Predictive Control (MPC) + Embodied AI basics
10 chapters · 7+ simulations

Chapter 0: The Problem

You have a state-of-the-art video generation model: Sora, Wan, Cosmos, or Runway Gen4. It can synthesize photorealistic, temporally coherent videos of indoor scenes, outdoor environments, even robotic manipulation sequences. The visual quality is stunning: low FID (lower is better), high aesthetic scores, and strong VBench results.

Now you give this model to an embodied agent — a robot navigating a house, an arm sliding a block onto a target. The agent uses the model to imagine "what happens if I turn left?" before committing to an action. This is model-based planning: simulate possible futures, pick the best one, act.

Here is the uncomfortable question nobody was answering: do these beautiful imagined futures actually help the agent succeed?

The Open-Loop Evaluation Trap

Before World-In-World, every benchmark for world models worked like this:

  1. Give the model an image and an action (or text prompt).
  2. The model generates a predicted future frame (or video).
  3. Compare the generated frame to a ground-truth frame using visual quality metrics: FID, LPIPS, SSIM, aesthetic scores.

This is open-loop evaluation. The agent never acts on its predictions. There is no environment feedback. The model generates one video and you judge how pretty it looks.

The fundamental gap: Open-loop evaluation tells you how realistic a world model's predictions look. It says nothing about whether those predictions are useful for decision-making. A model could produce visually stunning but physically impossible futures — a hallucinated door that doesn't exist, a table that clips through the wall — and still score well on FID. But an agent that plans based on those hallucinations will fail catastrophically.

Consider a concrete example. An agent is in Habitat-Sim, tasked with identifying a heavily occluded object. It needs to decide: turn left or turn right? It uses a world model to imagine both futures. The model that generates controllable, physically consistent videos (even if slightly blurry) helps the agent see around the occlusion. The model that generates beautiful but uncontrollable videos (ignoring the intended action, reverting to its web-video training priors) leaves the agent blind.

Which model wins on VBench? The beautiful one. Which model actually helps the agent? The controllable one.

Closed-Loop: What Actually Matters

Closed-loop evaluation works completely differently:

  1. The agent observes the real environment.
  2. It proposes multiple candidate action plans.
  3. The world model simulates each plan — predicting what will happen.
  4. The agent picks the best plan and executes it in the real environment.
  5. The environment returns a new observation.
  6. Repeat from step 1.

The metric is not "how realistic was the imagined future?" It is "did the agent succeed at its task?" Success rate, path efficiency, answer accuracy. The things that matter in the real world.
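The six steps above can be sketched as a small evaluation harness. This is a toy illustration, not the benchmark's code: the 1-D `ToyEnv`, the perfect stand-in world model `imagine`, the `score` function, and the chosen goals are all assumptions made up for this sketch.

```python
import itertools

ACTIONS = {"left": -1, "right": +1}  # hypothetical 1-D action vocabulary


class ToyEnv:
    """Hypothetical 1-D corridor: the agent starts at 0 and must reach `goal`."""

    def __init__(self, goal):
        self.goal = goal
        self.pos = 0

    def observe(self):
        return self.pos

    def step(self, action):
        self.pos += ACTIONS[action]
        return self.pos


def imagine(obs, plan):
    """Stand-in world model: predicted position after each action in `plan`."""
    rollout, pos = [], obs
    for action in plan:
        pos += ACTIONS[action]
        rollout.append(pos)
    return rollout


def score(rollout, goal):
    """Higher is better: negative summed distance to the goal along the rollout."""
    return -sum(abs(p - goal) for p in rollout)


def run_episode(env, horizon=2, budget=10):
    """Observe, propose, simulate, select, execute; stop on success or budget."""
    for _ in range(budget):
        obs = env.observe()
        if obs == env.goal:
            return True
        plans = list(itertools.product(ACTIONS, repeat=horizon))    # propose
        best = max(plans, key=lambda p: score(imagine(obs, p), env.goal))  # simulate + select
        env.step(best[0])  # execute only the first action, then replan
    return env.observe() == env.goal


goals = [3, -4, 7, 20]  # the last goal is out of reach within the budget
results = [run_episode(ToyEnv(g)) for g in goals]
success_rate = sum(results) / len(results)
print(results, success_rate)
```

Note that the reported number is a task outcome (fraction of successful episodes), and nothing in the harness ever measures how good the imagined rollouts look.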

The analogy: Open-loop evaluation is like grading a chess engine by how realistic its board visualizations look. Closed-loop evaluation is like grading it by whether it wins the game. World-In-World is the first benchmark that plays the game.

Before World-In-World, these benchmarks existed:

None of them asked the fundamental question: do world models help agents succeed at embodied tasks in a closed-loop setting?

Open-Loop vs. Closed-Loop Evaluation

Interactive figure in the original: a toggle between the two paradigms. In open-loop, the model generates once and is scored on visual quality. In closed-loop, the agent acts, observes, and replans, and success is measured by task completion.

Why can a world model score well on visual quality benchmarks like VBench yet fail to help an embodied agent succeed?

Chapter 1: The Key Insight

World-In-World's central finding can be stated in one sentence: visual quality and task success are poorly correlated — controllability is what actually predicts embodied performance.

This is surprising. You would expect that the best-looking world model produces the most useful imagined futures. After all, if the predicted frame perfectly matches reality, the agent can plan on it as if it were ground truth. But "perfectly matches reality" requires two things: (1) the image looks realistic, and (2) the image accurately reflects the intended action. Most current video generators nail (1) and fail at (2).

The Visual Quality vs. Task Success Scatter Plot

Figure 2 of the paper is the money shot. The authors plot task success rate (SR) against generation quality (averaged aesthetic + image quality scores from VBench) for every model they benchmark. If visual quality predicted task success, you would see a tight positive correlation — points clustering along a diagonal from bottom-left to top-right.

Instead, the points form a diffuse cloud. Runway Gen4, a proprietary model with the highest visual quality, also achieves the highest success rate (~65%). But Cosmos-P2 has lower visual quality than several models that outperform it on tasks. Post-trained models (marked with †) consistently outperform their zero-shot base versions despite often having comparable or slightly lower visual quality scores.

Three surprises from the paper:
1. Visual quality alone doesn't guarantee task success. Controllability — whether the model's predictions actually reflect the intended actions — matters more.
2. Scaling post-training data is more effective than upgrading the base model. Wan2.1† with 80K action-observation examples outperforms Wan2.2 (A14B), a model with 14B active parameters, despite being much smaller.
3. More inference-time compute = better closed-loop performance. Increasing the number of imagined rollouts per decision step from 3 to 11 raises Active Recognition success from 53.36% to 60.98%.

Controllability: The Missing Metric

The paper defines controllability as the alignment between intended actions and the motions actually depicted in the model's predictions. They quantify it as 1 − LPIPS between ground-truth and predicted observations. LPIPS (Learned Perceptual Image Patch Similarity) measures perceptual distance — lower LPIPS means the predicted frame looks more like what should have happened given the action.
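To make the 1 − LPIPS definition concrete, here is a minimal sketch. Real LPIPS needs a pretrained network, so the distance below is a plainly labeled stand-in (mean absolute pixel difference), and averaging over frames is an assumption of this sketch rather than something the paper is quoted as doing. The tiny four-pixel "images" are invented for illustration.

```python
def mean_abs_distance(img_a, img_b):
    """Stand-in for LPIPS (NOT the real metric): mean absolute pixel difference.

    Images are flat lists of floats in [0, 1], so the result is also in [0, 1].
    """
    return sum(abs(a - b) for a, b in zip(img_a, img_b)) / len(img_a)


def controllability(gt_frames, pred_frames, distance=mean_abs_distance):
    """Average of 1 - distance(gt, pred) over paired frames; higher is better."""
    scores = [1.0 - distance(g, p) for g, p in zip(gt_frames, pred_frames)]
    return sum(scores) / len(scores)


gt = [[0.0, 0.5, 1.0, 0.5]]         # what executing the action really produced
faithful = [[0.0, 0.5, 1.0, 0.25]]  # prediction that followed the action
ignored = [[1.0, 0.5, 0.0, 0.5]]    # prediction that ignored the action

c_faithful = controllability(gt, faithful)
c_ignored = controllability(gt, ignored)
print(c_faithful, c_ignored)
```

Swapping in a real perceptual metric only requires passing a different `distance` callable with the same two-image signature.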

When you plot SR against controllability instead of visual quality, the correlation is dramatically tighter (Figure 5b in the paper). Models that faithfully translate actions into visual predictions consistently achieve higher task success. This makes perfect sense: if the agent asks "what happens when I turn left?" and the model shows what happens when you turn right, the generated frame can be stunning — but the plan is garbage.

Why video generators have poor controllability: Most SoTA video models (Wan, LTX-Video, Hunyuan) are trained on web-scale video data with text prompts. They learn to generate plausible videos, but they learn statistical priors from web video, not physics-faithful action-conditioned predictions. When you prompt "the camera moves forward 0.6m," the model may generate a visually plausible forward motion — but not the precise 0.6m translation the agent intended. Post-training on action-observation data from the target environment fixes this by aligning the model's outputs with actual physics.

The Design Philosophy

World-In-World is built on three principles:

  1. Task success is the primary metric. Not FID, not LPIPS, not aesthetic score. Did the agent accomplish its goal?
  2. Any world model should be pluggable. The unified action API translates between the agent's action space and whatever format the world model expects (text, camera trajectory, low-level actions).
  3. Closed-loop interaction is non-negotiable. The agent must observe → plan → act → re-observe → re-plan. One-shot prediction is not enough.

The MPC connection: World-In-World generalizes Model Predictive Control (MPC). In classical MPC, you have an explicit dynamics model f(s, a) = s', you simulate M candidate action sequences, you pick the best one based on a reward function, and you execute the first action. World-In-World does the same thing but replaces the dynamics model with a visual world model (video generator), and the reward function with a learned revision policy that can be a VLM, a perceptual similarity metric, or a rule-based heuristic.
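
For reference, the classical MPC recipe described above can be sketched on a toy problem. Everything here is an assumption for illustration: a 1-D point with dynamics f(s, a) = s + a, a reward equal to negative distance to a target, and exhaustive enumeration of all candidate sequences in place of sampling M of them.

```python
import itertools


def dynamics(state, action):
    """Explicit dynamics model f(s, a) = s' for a hypothetical 1-D point."""
    return state + action


def reward(state, target):
    """Negative distance to the target, so closer states score higher."""
    return -abs(state - target)


def mpc_step(state, target, actions=(-1, 0, 1), horizon=3):
    """Roll every candidate sequence through f and return the best one's first action."""
    best_seq, best_return = None, float("-inf")
    for seq in itertools.product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = dynamics(s, a)
            total += reward(s, target)
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq[0]


state, target = 0, 5
trajectory = [state]
for _ in range(6):
    state = dynamics(state, mpc_step(state, target))  # execute first action, then replan
    trajectory.append(state)
print(trajectory)
```

In World-In-World terms, `dynamics` is the slot a video generator fills and `reward` is the slot a revision policy fills.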

Visual Quality vs. Controllability vs. Task Success

Interactive figure in the original: two scatter plots, SR vs. visual quality (weak correlation) and SR vs. controllability (strong correlation). Each dot is a world model.

The paper finds that controllability correlates more strongly with task success than visual quality. What is controllability measuring?

Chapter 2: Closed-Loop Online Planning

World-In-World's planning framework has three phases that cycle continuously: propose, simulate, revise. Think of it as the agent asking three questions in a loop:

  1. "What could I do next?" (Propose M candidate action plans.)
  2. "What would happen if I did each one?" (Simulate each plan with the world model.)
  3. "Which future looks best?" (Score and select the best plan, then execute it.)

After execution, the agent receives a new real observation and the cycle repeats. This is the closed loop: imagination informs action, action produces reality, reality updates imagination.

Step 1: Proposal — πproposal

At time step t, the agent has observation ot (an egocentric RGB or RGB-D image) and task goal g (e.g., "identify the object in the red bounding box" or a target image for navigation). The proposal policy samples M candidate action sequences:

Â_t^(m) ~ π_proposal(A | o_t, g),    m = 1, ..., M

Each Â_t^(m) = [â_{t+1}, â_{t+2}, ..., â_{t+L}] is a sequence of L elementary actions. Each â is drawn from the agent's action vocabulary V (e.g., {move_forward, turn_left, turn_right, stop} for navigation, or continuous 7-DoF gripper commands for manipulation).

What πproposal can be:
• A VLM (vision-language model): Given the current image and goal, the VLM reasons about which actions might make progress. This is the default for AR, A-EQA, and ImageNav.
• A diffusion policy: For manipulation, a 3D diffusion policy proposes continuous gripper trajectories.
• A heuristic: For simple tasks, you can even enumerate all possible single-action plans (M = |V|).
The key: πproposal is the same base policy the agent would use without a world model. The world model adds value by evaluating plans before execution, not by generating them.
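
As a rough sketch of the heuristic end of this spectrum, the snippet below produces M plans of length L from the navigation vocabulary. Uniform random sampling is only a placeholder for a VLM or diffusion policy, and the function names are hypothetical.

```python
import random

V = ["move_forward", "turn_left", "turn_right", "stop"]  # navigation vocabulary from the text


def propose_random(M, L, rng):
    """Sample M candidate plans, each a sequence of L actions drawn uniformly from V."""
    return [[rng.choice(V) for _ in range(L)] for _ in range(M)]


def propose_all_single_actions():
    """Enumerate every single-action plan, so M = |V| and L = 1."""
    return [[a] for a in V]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
plans = propose_random(M=3, L=4, rng=rng)
single = propose_all_single_actions()
print(plans)
print(single)
```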

Step 2: Simulation — The World Model gθ

Each candidate plan must be translated into the format the world model expects. This is where the unified action API C enters (Chapter 3). The transformed control input is:

I_t^(m) = C(Â_t^(m))

The world model then generates a counterfactual rollout — a sequence of predicted future observations:

Ô_t^(m) ~ g_θ(O | o_t, I_t^(m)),    Ô_t^(m) = [ô_{t+1}^(m), ô_{t+2}^(m), ..., ô_{t+L}^(m)]

Here gθ can be any visual world model: SVD (Stable Video Diffusion), Wan2.1, Cosmos-Predict2, LTX-Video, etc. The model takes the current real observation ot and the control input It(m), and outputs L predicted frames. These frames represent "what the agent would see" if it executed plan m.

Data flow through the world model (concrete example):
Input: ot = RGB image [480×640×3] of a hallway in Habitat-Sim, It(m) = text prompt "The camera moves forward by 0.6m" (for text-conditioned models) or camera trajectory [(0,0,0°), (0.2,0,0°), (0.4,0,0°)] (for camera-conditioned models).
Process: The video diffusion model encodes ot into a latent z0, runs reverse diffusion conditioned on It(m), decodes L=4 future latents into pixel-space.
Output: Ôt(m) = 4 predicted RGB frames [480×640×3] showing what the hallway looks like 1, 2, 3, 4 steps ahead under this action plan.

Step 3: Revision — πrevision

The revision policy scores all M simulated rollouts and picks the best one:

D_t* = π_revision({(Â_t^(m), Ô_t^(m))}_{m=1}^{M}, o_t, g)

The simplest instantiation is score-and-select: compute a task-specific score S for each rollout and pick the argmax:

D_t* = Â_t^(m*),   where   m* = argmax_{m ∈ {1,...,M}} S(Â_t^(m), Ô_t^(m) | o_t, g)
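
Score-and-select is just an argmax over (plan, rollout) pairs. In the sketch below, each predicted frame is reduced to a single made-up "target visibility" number and the toy score looks only at the final frame; both choices are assumptions for illustration, not the paper's scoring functions.

```python
def score_and_select(candidates, score_fn):
    """Return the action plan whose (plan, rollout) pair gets the highest score."""
    best_plan, _ = max(candidates, key=lambda pair: score_fn(*pair))
    return best_plan


def visibility_score(plan, rollout):
    """Toy S: target visibility in the final predicted frame."""
    return rollout[-1]


# Hypothetical rollouts: one visibility fraction per predicted frame.
candidates = [
    (["turn_left", "turn_left", "move_forward", "move_forward"], [0.3, 0.5, 0.6, 0.7]),
    (["turn_right", "move_forward", "move_forward", "move_forward"], [0.2, 0.1, 0.0, 0.0]),
]

decision = score_and_select(candidates, visibility_score)
print(decision)
```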

What is S? It depends on the task:

Beyond score-and-select: The paper notes that πrevision can do more than just picking the highest-scoring candidate. It can synthesize a new decision by aggregating information across all candidates and their predicted consequences. For example, a VLM might reason: "Plan 1 shows the object is to the left, Plan 2 shows a wall ahead. Therefore I should combine: turn left first, then move forward." This makes the framework strictly more general than classical MPC, which only selects from proposed action sequences.

The Full Loop (Worked Example)

Let's trace one complete cycle for Active Recognition:

  1. t=0: Agent sees a room from an extreme angle. Target object (a lamp) is 80% occluded. VLM confidence: 35%.
  2. Propose: The VLM suggests M=2 plans: Â(1) = [turn_left, turn_left, move_forward, move_forward] and Â(2) = [turn_right, move_forward, move_forward, move_forward]. L=4 steps each.
  3. Simulate: World model generates 4 future frames for each plan. Plan 1 reveals the lamp from a better angle. Plan 2 shows a wall.
  4. Revise: VLM examines simulated frames + real observation. With Plan 1's futures, confidence rises to 72%. With Plan 2's, only 40%. Score(Plan 1) > Score(Plan 2).
  5. Execute: Agent executes the first action of Plan 1: turn_left. Environment returns new real observation o1.
  6. t=1: Re-enter the loop with o1. The lamp is now partially visible. VLM confidence: 55%. Propose new plans, simulate, revise, execute.
  7. t=3: After 3 steps, VLM confidence exceeds 95% threshold. Agent outputs its answer: "lamp." Correct!

Without the world model, the base VLM policy just greedily picks the action with highest immediate expected reward at each step. It might wander aimlessly — taking 8 steps instead of 3, or going the wrong direction entirely.

Closed-Loop Planning Pipeline

Interactive figure in the original: it steps through the propose-simulate-revise-execute cycle, starting from the observation, and shows how the agent uses imagined futures to pick better actions.

In the closed-loop planning framework, what makes the revision policy more general than classical Model Predictive Control (MPC)?

Chapter 3: The Unified Action API

Here is the problem the action API solves: the agent speaks one language ("turn left, move forward 0.6m"), but each world model speaks a different language. Some want text prompts. Some want camera trajectories as (x, y, φ) tuples. Some want raw low-level action codes. Without a translator, you would need to rewrite the entire planning pipeline for every world model you want to benchmark.

The unified action API C is that translator. It maps the agent's action sequence A into the control inputs I that the specific world model expects:

I = C(A)

This single interface lets the same agent, same proposal policy, and same revision policy work with any world model. Swap in SVD, Wan2.1, Cosmos-P2, Runway Gen4, or any future model — only C changes, not the planning logic.

Three Control Modalities

C supports three output formats, matched to the three types of conditioning that current world models accept:

1. Text Prompts (Image-and-Text-to-Video Models)

For models like Wan2.1, LTX-Video, and Hunyuan, the controller converts each primitive action into a descriptive phrase using a predefined template, then concatenates them:

Worked example — text prompt generation:
Agent's action sequence: [move_forward, move_forward, turn_left]
Template mapping: move_forward → "The camera moves forward by 0.2m"
                    turn_left → "The camera rotates left by 22.5°"
Concatenated prompt Itext: "The camera moves forward by 0.2m. The camera moves forward by 0.2m. The camera rotates left by 22.5°."

This is fed to the video diffusion model alongside the current observation image ot. The model generates L frames conditioned on this text description of the intended motion.
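
The template step can be sketched in a few lines. Only the two templates quoted in the worked example are included; the names `TEMPLATES` and `to_text_prompt` are hypothetical, and any other action would need a template that the source does not give.

```python
# Templates copied from the worked example; other actions are intentionally absent.
TEMPLATES = {
    "move_forward": "The camera moves forward by 0.2m.",
    "turn_left": "The camera rotates left by 22.5°.",
}


def to_text_prompt(actions):
    """Map each primitive action to its phrase and join the phrases with spaces."""
    return " ".join(TEMPLATES[a] for a in actions)


prompt = to_text_prompt(["move_forward", "move_forward", "turn_left"])
print(prompt)
```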

The precision problem is immediately visible: "moves forward by 0.2m" is semantically clear to a human, but the video model has no grounding for what 0.2m looks like in this specific scene. It may generate a plausible-looking forward motion that is actually 0.5m or 0.1m. This is why text-conditioned models tend to have lower controllability.

2. Camera Trajectory / Viewpoint (3D-Aware Models)

For models like SE3DS, PathDreamer, and NWM (Navigation World Model) that explicitly consume camera poses, the controller translates each action into a geometric transformation:

Worked example — camera trajectory:
Agent's action sequence: [move_forward, turn_right, move_forward]
Each move_forward: translate camera by (0.2, 0, 0) in the current heading direction.
Each turn_right: rotate azimuth by -22.5°.
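Resulting trajectory: [(x_0, y_0, φ_0), (x_1, y_1, φ_0), (x_1, y_1, φ_0 − 22.5°), (x_1 + 0.2cos(φ_0 − 22.5°), y_1 + 0.2sin(φ_0 − 22.5°), φ_0 − 22.5°)], where (x_1, y_1) = (x_0 + 0.2cosφ_0, y_0 + 0.2sinφ_0) is the position after the first move_forward.

Formally: {(x_k, y_k, φ_k)}_{k=1}^{K} with (x_k, y_k) ∈ R^2 and azimuth φ_k ∈ R.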
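
The pose integration in the worked example can be reproduced with a short function. The step size and turn angle come from the example; the sign convention (turn_left adds to the azimuth, turn_right subtracts) follows the examples in this chapter, and the function name is hypothetical.

```python
import math

STEP_M = 0.2     # forward step from the worked example
TURN_DEG = 22.5  # turn angle from the worked example


def to_camera_trajectory(actions, x=0.0, y=0.0, phi_deg=0.0):
    """Integrate actions into (x, y, azimuth in degrees) poses, starting pose included."""
    poses = [(x, y, phi_deg)]
    for a in actions:
        if a == "move_forward":
            x += STEP_M * math.cos(math.radians(phi_deg))
            y += STEP_M * math.sin(math.radians(phi_deg))
        elif a == "turn_left":
            phi_deg += TURN_DEG
        elif a == "turn_right":
            phi_deg -= TURN_DEG
        else:
            raise ValueError(f"unsupported action: {a}")
        poses.append((x, y, phi_deg))
    return poses


traj = to_camera_trajectory(["move_forward", "turn_right", "move_forward"])
for pose in traj:
    print(pose)
```

Starting from (0, 0, 0°), the final pose lands at roughly (0.385, -0.077, -22.5°), which is the last tuple of the symbolic trajectory above with the starting pose plugged in.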

Camera-conditioned models have an inherent advantage for controllability: the geometric transformation is exact. There is no ambiguity about what "0.2m forward" means — it is a precise translation vector. The model may still hallucinate scene content, but the camera motion itself is accurately specified.

3. Low-Level Actions (Action-Conditioned Models)

For models like SVD† (post-trained) that directly consume discrete or continuous actions, the controller maps the agent's action vocabulary to the world model's action vocabulary:

A ↦ Aworld

This mapping handles vocabulary mismatches: the agent might use {forward, left, right, stop} while the model was trained with {action_0, action_1, action_2, action_3}. The API maintains a bijection between the two vocabularies.
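
A bijection between two vocabularies is easy to build and check. In this sketch the pairing is simply positional, which is an assumption: the real pairing depends on how a given world model was trained. `make_bijection` is a hypothetical helper.

```python
def make_bijection(agent_vocab, world_vocab):
    """Pair the two vocabularies position by position; return forward and inverse maps."""
    if len(agent_vocab) != len(world_vocab):
        raise ValueError("vocabularies must have the same size")
    if len(set(agent_vocab)) != len(agent_vocab) or len(set(world_vocab)) != len(world_vocab):
        raise ValueError("vocabularies must not contain duplicates")
    forward = dict(zip(agent_vocab, world_vocab))
    inverse = {w: a for a, w in forward.items()}
    return forward, inverse


forward, inverse = make_bijection(
    ["forward", "left", "right", "stop"],
    ["action_0", "action_1", "action_2", "action_3"],
)
translated = [forward[a] for a in ["forward", "left", "forward"]]
round_trip = [inverse[w] for w in translated]
print(translated, round_trip)
```

The size and duplicate checks are what make the mapping invertible, so every world-model action can be traced back to exactly one agent action.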

For robotic manipulation, the mapping is more complex: the agent proposes a 7-DoF gripper trajectory [(x, y, z, roll, pitch, yaw, grip)]1:L, and the API may need to discretize or re-parameterize this to match the world model's expected format.

Why the action API is a key contribution: Without it, every (world model, task) pair requires custom engineering. With N world models and K tasks, that is N×K integration efforts. The unified API reduces this to N + K: implement C once per model (N adaptors), implement the planning loop once per task (K configurations), and any model works with any task. This is what makes the benchmark scalable to future models.

The Full Translation Pipeline

  1. Agent action: A = [move_forward, turn_left, move_forward, move_forward], drawn from the agent's discrete action vocabulary V.
  2. API dispatch: C detects the target world model type and selects the appropriate conversion path: text → Itext, camera → Icam, or low-level → Iaction.
  3. Control input:
     For Wan2.1 (text): "The camera moves forward by 0.2m. The camera rotates left by 22.5°..."
     For SE3DS (camera): {(0,0,0°), (0.2,0,0°), (0.2,0,22.5°), ...}
     For SVD† (action): [0, 3, 0, 0] (action indices)
  4. World model: gθ(O | ot, I) produces L predicted future frames. The observation ot and the intended actions are the same for every model, so the outputs are directly comparable.

Unified Action API: Translation Paths

Interactive figure in the original: it shows how the API translates a selected action sequence into the three control formats (text, camera, and low-level action representations).

Why do camera-trajectory-conditioned models tend to have higher controllability than text-prompted models?

Chapter 4: The Four Embodied Tasks

World-In-World tests world models on four tasks, each stressing a different capability. Together they cover the full spectrum of embodied intelligence: perception, navigation, language reasoning, and manipulation.

Why four tasks, not one? A world model might be excellent at predicting what is around the corner (helping perception) but terrible at predicting how objects move when pushed (failing at manipulation). A single task would give a misleadingly narrow picture. Four tasks spanning perception, navigation, reasoning, and physical interaction force a model to demonstrate general utility.

Task 1: Active Recognition (AR)

Environment: Habitat-Sim on Matterport3D scenes (29 scenes, 551 episodes).

Goal: Identify a designated target object that is heavily occluded or viewed from an extreme angle. The agent can move to get a better view.

Action space: Navigation primitives: {move_forward, turn_left, turn_right, stop}.

Budget: K = 10 decision steps maximum.

Observation: RGB image (panoramic + front view) with the target marked by a red bounding box.

How the world model helps (two ways):

  1. Perception enhancement: The VLM sees both the real observation AND the simulated future views when answering "what is this object?" Synthetic views from different angles provide additional evidence that helps resolve ambiguity from occlusion.
  2. Planning enhancement: Before committing to move_forward or turn_left, the agent simulates both options. The rollout that reveals more of the target object gets a higher score.

Metrics: Success Rate (SR) = fraction of episodes where the final predicted label matches the ground-truth label. Mean Trajectory Length = average number of steps before the agent makes its final prediction or exhausts the budget K.
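
Both metrics are simple averages over episodes. The sketch below uses invented labels and step counts purely to show the computation; the function names are hypothetical.

```python
def success_rate(predicted, ground_truth):
    """Fraction of episodes whose final predicted label matches the ground truth."""
    return sum(p == g for p, g in zip(predicted, ground_truth)) / len(ground_truth)


def mean_trajectory_length(steps_per_episode):
    """Average number of steps taken before answering or exhausting the budget."""
    return sum(steps_per_episode) / len(steps_per_episode)


# Made-up episodes for illustration only.
predicted = ["lamp", "chair", "sofa", "lamp"]
truth = ["lamp", "table", "sofa", "lamp"]
steps = [3, 10, 4, 5]

sr = success_rate(predicted, truth)
mtl = mean_trajectory_length(steps)
print(sr, mtl)
```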

Concrete numbers: The best proprietary model (Runway Gen4) achieves 64.79% SR with a mean of 4.06 steps. The best open-source post-trained model (Wan2.1†) achieves 62.98% SR with a mean of 4.71 steps. The VLM base policy without any world model achieves only 50.27% SR with a mean of 6.24 steps. Adding a world model therefore improves SR by 12.71 percentage points (Wan2.1†) to 14.52 percentage points (Runway Gen4), and the agent reaches its answer roughly 1.5 to 2.2 steps sooner.

Task 2: Image-Goal Navigation (ImageNav)

Environment: Habitat-Sim on HM3D scenes (87 scenes, 144 episodes).

Goal: Navigate to the location from which a given goal image was captured.

Action space: Same navigation primitives.

Observation: RGB image (current view).

Goal input: A single reference RGB image showing the target viewpoint.

How the world model helps: The agent simulates candidate navigation plans and compares each predicted final frame against the goal image using LPIPS. The plan whose predicted outcome looks most like the goal image is selected. This is pure perceptual planning — the world model acts as a visual lookahead.

Metrics: SR = does the agent reach within a threshold distance of the goal? SPL = Success weighted by Path Length (penalizes inefficient paths). Mean Trajectory Length.
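
SPL is worth spelling out because it combines success and efficiency in one number. The sketch assumes the standard definition, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is the binary success flag, l_i the shortest-path length, and p_i the length of the path the agent actually took. The episode values are invented, and all lengths are assumed to be positive.

```python
def spl(successes, shortest_lengths, agent_lengths):
    """Success weighted by Path Length, averaged over episodes."""
    terms = [
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, agent_lengths)
    ]
    return sum(terms) / len(terms)


# Three made-up episodes: optimal success, success with a 2x detour, failure.
value = spl(
    successes=[1, 1, 0],
    shortest_lengths=[10.0, 10.0, 10.0],
    agent_lengths=[10.0, 20.0, 12.0],
)
print(value)
```

The detour episode contributes only 0.5 even though it succeeded, which is how SPL penalizes inefficient paths.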

Concrete numbers: Best post-trained model (Wan2.1†) achieves 45.14% SR with mean path 45.8 steps. VLM base policy without world model: 35.42% SR with mean 47.5 steps. Shorter path AND higher success rate with the world model.

Task 3: Active Embodied Question Answering (A-EQA)

Environment: Habitat-Sim on HM3D scenes (54 scenes, 184 questions from OpenEQA).

Goal: Answer open-ended natural language questions (e.g., "How many cushions are on the red sofa?") by actively exploring a 3D environment.

Action space: Navigation primitives.

Observation: RGB + panoramic view.

How the world model helps: Both perception and planning, as in AR. For answering: simulated views help the VLM see objects from complementary angles to resolve references to occluded or distant objects. For navigation: rollouts guide the agent to explore views likely to reveal question-relevant information.

Metrics: Answer Score (GPT-judged answer quality on a 0-1 scale, reported below as a percentage), Mean Trajectory Length, SPL.

Concrete numbers: Best model (Wan2.2† A14B) achieves answer score 48.4 and SPL 31.9, surpassing the VLM base policy at 45.7 answer score and 29.6 SPL.

Task 4: Robotic Manipulation

Environment: CoppeliaSim on RLBench tasks (4 tasks, 50 episodes each).

Goal: Control a 7-DoF robotic arm to complete manipulation tasks like "slide the red block onto the blue target."

Action space: Continuous 7-DoF gripper commands [(x, y, z, roll, pitch, yaw, grip)].

Observation: Third-person RGB images of the workspace.

How the world model helps: The agent generates candidate gripper trajectories, the world model predicts the visual outcome of each trajectory (how objects will move), and the agent selects the trajectory most likely to achieve the specified objective. This requires the world model to understand physical dynamics — contact, friction, sliding.

Metrics: SR = task completion rate. Mean Trajectory Length.

The manipulation gap: World models struggle here. The best model (SVD†) achieves only 46.5% SR, barely above the VLM baseline at 44.5%. Why? Manipulation requires modeling contact-rich interactions, compliance, friction, and deformable objects. Current video generators trained on web video have almost no training signal for these physics. Predicting "what happens when a gripper pushes a block" is fundamentally harder than predicting "what does the hallway look like from 2 meters ahead." This is the clearest gap identified by the benchmark.

The Four Embodied Tasks

Interactive figure in the original: each task's setup (environment, action space, how the world model helps, and key results), with bars showing success rates with and without the world model.

Why do world models provide much less improvement for robotic manipulation compared to active recognition and navigation?

Chapter 5: The Post-Training Recipe

Off-the-shelf video generators like Wan2.1 produce visually appealing clips, but they are driven by text prompts and have limited fine-grained low-level control. Without adaptation, they yield only small gains on downstream embodied tasks. The post-training recipe fixes this by fine-tuning pretrained video generators on action-observation data from the target environment.

The core idea: take a powerful visual prior (from web-scale video pretraining) and teach it to respond precisely to embodied actions. The generator keeps its ability to produce realistic scenes but gains the controllability needed for closed-loop planning.

Where the Post-Training Data Comes From

The data is collected from the same simulators used for evaluation, but from disjoint scenes — the training and evaluation environments never overlap. This ensures the world model learns generalizable action-conditioned representations, not memorized scene layouts.

Data format (detailed): For Habitat-Sim, each training example consists of:
1. A panoramic input image (equirectangular projection, capturing 360° around the agent)
2. An action label from the discrete navigation vocabulary
3. The resulting panoramic observation after executing that action

The model learns to predict the resulting observation (item 3) given the input image and the action label (items 1 and 2). During post-training, the panoramic images are converted to perspective views matching the same field-of-view used during evaluation. This panorama-to-perspective conversion introduces a resolution loss that can slightly degrade generation quality (Table 4 in the paper).

The Post-Training Procedure

The recipe is straightforward LoRA fine-tuning of the video diffusion model:

  1. Freeze the pretrained video generator's main weights.
  2. Add LoRA adaptors to the attention layers (low-rank matrices that modify the model's behavior with minimal parameter overhead).
  3. Train on the action-observation dataset for 1 epoch. The loss is the standard video diffusion denoising objective, but now conditioned on action inputs instead of arbitrary text.
  4. No additional tricks — no reward models, no reinforcement learning, no data augmentation beyond standard image preprocessing.

Why post-training works so well (the authors' hypothesis): Pretrained video generators already have a powerful visual prior. They know what indoor scenes, hallways, kitchens, and object surfaces look like. What they lack is grounding: the ability to translate "move forward 0.2m" into a specific, metrically accurate camera motion in the current scene. Post-training provides exactly this grounding. A modest amount of action-conditioned data (even just 4K examples) is enough to significantly improve controllability, because the model only needs to learn the action-to-motion mapping, not how to generate scenes from scratch.

Frozen vs. Trained: What Changes?

This is the architecture of a post-trained world model:

• FROZEN: Video U-Net / DiT backbone (Wan2.1, SVD, etc.), pretrained on web-scale video. Contains the visual prior: scene geometry, appearance, temporal consistency. Parameters: 1.5B-14B depending on model.
• TRAINED: LoRA adaptors on attention layers, fine-tuned on action-observation data. Learns: action grounding, metric accuracy, environment-specific dynamics. Parameters: ~0.5% of total (minimal overhead).
• FROZEN: VAE decoder, which converts latents back to pixel space. Unchanged from pretraining.

The elegant part: the frozen backbone provides rich visual understanding, while the tiny LoRA adaptor provides action controllability. This separation means post-training is cheap — 1 epoch on 40K examples is sufficient — and the visual quality from pretraining is largely preserved.
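
The frozen-plus-adaptor split can be illustrated with the standard LoRA update, W_eff = W + (alpha / r) · B · A, where W stays frozen and only the low-rank factors A and B are trained. The sketch below uses plain Python lists and tiny made-up matrices; the 4096 × 4096 layer with rank 16 is a hypothetical single layer, so its parameter fraction is not the whole-model ~0.5% figure quoted above.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, A, B, alpha, r):
    """W_eff = W + (alpha / r) * B @ A, with W frozen and only A, B trainable."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def lora_param_fraction(d_out, d_in, r):
    """Trainable LoRA parameters relative to the frozen layer's parameters."""
    return r * (d_in + d_out) / (d_out * d_in)


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight
A = [[1.0, 2.0, 3.0]]                                    # r x d_in, trainable
B_zero = [[0.0], [0.0], [0.0]]                           # d_out x r, zero at initialization
B_trained = [[0.5], [0.0], [0.0]]                        # made-up values after training

W_init = lora_effective_weight(W, A, B_zero, alpha=1.0, r=1)
W_tuned = lora_effective_weight(W, A, B_trained, alpha=1.0, r=1)
fraction = lora_param_fraction(d_out=4096, d_in=4096, r=16)
print(W_init == W, W_tuned[0], fraction)
```

With B initialized to zero the adapted layer starts out identical to the pretrained one, which is one reason the visual prior survives the early stages of fine-tuning.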

Cross-Domain Transfer

The paper also tests whether post-training transfers across scene distributions. Models post-trained on the synthetic Habitat Synthetic Scenes Dataset (HSSD) and tested on real-scanned HM3D/MP3D scenes still show clear gains:

The synthetic-to-real gap exists but is smaller than you might expect. Post-training learns action-conditioned visual representations that transfer — consistent with prior work on adaptable world models (Gao et al., 2025).

Post-Training Effect on Controllability

Interactive figure in the original: a slider varies the amount of post-training data and shows how controllability (1 − LPIPS between ground-truth and predicted observations) improves, and how this maps to task success rate.

Why can a small amount of post-training data (even 4K examples) dramatically improve a pretrained video generator's embodied performance?

Chapter 6: Benchmark Results

The paper evaluates 6+ video generators, 3+ task-focused world models, and their post-trained variants across all four tasks. Let's walk through the results systematically.

The Leaderboard at a Glance

Here are the key numbers from Tables 1-3, grouped by insight:

Finding 1: World Models Consistently Help

Across AR, A-EQA, ImageNav, and Manipulation, adding a visual world model consistently improves the performance of the base proposal policy. This holds for every world model tested.

Base policy performance (no world model) vs. best world model augmented:
AR: 50.27% → 64.79% (+14.52 pp) with Runway Gen4
ImageNav: 35.42% → 45.14% (+9.72 pp) with Wan2.1†
A-EQA: Answer score 45.7 → 48.4 (+2.7) with Wan2.2† A14B
Manipulation: 44.5% → 46.5% (+2.0 pp) with SVD† (modest gain)

The improvement is largest for perception-heavy tasks (AR) and smallest for manipulation. Mean trajectory length also improves — agents reach their goals in fewer steps.
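
The gains quoted above are plain differences between the base policy and the best world-model-augmented result. This short check recomputes them from the numbers in the list; the dictionary layout is just a convenience for the sketch.

```python
# (base policy, best with world model), copied from the list above.
results = {
    "AR (SR %)": (50.27, 64.79),
    "ImageNav (SR %)": (35.42, 45.14),
    "A-EQA (answer score)": (45.7, 48.4),
    "Manipulation (SR %)": (44.5, 46.5),
}

# Round to two decimals to absorb floating-point noise in the subtraction.
gains = {task: round(best - base, 2) for task, (base, best) in results.items()}
largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)

for task, gain in gains.items():
    print(f"{task}: +{gain}")
print("largest:", largest, "| smallest:", smallest)
```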

Finding 2: The Model Zoo

The paper benchmarks these models (grouped by type):

Image generators (single-frame prediction):

Video generators (multi-frame prediction, text-conditioned):

Navigation world models (viewpoint/trajectory-conditioned):

Finding 3: Size Isn't Everything

Wan2.2 A14B (14B active parameters, mixture-of-experts) achieves 59.53% AR SR in zero-shot mode. Wan2.1† (1.3B parameters, post-trained) achieves 62.98%. The smaller post-trained model beats the larger zero-shot model by 3.45 percentage points. This is the paper's most striking result about post-training: adaptation beats scale in this regime.

Why adaptation beats scale: The 14B model has more parameters but no exposure to action-conditioned data. It generates beautiful hallways, but when told "turn left 22.5°," it may generate a visually plausible scene that corresponds to turning 45° or not turning at all. The 1.3B post-trained model has seen thousands of (action, resulting view) pairs and has learned precise action-to-motion mappings. Its predictions are less pretty but more accurate — and accuracy is what matters for planning.

Finding 4: Front View vs. Panorama

Table 4 compares models post-trained on front-view vs. panoramic input. Panoramic input provides a 360° field of view, giving richer spatial context. Surprisingly, panoramic input does not consistently improve performance across all settings:

The likely explanation: converting panoramic equirectangular images to the perspective field-of-view used during evaluation introduces a resolution loss. The resolution degradation sometimes outweighs the benefit of the richer spatial context.

Model Leaderboard: Success Rate Across Tasks

Each row is a world model. The bars show success rate on each task. Post-trained models are highlighted.

Wan2.1† (1.3B, post-trained) outperforms Wan2.2 A14B (14B, zero-shot) on Active Recognition. What does this tell us about the relative value of scale vs. adaptation for embodied world models?

Chapter 7: Scaling Laws

World-In-World presents the first scaling laws for world models in embodied settings. Two axes of scaling are studied: training-time data scaling (how much post-training data?) and inference-time compute scaling (how many rollouts per decision step?).

Data Scaling: More Post-Training Data = Better Performance

The paper post-trains three models (Wan2.1†, Wan2.2†, SVD†) on datasets of varying size: 400, 2K, 4K, 20K, 40K, and 80K action-observation examples. Each model is post-trained for exactly 1 epoch (the total compute scales linearly with dataset size).
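Because every run is exactly one epoch, training compute is proportional to dataset size. The short sketch below spells out the multiples implied by the six sizes. The batch size used in `steps_per_epoch` is a hypothetical value chosen only to illustrate the arithmetic; the text does not state the paper's actual batch size.

```python
import math

# Dataset sizes from the paragraph above (action-observation examples).
dataset_sizes = [400, 2_000, 4_000, 20_000, 40_000, 80_000]

# One epoch per run, so compute relative to the smallest run is just the size ratio.
smallest = dataset_sizes[0]
relative_compute = [n // smallest for n in dataset_sizes]

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one epoch. batch_size is an assumption, not from the paper."""
    return math.ceil(num_examples / batch_size)

for n, multiple in zip(dataset_sizes, relative_compute):
    print(f"{n:>6} examples: {multiple:>3}x compute, "
          f"{steps_per_epoch(n, batch_size=8)} steps at batch size 8 (assumed)")
```

The sweep spans a 200x range in compute, which is why the results are usually read on a logarithmic data axis.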

The results (Figure 6) show a clear, consistent upward trend:

The diminishing-returns pattern: The steepest improvement happens in the first 4K examples. From 0 to 4K, the model learns the basic action-to-motion mapping (which direction is "left"? what does "0.2m forward" look like?). From 4K to 80K, it refines this mapping across diverse scenes and edge cases. This matches the intuition from Chapter 5: the visual prior is already strong from pretraining; the model just needs to calibrate its action grounding.

Crucially, no model has saturated at 80K. The curves are still rising. More post-training data would likely continue to improve performance. This is the "first data scaling law for world models in embodied settings."

Inference-Time Scaling: More Rollouts = Better Decisions

The second scaling axis is the number of world-model inferences (rollouts) per decision step. In the proposal-simulate-revise framework, the agent can generate M candidate plans and simulate L steps for each. More rollouts = more candidate futures to choose from = better decisions.
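The loop described above can be sketched in a few lines. Everything here is a toy stand-in under stated assumptions: the "world" is a position on a 1-D line, `propose` samples random plans, `simulate` plays the role of the world model, and `score` is a hand-written distance to the goal. None of these functions come from the paper's code; they only show where M and L enter the computation.

```python
import random

def propose(obs, goal, num_plans, horizon, rng):
    """Toy proposal policy: M random plans, each L unit steps in {-1, 0, +1}."""
    return [[rng.choice((-1, 0, 1)) for _ in range(horizon)] for _ in range(num_plans)]

def simulate(obs, plan):
    """Toy world model: predicted position after each action of the plan."""
    predicted, position = [], obs
    for action in plan:
        position += action
        predicted.append(position)
    return predicted

def score(plan, predicted, obs, goal):
    """Toy scoring function: higher when the last predicted state is nearer the goal."""
    return -abs(predicted[-1] - goal)

def decide(obs, goal, num_plans=4, horizon=3, rng=None):
    """One decision step: propose M plans, simulate each, keep the best-scoring one."""
    if rng is None:
        rng = random.Random(0)
    plans = propose(obs, goal, num_plans, horizon, rng)
    rollouts = [simulate(obs, plan) for plan in plans]  # M world-model calls
    scores = [score(p, o, obs, goal) for p, o in zip(plans, rollouts)]
    best = max(range(num_plans), key=scores.__getitem__)
    return plans[best], scores[best]

few = decide(obs=0, goal=3, num_plans=2, rng=random.Random(0))
many = decide(obs=0, goal=3, num_plans=8, rng=random.Random(0))
print("M=2 best score:", few[1], "| M=8 best score:", many[1])
```

With the same seed, the 8 plans include the 2 plans from the smaller run, so the best score can only stay the same or improve. That is the mechanism behind inference-time scaling in this toy setting; with a real world model the benefit also depends on how accurate the rollouts and the scorer are.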

The paper varies the average number of world-model inferences per episode and plots SR (Figure 7):

The compute-performance tradeoff: Each additional rollout costs one forward pass through the video diffusion model (typically 20-50 denoising steps). For SVD (1.5B params), each rollout takes ~2-5 seconds on an A100. For Wan2.1 (5B params), ~5-15 seconds. Increasing from 3 to 11 rollouts per step means ~3.7x more wall-clock time at each decision step.

But the improvement is substantial: +7 pp of SR from more thinking time. This is the embodied-world-model analogue of inference-time compute scaling in language models (more CoT tokens = better reasoning). The agent literally benefits from "thinking harder" — imagining more possible futures before acting.

Putting Both Scaling Laws Together

The two scaling axes are complementary:

The optimal strategy combines both: post-train with as much action-observation data as available, then allocate as much inference-time compute as the time budget allows. Neither axis alone is sufficient — a poorly trained model benefits less from more rollouts (garbage in, garbage out), and a well-trained model with too few rollouts may miss the best plan.

Worked example — inference-time compute tradeoff:
Scenario: Active Recognition episode, budget K=10 steps. SVD† takes 3 seconds per rollout.

Option A: M=2 rollouts per step. Per-step time: 6s. Total: 60s. SR: ~53%.
Option B: M=6 rollouts per step. Per-step time: 18s. Total: 180s. SR: ~58%.
Option C: M=11 rollouts per step. Per-step time: 33s. Total: 330s. SR: ~61%.

Going from 2 to 11 rollouts costs 5.5x more time but gains +8 pp SR. Depending on the application (real-time robot vs. offline planning), this tradeoff may or may not be worth it. But the existence of the scaling law — more compute reliably buys more success — is itself a significant finding.
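
The arithmetic in this worked example is easy to reproduce. The sketch below uses only the numbers quoted in the three options (3 seconds per rollout, K=10 steps, approximate SR values) and also computes the marginal SR gained per extra rollout, which is an added derived quantity rather than something reported in the paper.

```python
SECONDS_PER_ROLLOUT = 3   # SVD† in the scenario above
STEPS_PER_EPISODE = 10    # step budget K

# name -> (rollouts per step M, approximate SR in %), as quoted in the options above
options = {"A": (2, 53), "B": (6, 58), "C": (11, 61)}

def episode_seconds(rollouts_per_step):
    """Total world-model time for one episode."""
    return rollouts_per_step * SECONDS_PER_ROLLOUT * STEPS_PER_EPISODE

table = {name: (episode_seconds(m), sr) for name, (m, sr) in options.items()}
base_seconds, base_sr = table["A"]
for name, (seconds, sr) in table.items():
    print(f"Option {name}: {seconds}s, {seconds / base_seconds:.1f}x time, "
          f"+{sr - base_sr} pp SR vs. A")

# Marginal SR gain per additional rollout between consecutive options.
names = list(options)
marginal = [
    (options[b][1] - options[a][1]) / (options[b][0] - options[a][0])
    for a, b in zip(names, names[1:])
]
print("pp of SR per extra rollout:", marginal)
```

The marginal gain drops from 1.25 pp per extra rollout (A to B) to 0.6 pp (B to C), which is the diminishing-returns shape in numeric form.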

Dual Scaling Laws: Data + Inference

Left: data scaling (SR vs. post-training examples). Right: inference-time scaling (SR vs. rollouts per step).

The paper shows two independent scaling laws. What does each one improve, and why are they complementary?

Chapter 8: Ablations & Design Choices

The paper includes several important ablation studies that reveal which design decisions matter most. Let's walk through each one.

Ablation 1: Controllability vs. Visual Quality

This is the flagship finding. The paper plots SR against two metrics:

Figure 5(a) shows SR vs. generation quality: weak, noisy correlation. R² is low. Models with similar quality scores have widely varying success rates.

Figure 5(b) shows SR vs. controllability: much tighter correlation. Models that faithfully translate actions into visual predictions consistently achieve higher SR. Post-trained models cluster in the high-controllability, high-SR region.

The interpretation: When you post-train a model, two things change. Visual quality may stay the same or even decrease slightly (because LoRA adaptation can slightly degrade the pretrained generation ability). But controllability increases dramatically (because the model learns precise action-to-motion mappings). The fact that SR correlates with controllability, not quality, tells us that controllability is the bottleneck for embodied utility. Current models are already "good enough" visually; what they lack is the ability to faithfully execute the agent's intended actions in their predictions.

Ablation 2: Effect of Revision Policy

The revision policy $\pi_{\text{revision}}$ scores candidate plans. The paper compares two revision strategies for ImageNav:

Results (Table 5): Augmenting the planner with an action-conditioned world model AND applying a simple LPIPS-based revision policy yields the best results. SVD† with LPIPS revision achieves 47.92% SR and 39.82 SPL, compared to 43.05% SR and 30.96 SPL with VLM-only revision and no world model. Even SVD† with VLM revision achieves 45.14% SR.

The takeaway: A simple perceptual similarity metric (LPIPS) can be a better revision policy than a complex VLM for tasks where the goal is defined by an image. The LPIPS revision policy is also much faster (no LLM forward pass), making it practical for real-time applications. This demonstrates that the revision policy design space is rich and task-dependent.
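
A similarity-based revision policy is simple enough to sketch. The version below is an assumption-laden illustration, not the paper's implementation: images are flat lists of pixel intensities, the distance is plain mean squared error used as a stand-in for LPIPS (it is not LPIPS), and each plan is scored by its final predicted frame only. The distance function is passed as a parameter so a real perceptual metric could be swapped in.

```python
def mean_squared_distance(image_a, image_b):
    """Stand-in for a perceptual distance such as LPIPS. This is NOT LPIPS."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)

def select_plan(predicted_rollouts, goal_image, distance=mean_squared_distance):
    """Pick the rollout whose final predicted frame is closest to the goal image.

    Scoring only the last frame is an assumption made for this sketch.
    """
    distances = [distance(frames[-1], goal_image) for frames in predicted_rollouts]
    best_index = min(range(len(distances)), key=distances.__getitem__)
    return best_index, distances

# Two toy rollouts of two 4-pixel frames each; values are made up for illustration.
goal = [1.0, 0.0, 1.0, 0.0]
rollouts = [
    [[0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.5, 0.0]],  # ends halfway to the goal view
    [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.9, 0.0]],  # ends close to the goal view
]
best, distances = select_plan(rollouts, goal)
print("selected plan:", best)
```

No language-model forward pass is involved, which is where the speed advantage over a VLM-based scorer comes from.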

Ablation 3: World Model Augmentation vs. Better Policy

A natural question: would it be better to skip the world model entirely and invest in a better base policy? Table 5 addresses this for ImageNav:

Adding a world model gives a 10-12 pp boost. To get the same improvement from a better base policy alone (without imagined rollouts), you would need a dramatically more capable VLM — and it's not clear such a model exists.

Ablation 4: Cross-Domain Transfer

Models are post-trained on the synthetic Habitat Synthetic Scenes Dataset (HSSD) and evaluated on real-world scanned HM3D/MP3D scenes:

The synthetic-to-real gap is 1-4 pp for AR and 3-5 pp for ImageNav, which is remarkably small. The action-conditioned representations learned during post-training generalize across scene distributions, even from synthetic to real scanned environments.

Why cross-domain transfer works: The action-to-motion mapping is largely environment-independent. "Move forward 0.2m" produces roughly the same visual change whether you are in a synthetic kitchen or a real scanned living room. The visual appearance differs, but the motion pattern is universal. Post-training teaches the latter, and the frozen pretrained backbone handles the former.

What the Paper Doesn't Say (Assumptions and Limitations)

Every paper has blind spots. Here are the ones worth noting:

Controllability vs. Visual Quality: The Key Ablation

Each dot represents a model configuration. Left plot shows SR vs. visual quality (weak R²). Right shows SR vs. controllability (strong R²).

The paper shows that a simple LPIPS-based revision policy can outperform a complex VLM-based revision policy for ImageNav. Why does this happen?

Chapter 9: Connections

Cheat Sheet: Key Equations

Proposal (Eq. 1): $\hat{A}_t^{(m)} \sim \pi_{\text{proposal}}(A \mid o_t, g), \quad m = 1, \dots, M$
• $o_t$: current observation (RGB/RGB-D image)
• $g$: task goal (text, image, or bounding box)
• $\hat{A}_t^{(m)}$: $m$-th candidate action plan of horizon $L$
• $M$: beam width (number of candidate plans)

Simulation (Eq. 2): $\hat{O}_t^{(m)} \sim g_\theta(O \mid o_t, I_t^{(m)})$
• $I_t^{(m)} = C(\hat{A}_t^{(m)})$: control input from the unified action API
• $g_\theta$: visual world model (video generator)
• $\hat{O}_t^{(m)} = [\hat{o}_{t+1}, \dots, \hat{o}_{t+L}]$: predicted future observations

Revision (Eq. 3): $D_t^{*} = \pi_{\text{revision}}\big(\{(\hat{A}_t^{(m)}, \hat{O}_t^{(m)})\}_{m=1}^{M},\, o_t,\, g\big)$
• $D_t^{*}$: best decision at time $t$
• Can be score-and-select (Eq. 4) or synthesis-based

Score-and-Select (Eq. 4): $m^{*} = \arg\max_{m} S(\hat{A}_t^{(m)}, \hat{O}_t^{(m)} \mid o_t, g)$
• $S$: task-specific scoring function (VLM confidence, LPIPS to goal, etc.)

Action API: $I = C(A)$, which translates agent actions to model-specific control inputs
• Text: concatenated action phrases
• Camera: $\{(x_k, y_k, \varphi_k)\}_{k=1}^{K}$
• Low-level: mapped action codes $A_{\text{world}}$
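
As an illustration of what the converter C might look like for the text and camera modes, here is a minimal sketch. The action names, the phrase wording, and the coordinate convention (heading in degrees, counterclockwise positive, starting along +x) are all assumptions made for this example; only the step sizes (0.2 m forward, 22.5° turns) come from the examples used earlier in this lesson.

```python
import math

# Hypothetical discrete action set; step sizes follow the lesson's examples.
TURN_DEG = 22.5
FORWARD_M = 0.2
PHRASES = {
    "forward": f"move forward {FORWARD_M} m",
    "turn_left": f"turn left {TURN_DEG} degrees",
    "turn_right": f"turn right {TURN_DEG} degrees",
}

def to_text(plan):
    """Text mode: concatenate one phrase per action."""
    return ", ".join(PHRASES[action] for action in plan)

def to_camera_trajectory(plan, start=(0.0, 0.0, 0.0)):
    """Camera mode: integrate actions into a list of (x, y, phi) poses."""
    x, y, phi = start
    poses = []
    for action in plan:
        if action == "forward":
            x += FORWARD_M * math.cos(math.radians(phi))
            y += FORWARD_M * math.sin(math.radians(phi))
        elif action == "turn_left":
            phi = (phi + TURN_DEG) % 360.0
        elif action == "turn_right":
            phi = (phi - TURN_DEG) % 360.0
        else:
            raise ValueError(f"unknown action: {action}")
        poses.append((x, y, phi))
    return poses

plan = ["turn_left"] * 4 + ["forward"]
print(to_text(plan))
print(to_camera_trajectory(plan)[-1])
```

The point of a unified API like this is that the same plan can be handed to a text-conditioned video generator or a camera-conditioned one without changing the planner.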

Key Takeaways for Practitioners

When to use a visual world model for embodied planning:
Use when: The task involves perception under uncertainty (occlusion, extreme viewpoints), navigation planning (which path looks best?), or when the base policy is weak and needs augmentation.
Don't use when: The task requires precise physical dynamics (contact-rich manipulation), real-time decisions (<100ms per step), or when the action space is trivial (1-2 options).
Post-train always: Even a few thousand action-observation examples dramatically improve controllability. Zero-shot video generators are not competitive with post-trained ones for embodied tasks.
Budget inference compute wisely: More rollouts help, but with diminishing returns. The steepest gains come from going from 2-3 rollouts to 6-8. Beyond 10-11, improvements plateau.

Open Problems

The Big Picture

World-In-World forces a paradigm shift in how we evaluate world models. Before this paper, the field optimized for FID and aesthetic scores — making predictions look like real videos. After this paper, the question becomes: do these predictions help agents act?

The answers are nuanced: world models help substantially for perception and navigation, modestly for question answering, and barely for manipulation. Controllability matters more than visual quality. Post-training on action-observation data is cheap and effective. More inference-time compute reliably buys more success.

The analogy that sticks: before World-In-World, we were judging chess engines by how pretty their board renderings look. Now we are judging them by whether they win the game.

You are designing an embodied agent for a new navigation task. You have a pretrained Wan2.1 video generator (5B params) and can afford either (a) upgrading to Wan2.2 (14B params) or (b) collecting 40K action-observation examples from the target environment and post-training Wan2.1. Based on the paper's findings, which strategy should you choose?