World-In-World — Veanors

Chapter 0: The Problem

You have a state-of-the-art video generation model: Sora, Wan, Cosmos, Runway Gen4. It can synthesize photorealistic, temporally coherent videos of indoor scenes, outdoor environments, even robotic manipulation sequences. The visual quality is stunning: low FID, strong aesthetic scores, and a pass on VBench with flying colors.

Now you give this model to an embodied agent — a robot navigating a house, an arm sliding a block onto a target. The agent uses the model to imagine "what happens if I turn left?" before committing to an action. This is model-based planning: simulate possible futures, pick the best one, act.

Here is the uncomfortable question nobody was answering: do these beautiful imagined futures actually help the agent succeed?

The Open-Loop Evaluation Trap

Before World-In-World, nearly every benchmark for world models worked like this:

1. Give the model an image and an action (or text prompt).
2. The model generates a predicted future frame (or video).
3. Compare the generated frame to a ground-truth frame using visual quality metrics: FID, LPIPS, SSIM, aesthetic scores.

This is open-loop evaluation. The agent never acts on its predictions. There is no environment feedback. The model generates one video and you judge how pretty it looks.

The fundamental gap: Open-loop evaluation tells you how realistic a world model's predictions look. It says nothing about whether those predictions are useful for decision-making. A model could produce visually stunning but physically impossible futures — a hallucinated door that doesn't exist, a table that clips through the wall — and still score well on FID. But an agent that plans based on those hallucinations will fail catastrophically.

Consider a concrete example. An agent is in Habitat-Sim, tasked with identifying a heavily occluded object. It needs to decide: turn left or turn right? It uses a world model to imagine both futures. The model that generates controllable, physically consistent videos (even if slightly blurry) helps the agent see around the occlusion. The model that generates beautiful but uncontrollable videos (ignoring the intended action, reverting to its web-video training priors) leaves the agent blind.

Which model wins on VBench? The beautiful one. Which model actually helps the agent? The controllable one.

Closed-Loop: What Actually Matters

Closed-loop evaluation works completely differently:

1. The agent observes the real environment.
2. It proposes multiple candidate action plans.
3. The world model simulates each plan, predicting what will happen.
4. The agent picks the best plan and executes it in the real environment.
5. The environment returns a new observation.
6. Repeat from step 1.

The metric is not "how realistic was the imagined future?" It is "did the agent succeed at its task?" Success rate, path efficiency, answer accuracy. The things that matter in the real world.

The analogy: Open-loop evaluation is like grading a chess engine by how realistic its board visualizations look. Closed-loop evaluation is like grading it by whether it wins the game. World-In-World is the first benchmark that plays the game.

Before World-In-World, these benchmarks existed:

VBench (Huang et al., 2024): Evaluates video generation quality — aesthetics, temporal consistency, subject identity. Pure visual metrics. No agent interaction.
WorldModelBench (Li et al., 2025): Judges visual plausibility of generated worlds. Still open-loop.
WorldScore (Duan et al., 2025): Unified assessment for image+trajectory input models. Closer, but still no closed-loop agent interaction.
VP2 (Tian et al., 2023): Measures video prediction utility for planning. Simple setup, limited task diversity, older architectures.

None of them asked the fundamental question: do world models help agents succeed at embodied tasks in a closed-loop setting?

Why can a world model score well on visual quality benchmarks like VBench yet fail to help an embodied agent succeed?

• Because visual quality measures realism of generated frames, not whether the predictions are controllable, physically consistent, or useful for planning; a beautiful but uncontrollable model gives the agent hallucinated futures to plan on
• Because VBench uses the wrong image resolution
• Because embodied agents need faster inference speed than video generation allows

Chapter 1: The Key Insight

World-In-World's central finding can be stated in one sentence: visual quality and task success are poorly correlated — controllability is what actually predicts embodied performance.

This is surprising. You would expect that the best-looking world model produces the most useful imagined futures. After all, if the predicted frame perfectly matches reality, the agent can plan on it as if it were ground truth. But "perfectly matches reality" requires two things: (1) the image looks realistic, and (2) the image accurately reflects the intended action. Most current video generators nail (1) and fail at (2).

The Visual Quality vs. Task Success Scatter Plot

Figure 2 of the paper is the money shot. The authors plot task success rate (SR) against generation quality (averaged aesthetic + image quality scores from VBench) for every model they benchmark. If visual quality predicted task success, you would see a tight positive correlation — points clustering along a diagonal from bottom-left to top-right.

Instead, the points form a diffuse cloud. Runway Gen4, a proprietary model with the highest visual quality, also achieves the highest success rate (~65%). But Cosmos-P2 has lower visual quality than several models that outperform it on tasks. Post-trained models (marked with †) consistently outperform their zero-shot base versions despite often having comparable or slightly lower visual quality scores.

Three surprises from the paper:
1. Visual quality alone doesn't guarantee task success. Controllability — whether the model's predictions actually reflect the intended actions — matters more.
2. Scaling post-training data is more effective than upgrading the base model. Wan2.1†, post-trained on 80K action-observation examples, outperforms the much larger Wan2.2 (A14B), which has 14B active parameters.
3. More inference-time compute = better closed-loop performance. Increasing the number of imagined rollouts per decision step from 3 to 11 raises Active Recognition success from 53.36% to 60.98%.

Controllability: The Missing Metric

The paper defines controllability as the alignment between intended actions and the motions actually depicted in the model's predictions. They quantify it as 1 − LPIPS between ground-truth and predicted observations. LPIPS (Learned Perceptual Image Patch Similarity) measures perceptual distance — lower LPIPS means the predicted frame looks more like what should have happened given the action.

When you plot SR against controllability instead of visual quality, the correlation is dramatically tighter (Figure 5b in the paper). Models that faithfully translate actions into visual predictions consistently achieve higher task success. This makes perfect sense: if the agent asks "what happens when I turn left?" and the model shows what happens when you turn right, the generated frame can be stunning — but the plan is garbage.
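As a rough illustration of the 1 − LPIPS definition above, the sketch below averages a controllability score over predicted frames. The function `perceptual_distance` is a hypothetical stand-in (mean absolute pixel difference on tiny grayscale frames) and is not LPIPS; a real evaluation would plug a learned LPIPS network into the same slot.

```python
from statistics import mean
from typing import Callable, List, Sequence

Frame = List[List[float]]  # tiny grayscale image, pixel values in [0, 1]


def perceptual_distance(a: Frame, b: Frame) -> float:
    """Placeholder for LPIPS: mean absolute pixel difference in [0, 1].

    This is NOT LPIPS. It only mimics the contract that matters here:
    0 means identical, larger means less similar.
    """
    diffs = [abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return mean(diffs)


def controllability(
    predicted: Sequence[Frame],
    ground_truth: Sequence[Frame],
    distance: Callable[[Frame, Frame], float] = perceptual_distance,
) -> float:
    """Average of (1 - distance) over paired predicted / ground-truth frames."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need equally many predicted and ground-truth frames")
    return mean(1.0 - distance(p, g) for p, g in zip(predicted, ground_truth))


# Ground truth after "turn left": the bright column ends up on the right.
truth = [[0.0, 1.0], [0.0, 1.0]]
# A model that depicted the opposite motion: the bright column is on the left.
mirrored = [[1.0, 0.0], [1.0, 0.0]]

faithful_score = controllability([truth], [truth])
wrong_motion_score = controllability([mirrored], [truth])
```

Under this toy distance, a prediction that matches the ground truth scores highest, while a sharp frame showing the wrong motion scores lowest, regardless of how good it looks in isolation.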

Why video generators have poor controllability: Most SoTA video models (Wan, LTX-Video, Hunyuan) are trained on web-scale video data with text prompts. They learn to generate plausible videos, but they learn statistical priors from web video, not physics-faithful action-conditioned predictions. When you prompt "the camera moves forward 0.6m," the model may generate a visually plausible forward motion — but not the precise 0.6m translation the agent intended. Post-training on action-observation data from the target environment fixes this by aligning the model's outputs with actual physics.

The Design Philosophy

World-In-World is built on three principles:

Task success is the primary metric. Not FID, not LPIPS, not aesthetic score. Did the agent accomplish its goal?
Any world model should be pluggable. The unified action API translates between the agent's action space and whatever format the world model expects (text, camera trajectory, low-level actions).
Closed-loop interaction is non-negotiable. The agent must observe → plan → act → re-observe → re-plan. One-shot prediction is not enough.

The MPC connection: World-In-World generalizes Model Predictive Control (MPC). In classical MPC, you have an explicit dynamics model f(s, a) = s', you simulate M candidate action sequences, you pick the best one based on a reward function, and you execute the first action. World-In-World does the same thing but replaces the dynamics model with a visual world model (video generator), and the reward function with a learned revision policy that can be a VLM, a perceptual similarity metric, or a rule-based heuristic.

The paper finds that controllability correlates more strongly with task success than visual quality. What is controllability measuring?

• The alignment between the intended action and the motion actually depicted in the model's predicted frames, quantified as 1 − LPIPS between ground-truth and predicted observations
• The number of actions the model can execute per second
• The aesthetic quality of the generated video frames

Chapter 2: Closed-Loop Online Planning

World-In-World's planning framework has three phases that cycle continuously: propose, simulate, revise. Think of it as the agent asking three questions in a loop:

"What could I do next?" (Propose M candidate action plans.)
"What would happen if I did each one?" (Simulate each plan with the world model.)
"Which future looks best?" (Score and select the best plan, then execute it.)

After execution, the agent receives a new real observation and the cycle repeats. This is the closed loop: imagination informs action, action produces reality, reality updates imagination.

Step 1: Proposal — π_proposal

At time step t, the agent has observation o_t (an egocentric RGB or RGB-D image) and task goal g (e.g., "identify the object in the red bounding box" or a target image for navigation). The proposal policy samples M candidate action sequences:

Â_t^(m) ~ π_proposal(A | o_t, g), m = 1, ..., M

Each Â_t^(m) = [â_{t+1}, â_{t+2}, ..., â_{t+L}] is a sequence of L elementary actions. Each â is drawn from the agent's action vocabulary V (e.g., {move_forward, turn_left, turn_right, stop} for navigation, or continuous 7-DoF gripper commands for manipulation).

What π_proposal can be:
• A VLM (vision-language model): Given the current image and goal, the VLM reasons about which actions might make progress. This is the default for AR, A-EQA, and ImageNav.
• A diffusion policy: For manipulation, a 3D diffusion policy proposes continuous gripper trajectories.
• A heuristic: For simple tasks, you can even enumerate all possible single-action plans (M = |V|).
The key: π_proposal is the same base policy the agent would use without a world model. The world model adds value by evaluating plans before execution, not by generating them.
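The heuristic option above is easy to make concrete. The following is a minimal sketch, assuming a discrete navigation vocabulary: `enumerate_plans` lists every plan of a given length (so a horizon of 1 gives M = |V|), and `sample_plans` draws M random plans as a crude stand-in for a VLM or diffusion policy. Both function names are illustrative and do not come from the paper.

```python
import itertools
import random
from typing import List, Sequence, Tuple

Plan = Tuple[str, ...]

# Navigation vocabulary V as listed in the text.
ACTIONS: Tuple[str, ...] = ("move_forward", "turn_left", "turn_right", "stop")


def enumerate_plans(vocab: Sequence[str], horizon: int) -> List[Plan]:
    """All |V|**horizon action sequences of the given length."""
    return list(itertools.product(vocab, repeat=horizon))


def sample_plans(
    vocab: Sequence[str], horizon: int, num_plans: int, seed: int = 0
) -> List[Plan]:
    """Draw num_plans random sequences (placeholder for a learned proposal policy)."""
    rng = random.Random(seed)
    return [
        tuple(rng.choice(vocab) for _ in range(horizon)) for _ in range(num_plans)
    ]


single_step = enumerate_plans(ACTIONS, horizon=1)           # M = |V| plans
candidates = sample_plans(ACTIONS, horizon=4, num_plans=3)  # M = 3, L = 4
```

Exhaustive enumeration grows as |V|^L, which is one reason a learned proposal policy that returns only a handful of promising plans is the practical default.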

Step 2: Simulation — The World Model g_θ

Each candidate plan must be translated into the format the world model expects. This is where the unified action API C enters (Chapter 3). The transformed control input is:

I_t^(m) = C(Â_t^(m))

The world model then generates a counterfactual rollout — a sequence of predicted future observations:

Ô_t^(m) ~ g_θ(O | o_t, I_t^(m)), Ô_t^(m) = [ô_{t+1}^(m), ô_{t+2}^(m), ..., ô_{t+L}^(m)]

Here g_θ can be any visual world model: SVD (Stable Video Diffusion), Wan2.1, Cosmos-Predict2, LTX-Video, etc. The model takes the current real observation o_t and the control input I_t^(m), and outputs L predicted frames. These frames represent "what the agent would see" if it executed plan m.

Data flow through the world model (concrete example):
Input: o_t = RGB image [480×640×3] of a hallway in Habitat-Sim, I_t^(m) = text prompt "The camera moves forward by 0.6m" (for text-conditioned models) or camera trajectory [(0,0,0°), (0.2,0,0°), (0.4,0,0°), (0.6,0,0°)] (for camera-conditioned models).
Process: The video diffusion model encodes o_t into a latent z₀, runs reverse diffusion conditioned on I_t^(m), decodes L=4 future latents into pixel-space.
Output: Ô_t^(m) = 4 predicted RGB frames [480×640×3] showing what the hallway looks like 1, 2, 3, 4 steps ahead under this action plan.

Step 3: Revision — π_revision

The revision policy scores all M simulated rollouts and picks the best one:

D_t^* = π_revision({(Â_t^(m), Ô_t^(m))}_{m=1}^{M}, o_t, g)

The simplest instantiation is score-and-select: compute a task-specific score S for each rollout and pick the argmax:

D_t^* = Â_t^(m*), where m* = argmax_{m ∈ {1,...,M}} S(Â_t^(m), Ô_t^(m) | o_t, g)

What is S? It depends on the task:

Active Recognition: A VLM examines the predicted frames and the real observation together, then answers the recognition query with higher confidence. The score is the VLM's confidence in its answer.
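ImageNav: The score is based on the LPIPS distance between the predicted last frame ô_{t+L}^(m) and the goal image g. Lower LPIPS means the predicted future looks more like the destination, so the plan with the lowest LPIPS receives the highest score (equivalently, S = −LPIPS).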
A-EQA: Same as AR — the VLM answers the question using both real and simulated observations.
Manipulation: A 3D diffusion policy scores candidate gripper trajectories by predicted task completion.
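To make score-and-select concrete, here is a minimal sketch using an ImageNav-style score. Frames are flat lists of floats and `frame_distance` is a hypothetical placeholder for LPIPS (mean absolute difference), so only the selection logic, not the numbers, reflects the benchmark.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]
Plan = Tuple[str, ...]
Rollout = Sequence[Frame]


def frame_distance(a: Frame, b: Frame) -> float:
    """Placeholder for LPIPS: mean absolute difference (lower = more similar)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def imagenav_score(rollout: Rollout, goal: Frame) -> float:
    """Negated distance between the last predicted frame and the goal image."""
    return -frame_distance(rollout[-1], goal)


def score_and_select(
    plans: Sequence[Plan],
    rollouts: Sequence[Rollout],
    score_fn: Callable[[Plan, Rollout], float],
) -> Tuple[int, List[float]]:
    """Return the argmax index m* and the per-plan scores."""
    if not plans or len(plans) != len(rollouts):
        raise ValueError("need one rollout per candidate plan")
    scores = [score_fn(plan, rollout) for plan, rollout in zip(plans, rollouts)]
    best_index = max(range(len(scores)), key=lambda m: scores[m])
    return best_index, scores


goal_image = [0.9, 0.1, 0.9]
plans = [("turn_left", "move_forward"), ("turn_right", "move_forward")]
rollouts = [
    [[0.5, 0.5, 0.5], [0.8, 0.2, 0.9]],  # imagined frames for plan 0
    [[0.5, 0.5, 0.5], [0.1, 0.9, 0.1]],  # imagined frames for plan 1
]
best, scores = score_and_select(
    plans, rollouts, lambda plan, rollout: imagenav_score(rollout, goal_image)
)
```

Passing the score as a callable keeps the selection step identical across tasks; swapping in a VLM-confidence score for Active Recognition would change only `score_fn`.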

Beyond score-and-select: The paper notes that π_revision can do more than just picking the highest-scoring candidate. It can synthesize a new decision by aggregating information across all candidates and their predicted consequences. For example, a VLM might reason: "Plan 1 shows the object is to the left, Plan 2 shows a wall ahead. Therefore I should combine: turn left first, then move forward." This makes the framework strictly more general than classical MPC, which only selects from proposed action sequences.

The Full Loop (Worked Example)

Let's trace one complete cycle for Active Recognition:

t=0: Agent sees a room from an extreme angle. Target object (a lamp) is 80% occluded. VLM confidence: 35%.
Propose: The VLM suggests M=2 plans: Â⁽¹⁾ = [turn_left, turn_left, move_forward, move_forward] and Â⁽²⁾ = [turn_right, move_forward, move_forward, move_forward]. L=4 steps each.
Simulate: World model generates 4 future frames for each plan. Plan 1 reveals the lamp from a better angle. Plan 2 shows a wall.
Revise: VLM examines simulated frames + real observation. With Plan 1's futures, confidence rises to 72%. With Plan 2's, only 40%. Score(Plan 1) > Score(Plan 2).
Execute: Agent executes the first action of Plan 1: turn_left. Environment returns new real observation o₁.
t=1: Re-enter the loop with o₁. The lamp is now partially visible. VLM confidence: 55%. Propose new plans, simulate, revise, execute.
t=3: After 3 steps, VLM confidence exceeds the 95% threshold. Agent outputs its answer: "lamp." Correct!

Without the world model, the base VLM policy just greedily picks the action with highest immediate expected reward at each step. It might wander aimlessly — taking 8 steps instead of 3, or going the wrong direction entirely.
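The propose, simulate, revise, execute cycle traced above can be sketched on a toy problem. Everything here is an illustrative assumption: the "environment" is a 1-D corridor, observations are integer positions, and the two world models are hand-written functions rather than video generators. The point is only to show the loop structure and why an action-blind model hurts.

```python
from typing import Callable, List, Tuple

Action = int  # toy vocabulary: -1 = step left, +1 = step right
Plan = Tuple[Action, ...]
WorldModel = Callable[[int, Plan], List[int]]


def propose(horizon: int) -> List[Plan]:
    """Toy proposal policy: 'keep going left' and 'keep going right'."""
    return [(-1,) * horizon, (+1,) * horizon]


def faithful_model(position: int, plan: Plan) -> List[int]:
    """Controllable world model: imagined states follow the plan's actions."""
    states = []
    for action in plan:
        position += action
        states.append(position)
    return states


def action_blind_model(position: int, plan: Plan) -> List[int]:
    """Uncontrollable world model: ignores the plan and always imagines
    drifting right, like a generator falling back on its training prior."""
    return [position + k for k in range(1, len(plan) + 1)]


def run_episode(
    world_model: WorldModel, start: int, goal: int, budget: int = 10, horizon: int = 3
) -> bool:
    """Closed loop: observe, propose, simulate, revise, execute, repeat."""
    position = start
    for _ in range(budget):
        if position == goal:                                   # observe
            return True
        plans = propose(horizon)                               # propose
        rollouts = [world_model(position, p) for p in plans]   # simulate
        # Revise: score each plan by its closest imagined approach to the goal.
        scores = [-min(abs(s - goal) for s in r) for r in rollouts]
        best = max(range(len(plans)), key=lambda m: scores[m])
        position += plans[best][0]      # execute first action in the real env
    return position == goal


faithful_success = run_episode(faithful_model, start=0, goal=3)
blind_success = run_episode(action_blind_model, start=0, goal=3)
```

With the action-blind model every plan produces the same imagined future, so the scores tie and the agent cannot tell good plans from bad ones. Success is decided by what happens in the real environment, not by how convincing the rollouts are.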

In the closed-loop planning framework, what makes the revision policy more general than classical Model Predictive Control (MPC)?

• MPC is restricted to selecting from proposed action sequences; the revision policy can also synthesize new decisions by aggregating information across all candidates and their predicted consequences
• The revision policy uses a neural network while MPC uses analytical optimization
• MPC cannot handle visual observations

Chapter 3: The Unified Action API

Here is the problem the action API solves: the agent speaks one language ("turn left, move forward 0.6m"), but each world model speaks a different language. Some want text prompts. Some want camera trajectories as (x, y, φ) tuples. Some want raw low-level action codes. Without a translator, you would need to rewrite the entire planning pipeline for every world model you want to benchmark.

The unified action API C is that translator. It maps the agent's action sequence A into the control inputs I that the specific world model expects:

I = C(A)

This single interface lets the same agent, same proposal policy, and same revision policy work with any world model. Swap in SVD, Wan2.1, Cosmos-P2, Runway Gen4, or any future model — only C changes, not the planning logic.

Three Control Modalities

C supports three output formats, matched to the three types of conditioning that current world models accept:

1. Text Prompts (Image-and-Text-to-Video Models)

For models like Wan2.1, LTX-Video, and Hunyuan, the controller converts each primitive action into a descriptive phrase using a predefined template, then concatenates them:

Worked example — text prompt generation:
Agent's action sequence: [move_forward, move_forward, turn_left]
Template mapping: move_forward → "The camera moves forward by 0.2m"
turn_left → "The camera rotates left by 22.5°"
Concatenated prompt I_text: "The camera moves forward by 0.2m. The camera moves forward by 0.2m. The camera rotates left by 22.5°."

This is fed to the video diffusion model alongside the current observation image o_t. The model generates L frames conditioned on this text description of the intended motion.

The precision problem is immediately visible: "moves forward by 0.2m" is semantically clear to a human, but the video model has no grounding for what 0.2m looks like in this specific scene. It may generate a plausible-looking forward motion that is actually 0.5m or 0.1m. This is why text-conditioned models tend to have lower controllability.
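The template-and-concatenate step from the worked example can be sketched in a few lines. The `move_forward` and `turn_left` phrases are taken from the example above; the `turn_right` phrase is an assumption obtained by mirroring `turn_left`, and the function name is illustrative.

```python
from typing import Dict, Sequence

# Phrase templates per primitive action. The turn_right wording is assumed.
TEMPLATES: Dict[str, str] = {
    "move_forward": "The camera moves forward by 0.2m.",
    "turn_left": "The camera rotates left by 22.5°.",
    "turn_right": "The camera rotates right by 22.5°.",
}


def actions_to_prompt(actions: Sequence[str]) -> str:
    """Map each primitive action to its phrase and join the phrases in order."""
    unknown = [a for a in actions if a not in TEMPLATES]
    if unknown:
        raise ValueError(f"no text template for actions: {unknown}")
    return " ".join(TEMPLATES[a] for a in actions)


prompt = actions_to_prompt(["move_forward", "move_forward", "turn_left"])
```

Note that nothing in this string carries scene scale: the conversion is lossless on the agent's side, and the ambiguity arises only when the video model interprets it.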

2. Camera Trajectory / Viewpoint (3D-Aware Models)

For models like SE3DS, PathDreamer, and NWM (Navigation World Model) that explicitly consume camera poses, the controller translates each action into a geometric transformation:

Worked example — camera trajectory:
Agent's action sequence: [move_forward, turn_right, move_forward]
Each move_forward: translate camera by (0.2, 0, 0) in the current heading direction.
Each turn_right: rotate azimuth by -22.5°.
Resulting trajectory: [(x₀, y₀, φ₀), (x₁, y₁, φ₀), (x₁, y₁, φ₀-22.5°), (x₁+0.2cos(φ₀-22.5°), y₁+0.2sin(φ₀-22.5°), φ₀-22.5°)], where (x₁, y₁) = (x₀+0.2cosφ₀, y₀+0.2sinφ₀) is the position after the first move_forward.

Formally: {(x_k, y_k, φ_k)}_{k=1}^{K} with (x_k, y_k) ∈ R² and azimuth φ_k ∈ R.

Camera-conditioned models have an inherent advantage for controllability: the geometric transformation is exact. There is no ambiguity about what "0.2m forward" means — it is a precise translation vector. The model may still hallucinate scene content, but the camera motion itself is accurately specified.
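The pose arithmetic in the worked example can be written out as a short sketch. It assumes the 0.2m step and 22.5° turn used above, an azimuth measured in degrees with left turns positive, and a planar (x, y, φ) pose; the function name is illustrative.

```python
import math
from typing import List, Sequence, Tuple

Pose = Tuple[float, float, float]  # (x in metres, y in metres, azimuth in degrees)

STEP_M = 0.2     # forward translation per move_forward
TURN_DEG = 22.5  # rotation per turn_left / turn_right


def actions_to_trajectory(
    actions: Sequence[str], start: Pose = (0.0, 0.0, 0.0)
) -> List[Pose]:
    """Integrate primitive actions into a list of camera poses (start included)."""
    x, y, phi = start
    poses: List[Pose] = [(x, y, phi)]
    for action in actions:
        if action == "move_forward":
            x += STEP_M * math.cos(math.radians(phi))
            y += STEP_M * math.sin(math.radians(phi))
        elif action == "turn_left":
            phi += TURN_DEG
        elif action == "turn_right":
            phi -= TURN_DEG
        else:
            raise ValueError(f"unsupported action: {action}")
        poses.append((x, y, phi))
    return poses


trajectory = actions_to_trajectory(["move_forward", "turn_right", "move_forward"])
```

The output has one pose per action plus the starting pose, which matches the four-entry trajectory in the worked example.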

3. Low-Level Actions (Action-Conditioned Models)

For models like SVD† (post-trained) that directly consume discrete or continuous actions, the controller maps the agent's action vocabulary to the world model's action vocabulary:

A ↦ A_world

This mapping handles vocabulary mismatches: the agent might use {forward, left, right, stop} while the model was trained with {action_0, action_1, action_2, action_3}. The API maintains a bijection between the two vocabularies.

For robotic manipulation, the mapping is more complex: the agent proposes a 7-DoF gripper trajectory [(x, y, z, roll, pitch, yaw, grip)]_1:L, and the API may need to discretize or re-parameterize this to match the world model's expected format.
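For the discrete case, the vocabulary mapping amounts to a lookup table with a checked inverse. The sketch below assumes a hypothetical index assignment that follows the order in which the two vocabularies are listed above; a real model's indices would come from its training configuration.

```python
from typing import Dict, List, Sequence

# Hypothetical agent-to-model index assignment (an assumption, not the
# paper's actual encoding).
AGENT_TO_MODEL: Dict[str, int] = {
    "move_forward": 0,
    "turn_left": 1,
    "turn_right": 2,
    "stop": 3,
}


def build_inverse(mapping: Dict[str, int]) -> Dict[int, str]:
    """Invert the mapping, refusing tables that are not one-to-one."""
    inverse = {index: name for name, index in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapping is not a bijection: duplicate model indices")
    return inverse


MODEL_TO_AGENT = build_inverse(AGENT_TO_MODEL)


def encode(actions: Sequence[str]) -> List[int]:
    """Agent vocabulary -> world-model action indices."""
    return [AGENT_TO_MODEL[a] for a in actions]


def decode(indices: Sequence[int]) -> List[str]:
    """World-model action indices -> agent vocabulary."""
    return [MODEL_TO_AGENT[i] for i in indices]
```

Checking the inverse at construction time catches the silent failure mode where two agent actions collapse onto the same model index.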

Why the action API is a key contribution: Without it, every (world model, task) pair requires custom engineering. With N world models and K tasks, that is N×K integration efforts. The unified API reduces this to N + K: implement C once per model (N adaptors), implement the planning loop once per task (K configurations), and any model works with any task. This is what makes the benchmark scalable to future models.

The Full Translation Pipeline

Agent Action

A = [move_forward, turn_left, move_forward, move_forward]
From the agent's discrete action vocabulary V

↓

API Dispatch

C detects the target world model type and selects the appropriate conversion path: text → I_text, camera → I_cam, or low-level → I_action

↓

Control Input

For Wan2.1 (text): "The camera moves forward by 0.2m. The camera rotates left by 22.5°..."
For SE3DS (camera): {(0,0,0°), (0.2,0,0°), (0.2,0,22.5°), ...}
For SVD† (action): [0, 1, 0, 0] (action indices, with forward ↦ 0 and left ↦ 1 as in the vocabulary example above)

↓

World Model

g_θ(O | o_t, I) produces L predicted future frames
Same observation o_t, same intended actions, different model — directly comparable outputs

Why do camera-trajectory-conditioned models tend to have higher controllability than text-prompted models?

• Camera trajectories specify the exact geometric transformation (translation + rotation) with no ambiguity, while text descriptions like "move forward 0.2m" lack grounding in the specific scene; the model may generate any plausible-looking motion
• Camera trajectory models are always larger and more expressive
• Text-conditioned models cannot generate more than one frame

Chapter 4: The Four Embodied Tasks

World-In-World tests world models on four tasks, each stressing a different capability. Together they cover the full spectrum of embodied intelligence: perception, navigation, language reasoning, and manipulation.

Why four tasks, not one? A world model might be excellent at predicting what is around the corner (helping perception) but terrible at predicting how objects move when pushed (failing at manipulation). A single task would give a misleadingly narrow picture. Four tasks spanning perception, navigation, reasoning, and physical interaction force a model to demonstrate general utility.

Task 1: Active Recognition (AR)

Environment: Habitat-Sim on Matterport3D scenes (29 scenes, 551 episodes).

Goal: Identify a designated target object that is heavily occluded or viewed from an extreme angle. The agent can move to get a better view.

Action space: Navigation primitives: {move_forward, turn_left, turn_right, stop}.

Budget: K = 10 decision steps maximum.

Observation: RGB image (panoramic + front view) with the target marked by a red bounding box.

How the world model helps (two ways):

Perception enhancement: The VLM sees both the real observation AND the simulated future views when answering "what is this object?" Synthetic views from different angles provide additional evidence that helps resolve ambiguity from occlusion.
Planning enhancement: Before committing to move_forward or turn_left, the agent simulates both options. The rollout that reveals more of the target object gets a higher score.

Metrics: Success Rate (SR) = fraction of episodes where the final predicted label matches the ground-truth label. Mean Trajectory Length = average number of steps before the agent makes its final prediction or exhausts the budget K.

Concrete numbers: The best proprietary model (Runway Gen4) achieves 64.79% SR with a mean of 4.06 steps. The best open-source post-trained model (Wan2.1†) achieves 62.98% SR with a mean of 4.71 steps. The VLM base policy without any world model achieves only 50.27% SR with a mean of 6.24 steps. Adding a world model therefore improves SR by 12.7 percentage points with Wan2.1† (14.5 with Runway Gen4), and the agent reaches its answer about 1.5 to 2.2 steps sooner.

Task 2: Image-Goal Navigation (ImageNav)

Environment: Habitat-Sim on HM3D scenes (87 scenes, 144 episodes).

Goal: Navigate to the location from which a given goal image was captured.

Action space: Same navigation primitives.

Observation: RGB image (current view).

Goal input: A single reference RGB image showing the target viewpoint.

How the world model helps: The agent simulates candidate navigation plans and compares each predicted final frame against the goal image using LPIPS. The plan whose predicted outcome looks most like the goal image is selected. This is pure perceptual planning — the world model acts as a visual lookahead.

Metrics: SR = does the agent reach within a threshold distance of the goal? SPL = Success weighted by Path Length (penalizes inefficient paths). Mean Trajectory Length.
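For readers unfamiliar with SPL, the sketch below computes SR and SPL the way SPL is commonly defined for navigation benchmarks: each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path, and failures contribute zero. It assumes World-In-World follows this standard definition; the `Episode` record and its field names are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool          # did the agent stop within the goal threshold?
    shortest_path: float   # geodesic distance from start to goal
    agent_path: float      # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of successful episodes."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by (normalized inverse) Path Length."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),  # optimal
    Episode(success=True, shortest_path=10.0, agent_path=20.0),  # 2x detour
    Episode(success=False, shortest_path=10.0, agent_path=5.0),  # gave up early
]
```

On these three made-up episodes, two of three succeed, but the detour halves the second episode's contribution, so SPL comes out lower than SR.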

Concrete numbers: Best post-trained model (Wan2.1†) achieves 45.14% SR with mean path 45.8 steps. VLM base policy without world model: 35.42% SR with mean 47.5 steps. Shorter path AND higher success rate with the world model.

Task 3: Active Embodied Question Answering (A-EQA)

Environment: Habitat-Sim on HM3D scenes (54 scenes, 184 questions from OpenEQA).

Goal: Answer open-ended natural language questions (e.g., "How many cushions are on the red sofa?") by actively exploring a 3D environment.

Action space: Navigation primitives.

Observation: RGB + panoramic view.

How the world model helps: Both perception and planning, as in AR. For answering: simulated views help the VLM see objects from complementary angles to resolve references to occluded or distant objects. For navigation: rollouts guide the agent to explore views likely to reveal question-relevant information.

Metrics: Answer Score (GPT-judged answer quality on a 0-1 scale, reported below as a percentage), Mean Trajectory Length, SPL.

Concrete numbers: Best model (Wan2.2† A14B) achieves answer score 48.4 and SPL 31.9, surpassing the VLM base policy at 45.7 answer score and 29.6 SPL.

Task 4: Robotic Manipulation

Environment: CoppeliaSim on RLBench tasks (4 tasks, 50 episodes each).

Goal: Control a 7-DoF robotic arm to complete manipulation tasks like "slide the red block onto the blue target."

Action space: Continuous 7-DoF gripper commands [(x, y, z, roll, pitch, yaw, grip)].

Observation: Third-person RGB images of the workspace.

How the world model helps: The agent generates candidate gripper trajectories, the world model predicts the visual outcome of each trajectory (how objects will move), and the agent selects the trajectory most likely to achieve the specified objective. This requires the world model to understand physical dynamics — contact, friction, sliding.

Metrics: SR = task completion rate. Mean Trajectory Length.

The manipulation gap: World models struggle here. The best model (SVD†) achieves only 46.5% SR, barely above the VLM baseline at 44.5%. Why? Manipulation requires modeling contact-rich interactions, compliance, friction, and deformable objects. Current video generators trained on web video have almost no training signal for these physics. Predicting "what happens when a gripper pushes a block" is fundamentally harder than predicting "what does the hallway look like from 2 meters ahead." This is the clearest gap identified by the benchmark.

Why do world models provide much less improvement for robotic manipulation compared to active recognition and navigation?

• Manipulation requires modeling contact-rich physical interactions (friction, compliance, object dynamics) that web-video-trained generators have almost no training signal for; predicting how objects move when pushed is fundamentally harder than predicting scene appearance from a new viewpoint
• Manipulation tasks use a smaller action space
• The manipulation environment renders at lower resolution

Chapter 5: The Post-Training Recipe

Off-the-shelf video generators like Wan2.1 produce visually appealing clips, but they are driven by text prompts and have limited fine-grained low-level control. Without adaptation, they yield only small gains on downstream embodied tasks. The post-training recipe fixes this by fine-tuning pretrained video generators on action-observation data from the target environment.

The core idea: take a powerful visual prior (from web-scale video pretraining) and teach it to respond precisely to embodied actions. The generator keeps its ability to produce realistic scenes but gains the controllability needed for closed-loop planning.

Where the Post-Training Data Comes From

The data is collected from the same simulators used for evaluation, but from disjoint scenes — the training and evaluation environments never overlap. This ensures the world model learns generalizable action-conditioned representations, not memorized scene layouts.

Habitat-Sim tasks (AR, A-EQA, ImageNav): Post-train on the HM3D training split. Each data point is a panoramic action-observation pair: (o_t, a_t, o_t+1) where o_t is the current panoramic observation, a_t is the navigation action executed, and o_t+1 is the resulting observation.
CoppeliaSim tasks (Manipulation): Post-train on RLBench task demonstrations. Each data point is (observation frame, gripper action, next observation frame).

Data format (detailed): For Habitat-Sim, each training example consists of:
1. A panoramic input image (equirectangular projection, capturing 360° around the agent)
2. An action label from the discrete navigation vocabulary
3. The resulting panoramic observation after executing that action

The model learns to predict item 3 (the resulting observation) given items 1 and 2 (the input image and the action label). During post-training, the panoramic images are converted to perspective views matching the same field-of-view used during evaluation. This panorama-to-perspective conversion introduces a resolution loss that can slightly degrade generation quality (Table 4 in the paper).

The Post-Training Procedure

The recipe is straightforward LoRA fine-tuning of the video diffusion model:

Freeze the pretrained video generator's main weights.
Add LoRA adaptors to the attention layers (low-rank matrices that modify the model's behavior with minimal parameter overhead).
Train on the action-observation dataset for 1 epoch. The loss is the standard video diffusion denoising objective, but now conditioned on action inputs instead of arbitrary text.
No additional tricks — no reward models, no reinforcement learning, no data augmentation beyond standard image preprocessing.
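The freeze-then-adapt idea in the procedure above can be illustrated with the standard LoRA formulation, where a frozen weight W is augmented by a low-rank update so that the effective weight is W + (α/r)·B·A. This is a toy sketch on plain Python lists: the rank, α, and the deterministic initialization of A are assumptions for illustration, and the paper's actual hyperparameters are not given here.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Linear layer with a frozen weight and a trainable low-rank update."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0) -> None:
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen: never modified during post-training
        # A (rank x in_dim): small non-zero values, a deterministic stand-in
        # for the usual random initialization.
        self.a: Matrix = [
            [0.01 * (i + j + 1) for j in range(in_dim)] for i in range(rank)
        ]
        # B (out_dim x rank): zeros, so the adaptor starts as a no-op.
        self.b: Matrix = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def effective_weight(self) -> Matrix:
        delta = matmul(self.b, self.a)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x: Vector) -> Vector:
        return [
            sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()
        ]


layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1)
before = layer.forward([2.0, 3.0])  # B is zero, so this equals the frozen output
layer.b = [[1.0], [0.0]]            # pretend one training step updated B
after = layer.forward([2.0, 3.0])
```

Because B starts at zero, the adapted model initially reproduces the pretrained generator exactly, and training only has to learn the action-conditioned correction on top of it.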

Why post-training works so well (the authors' hypothesis): Pretrained video generators already have a powerful visual prior — they know what indoor scenes, hallways, kitchens, and object surfaces look like. What they lack is grounding: the ability to translate "move forward 0.2m" into a specific, metrically accurate camera motion in the current scene. Post-training provides exactly this grounding. A modest amount of action-conditioned data (even just 4K examples) is enough to significantly improve controllability, because the model only needs to learn the action-to-motion mapping, not how to generate scenes from scratch.

Frozen vs. Trained: What Changes?

This is the architecture of a post-trained world model:

FROZEN

Video U-Net / DiT backbone (Wan2.1, SVD, etc.) — pretrained on web-scale video
Contains the visual prior: scene geometry, appearance, temporal consistency
Parameters: 1.5B-14B depending on model

↓

TRAINED

LoRA adaptors on attention layers — fine-tuned on action-observation data
Learns: action grounding, metric accuracy, environment-specific dynamics
Parameters: ~0.5% of total (minimal overhead)

↓

FROZEN

VAE decoder — converts latents back to pixel space
Unchanged from pretraining

The elegant part: the frozen backbone provides rich visual understanding, while the tiny LoRA adaptor provides action controllability. This separation means post-training is cheap — 1 epoch on 40K examples is sufficient — and the visual quality from pretraining is largely preserved.

Cross-Domain Transfer

The paper also tests whether post-training transfers across scene distributions. Models post-trained on the synthetic Habitat Synthetic Scenes Dataset (HSSD) and tested on real-scanned HM3D/MP3D scenes still show clear gains:

SVD† post-trained on HSSD: 58.98% AR SR (vs. 50.27% VLM baseline)
Wan2.1† post-trained on HSSD: 62.98% AR SR
Best in-domain (HM3D train): SVD† achieves 60.98% AR SR, 43.05% ImageNav SR

The synthetic-to-real gap exists but is smaller than you might expect. Post-training learns action-conditioned visual representations that transfer — consistent with prior work on adaptable world models (Gao et al., 2025).

Post-Training Effect on Controllability

Interactive figure: controllability (1 − LPIPS, how closely the predicted views follow the intended actions) and task success rate, plotted against the number of post-training examples.

Why can a small amount of post-training data (even 4K examples) dramatically improve a pretrained video generator's embodied performance?

The pretrained model already has a powerful visual prior (scene appearance, geometry, temporal consistency); it only needs to learn the action-to-motion grounding, which is a much simpler mapping than generating scenes from scratch
4K examples are enough to memorize all possible scenes
Post-training replaces the entire model architecture with a task-specific one

Chapter 6: Benchmark Results

The paper evaluates 6+ video generators, 3+ task-focused world models, and their post-trained variants across all four tasks. Let's walk through the results systematically.

The Leaderboard at a Glance

Here are the key numbers from Tables 1-3, grouped by insight:

Finding 1: World Models Consistently Help

Across AR, A-EQA, ImageNav, and Manipulation, adding a visual world model consistently improves the performance of the base proposal policy. This holds for every world model tested.

Base policy performance (no world model) vs. best world model augmented:
• AR: 50.27% → 64.79% (+14.52 pp) with Runway Gen4
• ImageNav: 35.42% → 45.14% (+9.72 pp) with Wan2.1†
• A-EQA: Answer score 45.7 → 48.4 (+2.7) with Wan2.2† A14B
• Manipulation: 44.5% → 46.5% (+2.0 pp) with SVD† (modest gain)

The improvement is largest for perception-heavy tasks (AR) and smallest for manipulation. Mean trajectory length also improves — agents reach their goals in fewer steps.

Finding 2: The Model Zoo

The paper benchmarks these models (grouped by type):

Image generators (single-frame prediction):

PathDreamer (Koh et al., 2021): Viewpoint-conditioned. Uses image inpainting to generate novel views. AR SR: 56.99%.
SE3DS (Koh et al., 2023): RGB-D + panoramic viewpoint conditioning. AR SR: 57.53%.

Video generators (multi-frame prediction, text- and/or image-conditioned):

SVD (Blattmann et al., 2023): Stable Video Diffusion, 1.5B params, image-conditioned. AR SR: 57.89% (zero-shot).
LTX-Video (HaCohen et al., 2024): 2B params, text+image. AR SR: 56.08%.
Hunyuan (Kong et al., 2024): 13B params, text+image. AR SR: 57.71%.
Wan2.1 (Wan et al., 2025): 1.3B/5B/14B params, text+image. AR SR: 55.35% (5B zero-shot), 62.98% (14B post-trained).
Wan2.2 5B / A14B: Improved Wan with mixture-of-experts. AR SR: 59.53% (A14B zero-shot).
Cosmos-Predict2 (Agarwal et al., 2025): 2B params. AR SR: 55.35%.
Runway Gen4 (proprietary): Highest quality. AR SR: 64.79% (but proprietary, so no post-training).

Navigation world models (viewpoint/trajectory-conditioned):

NWM (Bar et al., 2025): Navigation-specialized, trajectory input. AR SR: 57.35%. 1B params.

Finding 3: Size Isn't Everything

Wan2.2 A14B (14B active parameters, mixture-of-experts) achieves 59.53% AR SR in zero-shot mode. Wan2.1† (1.3B parameters, post-trained) achieves 62.98%. The smaller post-trained model beats the larger zero-shot model by 3.45 percentage points. This is the paper's most striking result about post-training: adaptation beats scale in this regime.

Why adaptation beats scale: The 14B model has more parameters but no exposure to action-conditioned data. It generates beautiful hallways, but when told "turn left 22.5°," it may generate a visually plausible scene that corresponds to turning 45° or not turning at all. The 1.3B post-trained model has seen thousands of (action, resulting view) pairs and has learned precise action-to-motion mappings. Its predictions are less pretty but more accurate — and accuracy is what matters for planning.

Finding 4: Front View vs. Panorama

Table 4 compares models post-trained on front-view vs. panoramic input. Panoramic input provides a 360° field of view, giving richer spatial context. Surprisingly, panoramic input does not consistently improve performance across all settings:

AR with SVD†: Panorama SR = 60.08% vs. Front View SR = 57.89%. Panorama wins.
ImageNav with SVD†: Panorama SR = 43.05% vs. Front View SR = 38.19%. Panorama wins.
AR with Wan2.1†: Panorama SR = 62.25% vs. Front View SR = 62.61%. Front view slightly wins(!)

The likely explanation: converting panoramic equirectangular images to the perspective field-of-view used during evaluation introduces a resolution loss. The resolution degradation sometimes outweighs the benefit of the richer spatial context.

Model Leaderboard: Success Rate Across Tasks

Interactive figure: each row is a world model, the bars show its success rate on each task, and post-trained models are highlighted.

Wan2.1† (1.3B, post-trained) outperforms Wan2.2 A14B (14B, zero-shot) on Active Recognition. What does this tell us about the relative value of scale vs. adaptation for embodied world models?

Action-conditioned post-training (adaptation to the target action space) is more effective than scaling model size alone, because the smaller model has learned precise action-to-motion mappings that the larger model lacks despite its superior visual generation ability
Smaller models always outperform larger ones on embodied tasks
The 14B model was not properly evaluated

Chapter 7: Scaling Laws

World-In-World presents the first scaling laws for world models in embodied settings. Two axes of scaling are studied: training-time data scaling (how much post-training data?) and inference-time compute scaling (how many rollouts per decision step?).

Data Scaling: More Post-Training Data = Better Performance

The paper post-trains three models (Wan2.1†, Wan2.2†, SVD†) on datasets of varying size: 400, 2K, 4K, 20K, 40K, and 80K action-observation examples. Each model is post-trained for exactly 1 epoch (the total compute scales linearly with dataset size).

The results (Figure 6) show a clear, consistent upward trend:

Wan2.1†: AR SR rises from 60.25% (400 examples) to 63.34% (80K examples). That is +3.09 pp from 200x more data.
SVD†: AR SR rises from 56.44% (400 examples) to 60.98% (80K examples). That is +4.54 pp.
Wan2.2† (A14B): Starting from a much larger web-pretrained base, it reaches nearly the same performance as Wan2.1† after only 40K examples. Larger models benefit more from smaller amounts of post-training data, but saturate faster.
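
One way to read the Wan2.1† and SVD† numbers above is as an average gain per doubling of the post-training set. The sketch below does that arithmetic with the figures quoted in the list. The log-linear averaging is a simplifying assumption made here, not a fit reported in the paper, and the function name is hypothetical.

```python
import math


def pp_per_doubling(sr_small, sr_large, n_small, n_large):
    """Average SR gain (percentage points) per doubling of the dataset size."""
    doublings = math.log2(n_large / n_small)
    return (sr_large - sr_small) / doublings


# AR success rates at 400 and 80K examples, as quoted in the list above.
wan = pp_per_doubling(60.25, 63.34, 400, 80_000)
svd = pp_per_doubling(56.44, 60.98, 400, 80_000)

print(f"Wan2.1†: {wan:.2f} pp per doubling")  # 0.40
print(f"SVD†:    {svd:.2f} pp per doubling")  # 0.59
```

Going from 400 to 80K examples is about 7.6 doublings, so each doubling buys roughly half a percentage point on average. This is an average only and says nothing about where along the curve the gains are concentrated.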

The diminishing-returns pattern: The steepest improvement happens in the first 4K examples. From 0 to 4K, the model learns the basic action-to-motion mapping (which direction is "left"? what does "0.2m forward" look like?). From 4K to 80K, it refines this mapping across diverse scenes and edge cases. This matches the intuition from Chapter 5: the visual prior is already strong from pretraining; the model just needs to calibrate its action grounding.

Crucially, no model has saturated at 80K. The curves are still rising. More post-training data would likely continue to improve performance. This is the "first data scaling law for world models in embodied settings."

Inference-Time Scaling: More Rollouts = Better Decisions

The second scaling axis is the number of world-model inferences (rollouts) per decision step. In the proposal-simulate-revise framework, the agent can generate M candidate plans and simulate L steps for each. More rollouts = more candidate futures to choose from = better decisions.

The paper varies the average number of world-model inferences per episode and plots SR (Figure 7):

Wan2.1†: AR SR rises from ~57% (3 inferences/episode average) to 63.47% (11 inferences/episode). That is roughly +6.5 pp from about 3.7x more inference compute.
SVD†: AR SR rises from 53.36% to 60.98% over the same range. That is +7.62 pp.

The compute-performance tradeoff: Each additional rollout costs one full sampling run of the video diffusion model (typically 20-50 denoising steps, each a forward pass through the backbone). For SVD (1.5B params), each rollout takes ~2-5 seconds on an A100. For Wan2.1 (5B params), ~5-15 seconds. Increasing from 3 to 11 inferences therefore means ~3.7x more world-model wall-clock time.

But the improvement is substantial: +7 pp of SR from more thinking time. This is the embodied-world-model analogue of inference-time compute scaling in language models (more CoT tokens = better reasoning). The agent literally benefits from "thinking harder" — imagining more possible futures before acting.

Putting Both Scaling Laws Together

The two scaling axes are complementary:

Data scaling improves the quality of each individual rollout. Better post-training = more faithful predictions per rollout.
Inference-time scaling improves the coverage of the plan space. More rollouts = the agent explores more candidate futures per step.

The optimal strategy combines both: post-train with as much action-observation data as available, then allocate as much inference-time compute as the time budget allows. Neither axis alone is sufficient — a poorly trained model benefits less from more rollouts (garbage in, garbage out), and a well-trained model with too few rollouts may miss the best plan.

Worked example — inference-time compute tradeoff:
Scenario: Active Recognition episode, budget K=10 steps. SVD† takes 3 seconds per rollout.

Option A: M=2 rollouts per step. Per-step time: 6s. Total: 60s. SR: ~53%.
Option B: M=6 rollouts per step. Per-step time: 18s. Total: 180s. SR: ~58%.
Option C: M=11 rollouts per step. Per-step time: 33s. Total: 330s. SR: ~61%.

Going from 2 to 11 rollouts takes 5.5x as long (330s vs. 60s) and gains about 8 pp SR. Depending on the application (real-time robot vs. offline planning), this tradeoff may or may not be worth it. But the existence of the scaling law (more compute reliably buys more success) is itself a significant finding.
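
The arithmetic in the worked example above can be reproduced in a few lines. The sketch uses the scenario's own numbers (3 seconds per rollout, K=10 steps, and the approximate SR values for options A to C). The "pp per extra minute" figure is an extra illustration added here, and the function name is hypothetical.

```python
def episode_seconds(rollouts_per_step, seconds_per_rollout=3.0, steps=10):
    """Total world-model time for one episode."""
    return rollouts_per_step * seconds_per_rollout * steps


# (option, rollouts per step M, approximate SR in %) from the worked example.
options = [("A", 2, 53.0), ("B", 6, 58.0), ("C", 11, 61.0)]
totals = {name: episode_seconds(m) for name, m, _ in options}

for (n0, _, sr0), (n1, _, sr1) in zip(options, options[1:]):
    extra_min = (totals[n1] - totals[n0]) / 60.0
    gain = sr1 - sr0
    print(f"{n0} -> {n1}: +{gain:.1f} pp for +{extra_min:.1f} min "
          f"({gain / extra_min:.2f} pp per extra minute)")

print(f"C vs. A time ratio: {totals['C'] / totals['A']:.1f}x")  # 5.5x
```

With these numbers the step from A to B buys 2.5 pp per extra minute, while the step from B to C buys 1.2 pp per extra minute, which is the diminishing-returns pattern in miniature.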

Dual Scaling Laws: Data + Inference

Interactive figure. Left: data scaling (SR vs. post-training examples). Right: inference-time scaling (SR vs. rollouts per step).

The paper shows two independent scaling laws. What does each one improve, and why are they complementary?

Data scaling improves the quality (faithfulness) of each individual rollout, while inference-time scaling improves the coverage of the plan space; they are complementary because a well-trained model needs enough rollouts to find the best plan, and more rollouts are worthless if the rollouts themselves are unfaithful
Data scaling increases model parameters, inference-time scaling increases batch size
Both scaling laws only apply to manipulation tasks

Chapter 8: Ablations & Design Choices

The paper includes several important ablation studies that reveal which design decisions matter most. Let's walk through each one.

Ablation 1: Controllability vs. Visual Quality

This is the flagship finding. The paper plots SR against two metrics:

Generation quality = average of aesthetic predictor score (Akio Kodaira, 2024) and image quality predictor (Ke et al., 2021), both trained to match human preferences. This measures how "pretty" the generated frames look.
Controllability = 1 − LPIPS between ground-truth observations and model predictions. This measures how accurately the model's predictions reflect the intended actions.

Figure 5(a) shows SR vs. generation quality: weak, noisy correlation. R² is low. Models with similar quality scores have widely varying success rates.

Figure 5(b) shows SR vs. controllability: much tighter correlation. Models that faithfully translate actions into visual predictions consistently achieve higher SR. Post-trained models cluster in the high-controllability, high-SR region.
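
For readers who want to reproduce this kind of analysis, here is a minimal sketch of the two computations involved: turning LPIPS distances into a controllability score, and measuring how much of the variation in SR a metric explains (R² of a simple linear fit). It assumes the LPIPS distances have already been computed elsewhere, since real LPIPS needs a pretrained network. The function names are hypothetical and the inputs in the usage lines are arbitrary toy values, not results from the paper.

```python
def controllability(lpips_distances):
    """Controllability = 1 - mean LPIPS between predictions and ground truth."""
    return 1.0 - sum(lpips_distances) / len(lpips_distances)


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a one-variable linear fit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Toy inputs only: two LPIPS distances, then one perfectly linear
# relationship and one with no linear relationship at all.
print(controllability([0.2, 0.4]))
print(r_squared([1, 2, 3], [2, 4, 6]))
print(r_squared([1, 2, 3, 4], [1, -1, -1, 1]))
```

In the paper's setting, xs would be one metric per model (quality or controllability) and ys the matching success rates. A tight relationship gives R² near 1 and a noisy one gives R² near 0.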

The interpretation: When you post-train a model, two things change. Visual quality may stay the same or even decrease slightly (because LoRA adaptation can slightly degrade the pretrained generation ability). But controllability increases dramatically (because the model learns precise action-to-motion mappings). The fact that SR correlates with controllability, not quality, tells us that controllability is the bottleneck for embodied utility. Current models are already "good enough" visually; what they lack is the ability to faithfully execute the agent's intended actions in their predictions.

Ablation 2: Effect of Revision Policy

The revision policy π_revision scores candidate plans. The paper compares two revision strategies for ImageNav:

VLM-based revision: A vision-language model selects the candidate whose predicted frames best match the goal. SR: 43.05%, SPL: 30.96 (no world model).
LPIPS-based revision: Select the candidate whose predicted final frame has the lowest LPIPS distance to the goal image. Much simpler, no VLM needed.

Results (Table 5): Augmenting the planner with an action-conditioned world model AND applying a simple LPIPS-based revision policy yields the best results. SVD† with LPIPS revision achieves 47.92% SR and 39.82 SPL, compared to 43.05% SR and 30.96 SPL with VLM-only revision and no world model. Even SVD† with VLM revision achieves 45.14% SR.

The takeaway: A simple perceptual similarity metric (LPIPS) can be a better revision policy than a complex VLM for tasks where the goal is defined by an image. The LPIPS revision policy is also much faster (no VLM forward pass), making it practical for real-time applications. This demonstrates that the revision policy design space is rich and task-dependent.
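
The distance-based revision rule is short enough to sketch directly. The version below picks the candidate whose final predicted frame is closest to the goal image. To stay self-contained it uses mean absolute pixel difference as a stand-in distance. That is an assumption made purely for illustration and is not LPIPS, which compares deep network features. The function names and the tiny 2x2 "images" are hypothetical.

```python
def mean_abs_distance(img_a, img_b):
    """Stand-in for LPIPS: mean absolute pixel difference (NOT perceptual)."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(img_a, img_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def select_plan(candidates, goal_image, distance=mean_abs_distance):
    """candidates: list of (plan, predicted_frames) pairs.

    Returns the plan whose last predicted frame is closest to the goal.
    """
    best_plan, _ = min(candidates, key=lambda c: distance(c[1][-1], goal_image))
    return best_plan


goal = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    (["turn_left", "forward"], [[[0.5, 0.5], [0.5, 0.5]], [[0.0, 1.0], [1.0, 0.0]]]),
    (["forward", "forward"], [[[0.5, 0.5], [0.5, 0.5]], [[0.9, 0.1], [0.0, 1.0]]]),
]
print(select_plan(candidates, goal))  # ['forward', 'forward']
```

Swapping in a real perceptual metric only requires passing a different `distance` callable. The selection logic itself stays the same.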

Ablation 3: World Model Augmentation vs. Better Policy

A natural question: would it be better to skip the world model entirely and invest in a better base policy? Table 5 addresses this for ImageNav:

VLM base policy (no world model): 35.42% SR
VLM + SVD† world model (VLM revision): 45.14% SR (+9.72 pp)
VLM + Wan2.1† (VLM revision): 45.14% SR
VLM + SVD† (LPIPS revision): 47.92% SR (+12.50 pp)

Adding a world model gives a boost of roughly 10 to 12.5 pp. To get the same improvement from a better base policy alone (without imagined rollouts), you would need a dramatically more capable VLM, and it's not clear such a model exists.

Ablation 4: Cross-Domain Transfer

Models post-trained on synthetic Habitat Synthetic Scenes Dataset (HSSD) and evaluated on real-world scanned HM3D/MP3D scenes:

SVD† (HSSD post-train) → HM3D/MP3D eval: 58.98% AR SR, 38.89% ImageNav SR
SVD† (HM3D in-domain post-train) → HM3D/MP3D eval: 60.98% AR SR, 43.05% ImageNav SR
Wan2.1† (HSSD) → 62.98% AR SR, 42.36% ImageNav SR
Wan2.1† (HM3D train) → 62.61% AR SR, 45.14% ImageNav SR

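The synthetic-to-real gap is small. For AR it is at most 2 pp (SVD† loses 2.00 pp, while Wan2.1† is actually 0.37 pp higher with HSSD post-training), and for ImageNav it is roughly 3 to 4 pp (4.16 pp for SVD†, 2.78 pp for Wan2.1†). The action-conditioned representations learned during post-training generalize across scene distributions, even from synthetic to real scanned environments.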

Why cross-domain transfer works: The action-to-motion mapping is largely environment-independent. "Move forward 0.2m" produces roughly the same visual change whether you are in a synthetic kitchen or a real scanned living room. The visual appearance differs, but the motion pattern is universal. Post-training teaches the latter, and the frozen pretrained backbone handles the former.

What the Paper Doesn't Say (Assumptions and Limitations)

Every paper has blind spots. Here are the ones worth noting:

Perfect action execution assumed. In the real world, actions are noisy — "move forward 0.2m" might actually move 0.18m or 0.22m. The paper doesn't test robustness to action execution noise.
No real-robot experiments. All results are in simulation (Habitat-Sim, CoppeliaSim). The sim-to-real gap for world-model-based planning is unexplored.
Computational cost is high. Each decision step requires M forward passes through a video diffusion model. For Wan2.1 (5B params), this is many seconds per step. Real-time embodied agents cannot afford this.
Proposal policy quality is a ceiling. The world model can only evaluate plans the proposal policy generates. If the proposal policy never suggests the right action, the world model cannot fix this. The paper acknowledges: "stronger proposal and revision policies set the performance floor."
Single-step planning horizon. The agent executes only the first action of the best plan, then replans. For tasks requiring long-horizon reasoning (20+ steps), this may be suboptimal. The paper notes that "long-horizon planning with world models remains challenging."

Controllability vs. Visual Quality: The Key Ablation

Interactive figure: each dot represents a model configuration. The left plot shows SR vs. visual quality (weak R²); the right plot shows SR vs. controllability (strong R²).

The paper shows that a simple LPIPS-based revision policy can outperform a complex VLM-based revision policy for ImageNav. Why does this happen?

For image-goal navigation, the task objective is literally "reach the viewpoint matching this goal image": LPIPS directly measures perceptual distance between the predicted view and the goal, which is an exact proxy for the reward, while the VLM introduces unnecessary reasoning complexity
LPIPS uses a deeper neural network than the VLM
The VLM was not properly fine-tuned for navigation

Chapter 9: Connections

Cheat Sheet: Key Equations

Proposal (Eq. 1): Â_t^(m) ~ π_proposal(A | o_t, g), m=1,...,M
• o_t: current observation (RGB/RGB-D image)
• g: task goal (text, image, or bounding box)
• Â_t^(m): m-th candidate action plan of horizon L
• M: beam width (number of candidate plans)

Simulation (Eq. 2): Ô_t^(m) ~ g_θ(O | o_t, I_t^(m))
• I_t^(m) = C(Â_t^(m)): control input from the unified action API
• g_θ: visual world model (video generator)
• Ô_t^(m) = [ô_{t+1}, ..., ô_{t+L}]: predicted future observations

Revision (Eq. 3): D_t* = π_revision({(Â_t^(m), Ô_t^(m))}_{m=1}^{M}, o_t, g)
• D_t*: best decision at time t
• Can be score-and-select (Eq. 4) or synthesis-based

Score-and-Select (Eq. 4): m* = argmax_m S(Â_t^(m), Ô_t^(m) | o_t, g)
• S: task-specific scoring function (VLM confidence, LPIPS to goal, etc.)
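
Equations 1 to 4 fit together into one loop: propose M plans, simulate each, score the rollouts, execute the first action of the winner, then replan. The sketch below runs that loop in a deliberately tiny toy world, where the observation is an integer position on a line and the "world model" is exact. Every function here is a hypothetical stand-in for the paper's components (a VLM or policy for proposals, a video generator for simulation, a task-specific scorer for S).

```python
def propose(o_t, goal, horizon):
    """pi_proposal stand-in: M = 3 fixed candidate plans of length L."""
    return [[-1] * horizon, [0] * horizon, [+1] * horizon]


def simulate(o_t, plan):
    """g_theta stand-in: predicts [o_{t+1}, ..., o_{t+L}] for a plan."""
    predictions, o = [], o_t
    for action in plan:
        o += action
        predictions.append(o)
    return predictions


def score(plan, predictions, o_t, goal):
    """S stand-in: negative distance of the closest predicted state to the goal."""
    return -min(abs(o - goal) for o in predictions)


def run_episode(o_0, goal, horizon=3, budget=10):
    o_t, trajectory = o_0, [o_0]
    for _ in range(budget):
        if o_t == goal:
            break
        plans = propose(o_t, goal, horizon)                       # Eq. 1
        rollouts = [(p, simulate(o_t, p)) for p in plans]         # Eq. 2
        best_plan, _ = max(                                       # Eqs. 3-4
            rollouts, key=lambda r: score(r[0], r[1], o_t, goal))
        o_t += best_plan[0]  # execute only the first action, then replan
        trajectory.append(o_t)
    return trajectory


print(run_episode(o_0=0, goal=3))  # [0, 1, 2, 3]
```

Here the environment and the world model coincide, so planning is trivially accurate. The benchmark exists precisely because real video generators are not exact simulators, and the loop inherits whatever errors `simulate` makes.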

Action API: I = C(A) — translates agent actions to model-specific control inputs
• Text: concatenated action phrases
• Camera: {(x_k, y_k, φ_k)}_{k=1}^{K}
• Low-level: mapped action codes A_world
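
As a rough illustration of what the converter C might do for the first two control types, the sketch below maps a list of discrete actions to a text prompt and to a sequence of camera poses (x_k, y_k, φ_k). The 0.2m step and 22.5° turn are the example values used earlier in this lesson. The action names, the phrase wording, and the coordinate convention (φ in radians, left turns counter-clockwise, poses relative to the starting camera) are assumptions made here, not the paper's specification.

```python
import math

STEP_M = 0.2                     # forward step, as in "move forward 0.2m"
TURN_RAD = math.radians(22.5)    # turn size, as in "turn left 22.5°"

# Hypothetical action vocabulary and phrasing.
PHRASES = {
    "forward": "move forward 0.2m",
    "left": "turn left 22.5 degrees",
    "right": "turn right 22.5 degrees",
}


def to_text(actions):
    """Text control input: concatenated action phrases."""
    return ", ".join(PHRASES[a] for a in actions)


def to_camera(actions, x=0.0, y=0.0, phi=0.0):
    """Camera control input: one (x, y, phi) pose after each action."""
    poses = []
    for action in actions:
        if action == "forward":
            x += STEP_M * math.cos(phi)
            y += STEP_M * math.sin(phi)
        elif action == "left":
            phi += TURN_RAD
        elif action == "right":
            phi -= TURN_RAD
        else:
            raise ValueError(f"unknown action: {action}")
        poses.append((x, y, phi))
    return poses


plan = ["forward", "left", "forward"]
print(to_text(plan))
for pose in to_camera(plan):
    print(tuple(round(v, 4) for v in pose))
```

The point of a unified API like this is that the planner only ever emits actions. Whether a given world model consumes text, camera trajectories, or low-level codes is hidden behind the converter.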

Related Lessons on This Site

Gleams: World Models — Builds intuition for what world models are and why they matter, from the "mental simulation" perspective. Start here if the concept of world models is new.
Gleams: RL Algorithms — Covers model-based RL and model predictive control (MPC), which World-In-World generalizes. Understanding MPC makes the proposal-simulate-revise loop intuitive.
Gleams: Robot Learning — Covers embodied AI, manipulation, and navigation tasks that World-In-World benchmarks.
Veanors: Stable Video Diffusion — SVD is one of the world models benchmarked in World-In-World. This lesson covers its architecture and training.
Veanors: MBPO — Model-Based Policy Optimization uses learned dynamics models for planning in continuous control. World-In-World extends this paradigm to visual world models.
Veanors: LingBot-VA — Causal world modeling for robot control with autoregressive diffusion. A different approach to the same problem World-In-World benchmarks.

Key Takeaways for Practitioners

When to use a visual world model for embodied planning:
• Use when: The task involves perception under uncertainty (occlusion, extreme viewpoints), navigation planning (which path looks best?), or when the base policy is weak and needs augmentation.
• Don't use when: The task requires precise physical dynamics (contact-rich manipulation), real-time decisions (<100ms per step), or when the action space is trivial (1-2 options).
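• Post-train always: Even a few thousand action-observation examples dramatically improve controllability. Among open models, zero-shot video generators generally trail their post-trained counterparts on embodied tasks; the proprietary Runway Gen4 (64.79% AR SR zero-shot) is the notable exception.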
• Budget inference compute wisely: More rollouts help, but with diminishing returns. The steepest gains come from going from 2-3 rollouts to 6-8. The reported curves stop at about 11 inferences, so gains beyond that point are untested.

Open Problems

Long-horizon planning: Current world models simulate short-term changes well but struggle on long horizons (>10 steps). Future work: spatial memory (Zhou et al., 2025), episode-level memory (Cai et al., 2025), surfel-indexed views (Li et al., 2025d).
Physical dynamics: Manipulation tasks expose the gap. Physics-guided motion generation (Wang et al., 2025a; Akkerman et al., 2025), inferring physical properties (Cao et al., 2025; Gillman et al., 2025), and physics-aware RL post-training (Wu et al., 2025) are promising directions.
Efficient inference: Real-time world-model architectures (Yang et al., 2025b; Kodaira et al., 2025), streaming inference (Huang et al., 2025), and distillation (Wang et al., 2025b; Agarwal et al., 2025) are needed to make this practical.
Real-world transfer: All experiments are in simulation. Closing the sim-to-real gap for world-model-based planning is the grand challenge.
Multi-agent settings: COMBO (Zhang et al., 2025a) explores compositional world models for multi-agent cooperation. Extending World-In-World's benchmark to multi-agent closed-loop evaluation is a natural next step.

The Big Picture

World-In-World forces a paradigm shift in how we evaluate world models. Before this paper, the field optimized for FID and aesthetic scores — making predictions look like real videos. After this paper, the question becomes: do these predictions help agents act?

The answers are nuanced: world models help substantially for perception and navigation, modestly for question answering, and barely for manipulation. Controllability matters more than visual quality. Post-training on action-observation data is cheap and effective. More inference-time compute reliably buys more success.

The analogy that sticks: before World-In-World, we were judging chess engines by how pretty their board renderings look. Now we are judging them by whether they win the game.

You are designing an embodied agent for a new navigation task. You have a pretrained Wan2.1 video generator (5B params) and can afford either (a) upgrading to Wan2.2 (14B params) or (b) collecting 40K action-observation examples from the target environment and post-training Wan2.1. Based on the paper's findings, which strategy should you choose?

Post-train Wan2.1 on 40K examples: the paper shows that scaling post-training data is more effective than upgrading the pretrained generator, because adaptation to the target action space (controllability) matters more than raw visual generation quality (which is already sufficient)
Upgrade to Wan2.2 for better visual quality
Neither: world models don't help with navigation

World-In-World: World Models in a Closed-Loop World