What Is a World Model?
A world model is a learned model of environment dynamics: given the current state and an action, it predicts what happens next. Instead of acting blindly in the real world and observing consequences, an agent with a world model can "imagine" the results of its actions before taking them.
This is, in many ways, what your brain does. You don't need to actually drop a glass to know it will shatter. You simulate the outcome internally — a mental model of physics, of cause and effect. World models give AI the same capability: a learned simulator that enables planning, prediction, and data-efficient learning.
The approach stands in contrast to model-free reinforcement learning, where agents learn purely from trial-and-error without building any internal representation of how the world works. Model-free methods are simple but sample-hungry: they may need millions of interactions to learn what a model-based agent can infer from imagination alone.
World models have evolved from simple state-space models in control theory to deep neural networks that can predict entire video sequences. The field sits at the intersection of reinforcement learning, generative modeling, and representation learning — and many researchers (notably Yann LeCun) argue that world models are the missing piece needed for human-level AI.
Architecture
The canonical world model architecture has three components arranged in a loop: an encoder that compresses observations into latent states, a dynamics model that predicts how latent states evolve, and a decoder that reconstructs observations from latent states. The imagination loop lets the agent roll out entire trajectories without touching the real environment.
Component Breakdown
- Encoder (obs → z): Typically a CNN or ViT that maps high-dimensional observations (images, point clouds) into a compact latent vector. In Dreamer, this uses a learned posterior q(z_t | o_t, h_t).
- Dynamics Model (z_t, a_t → z_{t+1}): The core predictor. Usually an RSSM (Recurrent State-Space Model) in Dreamer, or a transformer in newer architectures. Predicts the next latent state given the current state and action.
- Decoder (z → obs): Reconstructs observations for supervision. In JEPA, this component is deliberately omitted — predictions stay in latent space.
- Reward predictor (z → r): A small MLP that predicts reward from latent state. Needed for RL-based world models like Dreamer.
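The loop can be sketched end to end with toy linear components (the dimensions, names, and tanh dynamics here are illustrative stand-ins, not taken from any specific system):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACT_DIM = 64, 8, 2

# Toy linear stand-ins for the learned components.
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1    # encoder: obs -> z
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1 # dynamics: z -> z'
W_act = rng.normal(size=(LATENT_DIM, ACT_DIM)) * 0.1    # action contribution
W_dec = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1    # decoder: z -> obs

def encode(obs):
    return W_enc @ obs

def dynamics(z, a):
    return np.tanh(W_dyn @ z + W_act @ a)

def decode(z):
    return W_dec @ z

# One pass around the loop: encode, imagine one step, reconstruct.
obs = rng.normal(size=OBS_DIM)
z = encode(obs)
z_next = dynamics(z, np.zeros(ACT_DIM))
obs_pred = decode(z_next)
assert obs_pred.shape == (OBS_DIM,)
```

Chaining `dynamics` calls without ever calling `decode` is exactly the imagination loop: trajectories unroll purely in latent space.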
Core Paradigms
World models come in several flavors, each with different assumptions about what to predict, where to predict, and how to use the predictions.
JEPA (Joint Embedding Predictive Architecture)
Proposed by Yann LeCun as the blueprint for autonomous machine intelligence. JEPA avoids the pitfalls of generative pixel prediction by operating entirely in embedding space. The predictor learns abstract relationships — how objects move, what happens when forces are applied — without wasting capacity on irrelevant visual details like exact textures.
Dreamer
DreamerV3 (Hafner et al., 2023) is the state of the art in model-based RL. It uses a Recurrent State-Space Model (RSSM) with both deterministic and stochastic components, trains on imagined rollouts of 15 steps, and uses symlog predictions to handle varying reward scales. A single algorithm, fixed hyperparameters, 150+ tasks across diverse domains.
Generative Video World Models
The boldest bet: if you train a large enough video model on enough data, you get an implicit simulator of the world. Genie (DeepMind) learns from unlabeled video and can generate playable 2D environments. UniSim and Cosmos scale this to photorealistic 3D simulation. The model learns physics, object permanence, and causality implicitly from visual patterns.
Model Predictive Control (MPC)
MPC is the classic use of world models: at each timestep, sample many action sequences, roll each forward through the learned model, evaluate cumulative reward, and execute the first action of the best sequence. Methods like CEM (Cross-Entropy Method) and MPPI are used for optimization. TD-MPC2 (Hansen et al., 2024) combines learned dynamics with temporal-difference learning for state-of-the-art results.
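A minimal random-shooting planner over a toy point-mass model illustrates the loop (CEM and MPPI refine the same scheme by iteratively reweighting the sampled sequences; the dynamics, reward, and all names here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(s, a):
    # Toy 1-D point mass: state = (position, velocity), action = force.
    pos, vel = s
    return np.array([pos + 0.1 * vel, vel + 0.1 * a])

def reward(s):
    return -abs(s[0] - 1.0)  # drive position toward 1.0

def plan(s0, horizon=10, n_samples=256):
    """Sample action sequences, roll each through the model,
    and return the first action of the best-scoring sequence."""
    seqs = rng.uniform(-1, 1, size=(n_samples, horizon))
    returns = np.zeros(n_samples)
    for i, seq in enumerate(seqs):
        s = s0.copy()
        for a in seq:
            s = dynamics(s, a)
            returns[i] += reward(s)
    return seqs[np.argmax(returns)][0]

# Execute one step, observe, then re-plan from the new state.
s = np.array([0.0, 0.0])
for _ in range(20):
    a = plan(s)
    s = dynamics(s, a)
```

Note that only the first action of each plan is ever executed; re-planning at every step is what keeps model error from compounding over the full horizon.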
JEPA Deep Dive
Yann LeCun's Joint Embedding Predictive Architecture (2022) is not just another world model — it's a proposal for how intelligence itself should be structured. The core argument: predicting raw pixels is wasteful, because most pixel-level detail is irrelevant for planning and decision-making.
Why Not Predict Pixels?
Consider predicting what happens when you push a cup off a table. A pixel-level predictor must generate the exact texture of the cup, the precise pattern of shattering, the specific splash of liquid — details that are both computationally expensive and inherently unpredictable (chaotic dynamics at the pixel level). Yet the abstract outcome is simple: the cup falls, it breaks. JEPA captures this abstract prediction by operating in embedding space.
The energy of a pair (x, y) is the prediction error in embedding space: E(x, y) = min_z D(Pred(s_x, z), s_y), where s_x = Enc_θ(x), s_y = Enc_ξ(y) (an EMA target), and z is a latent variable capturing uncertainty. Low energy = compatible pair. The system learns to assign low energy to correct predictions and high energy to incorrect ones.
JEPA vs Pixel Prediction
Pixel Prediction (Generative)
Must reconstruct ~150K pixels. Wastes capacity on texture, lighting, irrelevant visual detail. Loss is in pixel space.
JEPA (Latent Prediction)
Predicts ~4K abstract features. No decoder needed. Captures semantic content, ignores irrelevant detail. Loss is in embedding space.
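The contrast can be made concrete with a toy version of the latent objective: a trained context encoder, a slowly-updated EMA target encoder, and a predictor whose loss lives entirely in embedding space (linear stand-ins, not the actual I-JEPA architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, EMB_DIM = 32, 8

W_ctx = rng.normal(size=(EMB_DIM, OBS_DIM)) * 0.1  # context encoder (trained)
W_tgt = W_ctx.copy()                               # target encoder (EMA copy)
W_pred = np.eye(EMB_DIM)                           # predictor in embedding space

def jepa_loss(x_context, x_target):
    s_x = W_ctx @ x_context        # context embedding
    s_y = W_tgt @ x_target         # target embedding (no gradient in practice)
    s_y_hat = W_pred @ s_x         # predicted target embedding
    return np.mean((s_y_hat - s_y) ** 2)  # loss computed in embedding space

def ema_update(w_tgt, w_ctx, tau=0.99):
    # Target weights slowly track the context encoder.
    return tau * w_tgt + (1 - tau) * w_ctx

x = rng.normal(size=OBS_DIM)
loss = jepa_loss(x, x)  # identical views + identity predictor -> zero loss
assert loss < 1e-12
```

The key point is the loss shape: it compares two 8-dimensional embeddings rather than 32 (or 150K) raw observation values, so nothing forces the model to account for unpredictable detail.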
The JEPA Family
- I-JEPA (Image JEPA, 2023): Predicts masked image patch representations. Outperforms MAE on semantic tasks while being more compute-efficient. Uses a Vision Transformer encoder with multi-block masking.
- V-JEPA (Video JEPA, 2024): Extends to video. Predicts representations of masked spatiotemporal regions. Learns motion, object permanence, and temporal dynamics without any pixel-level reconstruction.
- MC-JEPA (2023): Jointly learns motion (optical flow) and content features in a shared embedding space. The eventual goal of the family: a unified world model that predicts across vision, language, and action.
Video as World Model
A radical idea: if you train a video generation model on enough internet video, you implicitly learn a simulator of the world. The model learns physics (objects fall), permanence (occluded objects still exist), and causality (pushing causes motion) — all from passive observation. Condition it on actions, and you get an interactive world model.
Genie
Generative Interactive Environment. Trained on 200K hours of unlabeled 2D platformer videos. Learns a latent action space from video alone — no action labels needed. Given a single image, generates a playable interactive environment. Uses a spatiotemporal (ST) transformer with video tokenizer and dynamics model. 11B parameters.
UniSim
Universal Simulator. A single generative model that simulates how scenes evolve given any type of action input — robot manipulation, human actions, camera motion, or text descriptions. Trained on diverse internet video + robotics data. Enables training RL policies and VLAs entirely in simulation without hand-crafted environments.
Cosmos
World Foundation Model. NVIDIA's platform for building physical AI. Trained on massive video datasets. Generates photorealistic video conditioned on text, actions, or scene descriptions. Designed as a backbone for robotics, autonomous driving, and industrial simulation. Tokenizer + autoregressive/diffusion generation up to 14B parameters.
World Models (Original)
The paper that started it all (Ha & Schmidhuber, 2018). VAE encoder + MDN-RNN dynamics model + small linear controller. Trained to play car racing and VizDoom in imagination. Showed that a compressed latent space + learned dynamics is sufficient for control — even when the real environment is never revisited during policy training.
Training World Models
Training a world model follows a distinct pattern that separates data collection, model learning, and policy optimization. The key insight of Dreamer-style approaches is that policy learning happens entirely in imagination — using the world model as a surrogate environment.
The Training Pipeline
Collect Trajectories
Gather (s, a, s', r) tuples from environment interaction. Initially random, then using the current policy. Data goes into a replay buffer. For video world models, this is replaced by internet-scale video datasets.
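A minimal replay buffer for this step might look like the following (real implementations add prioritization and capacity-aware storage; this sketch is illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, s_next, r) tuples and samples uniform minibatches."""

    def __init__(self, capacity=100_000):
        # deque with maxlen discards the oldest transitions when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.add(t, 0, t + 1, 1.0)  # dummy transitions
batch = buf.sample(8)
assert len(batch) == 8
```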
Train Encoder + Dynamics + Decoder
The encoder compresses observations to latent states. The dynamics model learns to predict z_{t+1} from z_t and a_t. The decoder (if used) reconstructs observations for supervision. In Dreamer, the RSSM is trained with a combination of reconstruction loss, KL divergence, and reward prediction loss.
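The combined objective can be sketched as a weighted sum of the three terms, using the closed-form KL between diagonal Gaussians (the weights and shapes here are illustrative, not Dreamer's actual hyperparameters):

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def world_model_loss(obs, obs_recon, reward, reward_pred,
                     mu_post, logvar_post, mu_prior, logvar_prior,
                     kl_weight=1.0, reward_weight=1.0):
    recon = np.mean((obs - obs_recon) ** 2)                         # reconstruction term
    kl = gaussian_kl(mu_post, logvar_post, mu_prior, logvar_prior)  # posterior vs prior
    rew = (reward - reward_pred) ** 2                               # reward prediction term
    return recon + kl_weight * kl + reward_weight * rew

# Perfect predictions and matching distributions give zero loss.
z = np.zeros(4)
loss = world_model_loss(np.ones(8), np.ones(8), 1.0, 1.0, z, z, z, z)
assert loss == 0.0
```

The KL term is what ties the pieces together: it pushes the dynamics model's prior toward the encoder's posterior, so that imagination stays consistent with what the encoder actually produces.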
Imagine Rollouts
Starting from real encoded states, unroll the dynamics model forward for H steps (typically 15 in DreamerV3) using the current policy to select actions. This produces imagined trajectories {z_1, ..., z_H} with predicted rewards at each step.
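A toy version of the imagination step, with linear latent dynamics and placeholder policy and reward heads standing in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACT_DIM, H = 8, 2, 15  # H = imagination horizon

A = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1  # toy latent dynamics
B = rng.normal(size=(LATENT_DIM, ACT_DIM)) * 0.1
w_r = rng.normal(size=LATENT_DIM)                    # toy reward head

def dynamics(z, a):
    return np.tanh(A @ z + B @ a)

def policy(z):
    return np.tanh(z[:ACT_DIM])  # toy actor

def imagine(z0, horizon=H):
    """Unroll the learned dynamics from a real encoded state; no env access."""
    zs, rs = [z0], []
    z = z0
    for _ in range(horizon):
        a = policy(z)
        z = dynamics(z, a)
        zs.append(z)
        rs.append(w_r @ z)  # predicted reward at each imagined step
    return np.array(zs), np.array(rs)

zs, rs = imagine(rng.normal(size=LATENT_DIM))
assert zs.shape == (H + 1, LATENT_DIM) and rs.shape == (H,)
```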
Train Actor-Critic in Imagination
The critic estimates value V(z_t). The actor maximizes expected imagined returns using backpropagation through the dynamics model. No real environment interaction needed. DreamerV3 uses symlog-transformed predictions and a fixed discount factor of γ = 0.997.
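The symlog transform mentioned above is simple enough to state exactly: symlog(x) = sign(x) · log(1 + |x|), with symexp as its inverse. It is linear near zero and logarithmic for large magnitudes, which keeps targets in a manageable range across reward scales:

```python
import numpy as np

def symlog(x):
    # sign(x) * log(1 + |x|): linear near zero, logarithmic for large |x|.
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    # Exact inverse of symlog.
    return np.sign(x) * np.expm1(np.abs(x))

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
assert np.allclose(symexp(symlog(x)), x)
# Large rewards are squashed to a manageable range:
assert symlog(np.array([1000.0]))[0] < 7.0
```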
Key Challenge: Compounding Error
The fundamental limitation of every world model is compounding error in long-horizon prediction. Each step of imagination introduces a small prediction error, and over many steps these errors compound, often exponentially, causing the imagined trajectory to diverge from reality.
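A toy simulation makes this concrete: give a learned model a ~0.01% parameter error on a chaotic system (here the logistic map, chosen purely for illustration) and one-step accuracy turns into long-horizon divergence:

```python
def true_step(x):
    return 3.9 * x * (1 - x)      # chaotic logistic map (the "real" environment)

def model_step(x):
    return 3.9005 * x * (1 - x)   # learned model with a tiny parameter error

x_true = x_model = 0.3
errors = []
for t in range(50):
    x_true = true_step(x_true)
    x_model = model_step(x_model)
    errors.append(abs(x_model - x_true))

# The one-step error is tiny; by the end of the rollout it has blown up.
assert errors[0] < 0.01
assert max(errors[25:]) > 100 * errors[0]
```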
Mitigation Strategies
- Short imagination horizons: DreamerV3 limits rollouts to 15 steps, trading off long-term planning for prediction accuracy.
- Latent-space prediction: Predicting in abstract space (JEPA, RSSM) is more stable than pixel-space prediction because the representation is smoother and lower-dimensional.
- Ensembles and uncertainty: Train multiple dynamics models and use disagreement as an uncertainty signal. Penalize the policy for entering high-uncertainty regions of the imagination.
- Re-planning (MPC): Don't commit to a full imagined plan. Execute one step, observe the real outcome, re-plan from the new state. This bounds the effective horizon of imagination to a single step at a time.
- Hierarchical models: Predict at multiple timescales. High-level abstract predictions (where will I be in 10 seconds?) are more stable than fine-grained step-by-step rollouts.
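The ensemble strategy from the list above can be sketched directly: several slightly different dynamics models (as if trained on different data) agree on familiar states and disagree on novel ones, and their spread is the uncertainty signal (linear toy models, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 4, 5  # state dimension, ensemble size

W_base = rng.normal(size=(DIM, DIM)) * 0.1
# K dynamics models that differ slightly, as if trained on different data.
ensemble = [W_base + rng.normal(size=(DIM, DIM)) * 0.05 for _ in range(K)]

def disagreement(s):
    """Std of next-state predictions across the ensemble, averaged over dims."""
    preds = np.stack([W @ s for W in ensemble])
    return preds.std(axis=0).mean()

s_familiar = np.ones(DIM) * 0.1  # small-magnitude state (near "training data")
s_novel = np.ones(DIM) * 10.0    # far outside the familiar regime
assert disagreement(s_novel) > disagreement(s_familiar)
```

In practice the policy's imagined return is penalized by this disagreement, discouraging it from exploiting regions where the model is guessing.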