What Is a World Model?
A world model is a learned model of environment dynamics: given the current state and an action, it predicts what happens next. Instead of acting blindly in the real world and observing consequences, an agent with a world model can "imagine" the results of its actions before taking them.
This is, in many ways, what your brain does. You don't need to actually drop a glass to know it will shatter. You simulate the outcome internally — a mental model of physics, of cause and effect. World models give AI the same capability: a learned simulator that enables planning, prediction, and data-efficient learning.
The approach stands in contrast to model-free reinforcement learning, where agents learn purely from trial-and-error without building any internal representation of how the world works. Model-free methods are simple but sample-hungry: they may need millions of interactions to learn what a model-based agent can infer from imagination alone.
World models have evolved from simple state-space models in control theory to deep neural networks that can predict entire video sequences. The field sits at the intersection of reinforcement learning, generative modeling, and representation learning — and many researchers (notably Yann LeCun) argue that world models are the missing piece needed for human-level AI.
Architecture
The canonical world model architecture has three components arranged in a loop: an encoder that compresses observations into latent states, a dynamics model that predicts how latent states evolve, and a decoder that reconstructs observations from latent states. The imagination loop lets the agent roll out entire trajectories without touching the real environment.
Component Breakdown
- Encoder (obs → z): Typically a CNN or ViT that maps high-dimensional observations (images, point clouds) into a compact latent vector. In Dreamer, this uses a learned posterior q(z_t | o_t, h_t).
- Dynamics Model (z_t, a_t → z_{t+1}): The core predictor. Usually an RSSM (Recurrent State-Space Model) in Dreamer, or a transformer in newer architectures. Predicts the next latent state given the current state and action.
- Decoder (z → obs): Reconstructs observations for supervision. In JEPA, this component is deliberately omitted — predictions stay in latent space.
- Reward predictor (z → r): A small MLP that predicts reward from latent state. Needed for RL-based world models like Dreamer.
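The loop can be sketched end to end with toy linear components (the dimensions, names, and tanh dynamics here are illustrative stand-ins, not taken from any specific system):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACT_DIM = 64, 8, 2

# Toy linear stand-ins for the learned components.
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1    # encoder: obs -> z
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1 # dynamics: z -> z'
W_act = rng.normal(size=(LATENT_DIM, ACT_DIM)) * 0.1    # action contribution
W_dec = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1    # decoder: z -> obs

def encode(obs):
    return W_enc @ obs

def dynamics(z, a):
    return np.tanh(W_dyn @ z + W_act @ a)

def decode(z):
    return W_dec @ z

# One pass around the loop: encode, imagine one step, reconstruct.
obs = rng.normal(size=OBS_DIM)
z = encode(obs)
z_next = dynamics(z, np.zeros(ACT_DIM))
obs_pred = decode(z_next)
assert obs_pred.shape == (OBS_DIM,)
```

Chaining `dynamics` calls without ever calling `decode` is exactly the imagination loop: trajectories unroll purely in latent space.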
Core Paradigms
World models come in several flavors, each with different assumptions about what to predict, where to predict, and how to use the predictions.
JEPA (Joint Embedding Predictive Architecture)
Proposed by Yann LeCun as the blueprint for autonomous machine intelligence. JEPA avoids the pitfalls of generative pixel prediction by operating entirely in embedding space. The predictor learns abstract relationships — how objects move, what happens when forces are applied — without wasting capacity on irrelevant visual details like exact textures.
Dreamer
DreamerV3 (Hafner et al., 2023) is the state of the art in model-based RL. It uses a Recurrent State-Space Model (RSSM) with both deterministic and stochastic components, trains on imagined rollouts of 15 steps, and uses symlog predictions to handle varying reward scales. A single algorithm, fixed hyperparameters, 150+ tasks across diverse domains.
Generative Video World Models
The boldest bet: if you train a large enough video model on enough data, you get an implicit simulator of the world. Genie (DeepMind) learns from unlabeled video and can generate playable 2D environments. UniSim and Cosmos scale this to photorealistic 3D simulation. The model learns physics, object permanence, and causality implicitly from visual patterns.
Model Predictive Control (MPC)
MPC is the classic use of world models: at each timestep, sample many action sequences, roll each forward through the learned model, evaluate cumulative reward, and execute the first action of the best sequence. Methods like CEM (Cross-Entropy Method) and MPPI are used for optimization. TD-MPC2 (Hansen et al., 2024) combines learned dynamics with temporal-difference learning for state-of-the-art results.
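A minimal random-shooting planner over a toy point-mass model illustrates the loop (CEM and MPPI refine the same scheme by iteratively reweighting the sampled sequences; the dynamics, reward, and all names here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(s, a):
    # Toy 1-D point mass: state = (position, velocity), action = force.
    pos, vel = s
    return np.array([pos + 0.1 * vel, vel + 0.1 * a])

def reward(s):
    return -abs(s[0] - 1.0)  # drive position toward 1.0

def plan(s0, horizon=10, n_samples=256):
    """Sample action sequences, roll each through the model,
    and return the first action of the best-scoring sequence."""
    seqs = rng.uniform(-1, 1, size=(n_samples, horizon))
    returns = np.zeros(n_samples)
    for i, seq in enumerate(seqs):
        s = s0.copy()
        for a in seq:
            s = dynamics(s, a)
            returns[i] += reward(s)
    return seqs[np.argmax(returns)][0]

# Execute one step, observe, then re-plan from the new state.
s = np.array([0.0, 0.0])
for _ in range(20):
    a = plan(s)
    s = dynamics(s, a)
```

Note that only the first action of each plan is ever executed; re-planning at every step is what keeps model error from compounding over the full horizon.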
JEPA Deep Dive
Yann LeCun's Joint Embedding Predictive Architecture (2022) is not just another world model — it's a proposal for how intelligence itself should be structured. The core argument: predicting raw pixels is wasteful, because most pixel-level detail is irrelevant for planning and decision-making.
Why Not Predict Pixels?
Consider predicting what happens when you push a cup off a table. A pixel-level predictor must generate the exact texture of the cup, the precise pattern of shattering, the specific splash of liquid — details that are both computationally expensive and inherently unpredictable (chaotic dynamics at the pixel level). Yet the abstract outcome is simple: the cup falls, it breaks. JEPA captures this abstract prediction by operating in embedding space.
The energy of a pair (x, y) is the prediction error in embedding space: E(x, y) = min_z D(Pred(s_x, z), s_y), where s_x = Enc_θ(x), s_y = Enc_ξ(y) (an EMA target), and z is a latent variable capturing uncertainty. Low energy = compatible pair. The system learns to assign low energy to correct predictions and high energy to incorrect ones.
JEPA vs Pixel Prediction
Pixel Prediction (Generative)
Must reconstruct ~150K pixels. Wastes capacity on texture, lighting, irrelevant visual detail. Loss is in pixel space.
JEPA (Latent Prediction)
Predicts ~4K abstract features. No decoder needed. Captures semantic content, ignores irrelevant detail. Loss is in embedding space.
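The contrast can be made concrete with a toy version of the latent objective: a trained context encoder, a slowly-updated EMA target encoder, and a predictor whose loss lives entirely in embedding space (linear stand-ins, not the actual I-JEPA architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, EMB_DIM = 32, 8

W_ctx = rng.normal(size=(EMB_DIM, OBS_DIM)) * 0.1  # context encoder (trained)
W_tgt = W_ctx.copy()                               # target encoder (EMA copy)
W_pred = np.eye(EMB_DIM)                           # predictor in embedding space

def jepa_loss(x_context, x_target):
    s_x = W_ctx @ x_context        # context embedding
    s_y = W_tgt @ x_target         # target embedding (no gradient in practice)
    s_y_hat = W_pred @ s_x         # predicted target embedding
    return np.mean((s_y_hat - s_y) ** 2)  # loss computed in embedding space

def ema_update(w_tgt, w_ctx, tau=0.99):
    # Target weights slowly track the context encoder.
    return tau * w_tgt + (1 - tau) * w_ctx

x = rng.normal(size=OBS_DIM)
loss = jepa_loss(x, x)  # identical views + identity predictor -> zero loss
assert loss < 1e-12
```

The key point is the loss shape: it compares two 8-dimensional embeddings rather than 32 (or 150K) raw observation values, so nothing forces the model to account for unpredictable detail.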
The JEPA Family
- I-JEPA (Image JEPA, 2023): Predicts masked image patch representations. Outperforms MAE on semantic tasks while being more compute-efficient. Uses a Vision Transformer encoder with multi-block masking.
- V-JEPA (Video JEPA, 2024): Extends to video. Predicts representations of masked spatiotemporal regions. Learns motion, object permanence, and temporal dynamics without any pixel-level reconstruction.
- MC-JEPA (2023): Jointly learns motion (optical flow) and content features in a shared embedding space. The eventual goal of the family: a unified world model that predicts across vision, language, and action.
Video as World Model
A radical idea: if you train a video generation model on enough internet video, you implicitly learn a simulator of the world. The model learns physics (objects fall), permanence (occluded objects still exist), and causality (pushing causes motion) — all from passive observation. Condition it on actions, and you get an interactive world model.
Genie
Generative Interactive Environment. Trained on 200K hours of unlabeled 2D platformer videos. Learns a latent action space from video alone — no action labels needed. Given a single image, generates a playable interactive environment. Uses a spatiotemporal (ST) transformer with video tokenizer and dynamics model. 11B parameters.
UniSim
Universal Simulator. A single generative model that simulates how scenes evolve given any type of action input — robot manipulation, human actions, camera motion, or text descriptions. Trained on diverse internet video + robotics data. Enables training RL policies and VLAs entirely in simulation without hand-crafted environments.
Cosmos
World Foundation Model. NVIDIA's platform for building physical AI. Trained on massive video datasets. Generates photorealistic video conditioned on text, actions, or scene descriptions. Designed as a backbone for robotics, autonomous driving, and industrial simulation. Tokenizer + autoregressive/diffusion generation up to 14B parameters.
World Models (Original)
The paper that started it all (Ha & Schmidhuber, 2018). VAE encoder + MDN-RNN dynamics model + small linear controller. Trained to play car racing and VizDoom in imagination. Showed that a compressed latent space + learned dynamics is sufficient for control — even when the real environment is never revisited during policy training.
Training World Models
Training a world model follows a distinct pattern that separates data collection, model learning, and policy optimization. The key insight of Dreamer-style approaches is that policy learning happens entirely in imagination — using the world model as a surrogate environment.
The Training Pipeline
Collect Trajectories
Gather (s, a, s', r) tuples from environment interaction. Initially random, then using the current policy. Data goes into a replay buffer. For video world models, this is replaced by internet-scale video datasets.
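A minimal replay buffer for this step might look like the following (real implementations add prioritization and capacity-aware storage; this sketch is illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, s_next, r) tuples and samples uniform minibatches."""

    def __init__(self, capacity=100_000):
        # deque with maxlen discards the oldest transitions when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.add(t, 0, t + 1, 1.0)  # dummy transitions
batch = buf.sample(8)
assert len(batch) == 8
```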
Train Encoder + Dynamics + Decoder
The encoder compresses observations to latent states. The dynamics model learns to predict z_{t+1} from z_t and a_t. The decoder (if used) reconstructs observations for supervision. In Dreamer, the RSSM is trained with a combination of reconstruction loss, KL divergence, and reward prediction loss.
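The combined objective can be sketched as a weighted sum of the three terms, using the closed-form KL between diagonal Gaussians (the weights and shapes here are illustrative, not Dreamer's actual hyperparameters):

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def world_model_loss(obs, obs_recon, reward, reward_pred,
                     mu_post, logvar_post, mu_prior, logvar_prior,
                     kl_weight=1.0, reward_weight=1.0):
    recon = np.mean((obs - obs_recon) ** 2)                         # reconstruction term
    kl = gaussian_kl(mu_post, logvar_post, mu_prior, logvar_prior)  # posterior vs prior
    rew = (reward - reward_pred) ** 2                               # reward prediction term
    return recon + kl_weight * kl + reward_weight * rew

# Perfect predictions and matching distributions give zero loss.
z = np.zeros(4)
loss = world_model_loss(np.ones(8), np.ones(8), 1.0, 1.0, z, z, z, z)
assert loss == 0.0
```

The KL term is what ties the pieces together: it pushes the dynamics model's prior toward the encoder's posterior, so that imagination stays consistent with what the encoder actually produces.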
Imagine Rollouts
Starting from real encoded states, unroll the dynamics model forward for H steps (typically 15 in DreamerV3) using the current policy to select actions. This produces imagined trajectories {z_1, ..., z_H} with predicted rewards at each step.
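A toy version of the imagination step, with linear latent dynamics and placeholder policy and reward heads standing in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACT_DIM, H = 8, 2, 15  # H = imagination horizon

A = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1  # toy latent dynamics
B = rng.normal(size=(LATENT_DIM, ACT_DIM)) * 0.1
w_r = rng.normal(size=LATENT_DIM)                    # toy reward head

def dynamics(z, a):
    return np.tanh(A @ z + B @ a)

def policy(z):
    return np.tanh(z[:ACT_DIM])  # toy actor

def imagine(z0, horizon=H):
    """Unroll the learned dynamics from a real encoded state; no env access."""
    zs, rs = [z0], []
    z = z0
    for _ in range(horizon):
        a = policy(z)
        z = dynamics(z, a)
        zs.append(z)
        rs.append(w_r @ z)  # predicted reward at each imagined step
    return np.array(zs), np.array(rs)

zs, rs = imagine(rng.normal(size=LATENT_DIM))
assert zs.shape == (H + 1, LATENT_DIM) and rs.shape == (H,)
```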
Train Actor-Critic in Imagination
The critic estimates value V(z_t). The actor maximizes expected imagined returns using backpropagation through the dynamics model. No real environment interaction needed. DreamerV3 uses symlog-transformed predictions and a fixed discount factor of γ = 0.997.
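The symlog transform mentioned above is simple enough to state exactly: symlog(x) = sign(x) · log(1 + |x|), with symexp as its inverse. It is linear near zero and logarithmic for large magnitudes, which keeps targets in a manageable range across reward scales:

```python
import numpy as np

def symlog(x):
    # sign(x) * log(1 + |x|): linear near zero, logarithmic for large |x|.
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    # Exact inverse of symlog.
    return np.sign(x) * np.expm1(np.abs(x))

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
assert np.allclose(symexp(symlog(x)), x)
# Large rewards are squashed to a manageable range:
assert symlog(np.array([1000.0]))[0] < 7.0
```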
Key Challenge: Compounding Error
The fundamental limitation of every world model is compounding error in long-horizon prediction. Each step of imagination introduces a small prediction error, and over many steps these errors compound, often exponentially, causing the imagined trajectory to diverge from reality.
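A toy simulation makes this concrete: give a learned model a ~0.01% parameter error on a chaotic system (here the logistic map, chosen purely for illustration) and one-step accuracy turns into long-horizon divergence:

```python
def true_step(x):
    return 3.9 * x * (1 - x)      # chaotic logistic map (the "real" environment)

def model_step(x):
    return 3.9005 * x * (1 - x)   # learned model with a tiny parameter error

x_true = x_model = 0.3
errors = []
for t in range(50):
    x_true = true_step(x_true)
    x_model = model_step(x_model)
    errors.append(abs(x_model - x_true))

# The one-step error is tiny; by the end of the rollout it has blown up.
assert errors[0] < 0.01
assert max(errors[25:]) > 100 * errors[0]
```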
Mitigation Strategies
- Short imagination horizons: DreamerV3 limits rollouts to 15 steps, trading off long-term planning for prediction accuracy.
- Latent-space prediction: Predicting in abstract space (JEPA, RSSM) is more stable than pixel-space prediction because the representation is smoother and lower-dimensional.
- Ensembles and uncertainty: Train multiple dynamics models and use disagreement as an uncertainty signal. Penalize the policy for entering high-uncertainty regions of the imagination.
- Re-planning (MPC): Don't commit to a full imagined plan. Execute one step, observe the real outcome, re-plan from the new state. This bounds the effective horizon of imagination to a single step at a time.
- Hierarchical models: Predict at multiple timescales. High-level abstract predictions (where will I be in 10 seconds?) are more stable than fine-grained step-by-step rollouts.
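The ensemble strategy from the list above can be sketched directly: several slightly different dynamics models (as if trained on different data) agree on familiar states and disagree on novel ones, and their spread is the uncertainty signal (linear toy models, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 4, 5  # state dimension, ensemble size

W_base = rng.normal(size=(DIM, DIM)) * 0.1
# K dynamics models that differ slightly, as if trained on different data.
ensemble = [W_base + rng.normal(size=(DIM, DIM)) * 0.05 for _ in range(K)]

def disagreement(s):
    """Std of next-state predictions across the ensemble, averaged over dims."""
    preds = np.stack([W @ s for W in ensemble])
    return preds.std(axis=0).mean()

s_familiar = np.ones(DIM) * 0.1  # small-magnitude state (near "training data")
s_novel = np.ones(DIM) * 10.0    # far outside the familiar regime
assert disagreement(s_novel) > disagreement(s_familiar)
```

In practice the policy's imagined return is penalized by this disagreement, discouraging it from exploiting regions where the model is guessing.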