How AI learns an internal simulator of the world — predicting what happens next, imagining consequences, and planning without trial and error.
Before you catch a ball, your brain simulates its trajectory. You don't need to try every possible arm position — you predict where the ball will be and move accordingly. This internal simulation is a world model.
An AI world model does the same thing: given the current state and an action, it predicts what the next state will be. If the model is accurate enough, the agent can plan in imagination instead of learning through costly real-world trial and error.
A self-driving car can't learn to avoid pedestrians by hitting a few first. A surgical robot can't learn by botching operations. A warehouse robot can't learn by dropping expensive packages. Every real interaction has a cost — time, money, safety risk. A world model lets you make millions of mistakes inside your head before taking a single real action.
The teal ball follows real physics. The orange trails are imagined futures from the world model. Click to launch a new ball.
Before we dive in, let's define three RL terms we'll use throughout this lesson. A policy is a rule that tells the agent what action to take in each situation — think of it as the agent's strategy or decision-making function. A reward is a number the environment gives after each action, telling the agent how well it did. Reinforcement learning (RL) is the process of learning a good policy by trial and error — trying actions, observing rewards, and adjusting the strategy.
With those in hand: in model-free RL, the agent learns a policy directly from experience — try things, get rewards, adjust. In model-based RL, the agent first learns a dynamics model (a simulator of how the world works), then uses that model to plan or generate synthetic experience.
Model-based methods learn faster because they extract more from each real experience. Adjust the model accuracy to see the trade-off.
How big is the difference in practice? On the Atari 100K benchmark (only 100K environment steps allowed — about 2 hours of human play):
| Method | Type | Human-Normalized Score | Env Steps |
|---|---|---|---|
| Random Agent | — | 0% | — |
| DQN | Model-free | 15% | 100K |
| PPO | Model-free | 20% | 100K |
| DreamerV2 | Model-based | 51% | 100K |
| EfficientZero | Model-based | 116% | 100K |
| Human | — | 100% | — |
At its core, a world model is just a function with a very specific signature:
$$f_\theta : (s_t, a_t) \mapsto (\hat{s}_{t+1}, \hat{r}_t)$$
Given the current state $s_t$ and an action $a_t$, predict the next state $s_{t+1}$ and the reward $r_t$. Everything else (RSSM, latent spaces, categorical distributions) is engineering to make this function more accurate, more efficient, or more general.
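To make the signature concrete, here is a minimal sketch in Python. The toy physics, the `State` and `Action` aliases, and the hover-at-height-1 reward are all illustrative assumptions standing in for a learned network; only the input/output shape of the function comes from the text.

```python
from typing import Callable, Tuple

State = Tuple[float, float]   # (height, velocity) of a toy ball; hypothetical
Action = float                # upward thrust applied during one step; hypothetical

# The signature every world model shares: (state, action) -> (next_state, reward)
WorldModel = Callable[[State, Action], Tuple[State, float]]


def toy_world_model(state: State, action: Action, dt: float = 0.1) -> Tuple[State, float]:
    """Hand-written stand-in for a learned model: simple Euler physics."""
    height, velocity = state
    velocity = velocity + (action - 9.8) * dt   # thrust minus gravity
    height = height + velocity * dt
    reward = -abs(height - 1.0)                 # assumed task: hover at height 1.0
    return (height, velocity), reward


model: WorldModel = toy_world_model
next_state, reward = model((1.0, 0.0), 9.8)     # thrust exactly cancels gravity
falling_state, falling_reward = model((1.0, 0.0), 0.0)
```

A learned world model would replace the body of `toy_world_model` with a neural network, but callers (planners, policies) would use it the same way.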
Once you have a dynamics model, there are three distinct strategies for exploiting it:
| Strategy | How It Works | Example |
|---|---|---|
| Background planning | Generate synthetic data, add to replay buffer (a memory bank that stores past experiences so the agent can re-learn from them), train model-free agent on mixed real + imagined data | Dyna-Q, MBPO |
| Decision-time planning | At each real step, imagine many futures, pick the best first action | CEM, MCTS, MuZero |
| Analytic gradients | Backpropagate reward gradients through the model to optimize the policy | Dreamer, SVG |
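As an illustration of the first row (background planning), the following is a minimal Dyna-Q-style sketch on a five-state chain. The environment, the hyperparameters, and the table-based "world model" (a dictionary of remembered transitions) are assumptions chosen to keep the example short; they are not taken from any particular paper's setup.

```python
import random

random.seed(0)
N_STATES, ACTIONS = 5, (-1, +1)      # 1-D chain; reaching the last state pays 1
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2


def env_step(s, a):
    """The 'real' environment: move left or right; the goal is the last state."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)


Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
model = {}                            # learned world model: (s, a) -> (s', r)


def q_update(s, a, r, s2):
    done = s2 == N_STATES - 1
    target = r + (0.0 if done else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])


def greedy(s):
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])


for episode in range(30):
    s = 0
    while s != N_STATES - 1:
        a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(s)
        s2, r = env_step(s, a)
        q_update(s, a, r, s2)         # learn from the real transition
        model[(s, a)] = (s2, r)       # record it in the model
        for _ in range(10):           # background planning: replay imagined steps
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            q_update(ps, pa, pr, ps2)
        s = s2

policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
```

Each real step is followed by ten imagined ones, which is where the extra sample efficiency comes from: the agent reuses what the model already knows instead of asking the environment again.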
Predicting the next image pixel-by-pixel is expensive and wasteful — most pixels don't matter for decision-making. Instead, modern world models work in a latent space: they encode observations into compact representations and predict dynamics in that compressed space.
Watch how a high-dimensional observation (left) is compressed into a small latent vector (middle), and prediction happens there. Toggle between pixel and latent prediction.
Let's trace the exact tensor shapes in a typical latent dynamics model (as used in PlaNet/Dreamer):
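A rough sketch of such a shape trace is below. The 200-dim deterministic state and the 32×32 stochastic state are the sizes used later in this lesson; the batch size, the 64×64 RGB input, the encoder width, and the action size are assumptions picked only for illustration.

```python
from math import prod

B = 16                                   # batch size (assumption)
obs_shape = (B, 3, 64, 64)               # RGB frames, 64x64 (assumption)
embed_dim = 1024                         # encoder output width (assumption)
h_dim, z_dim = 200, 32 * 32              # sizes quoted later in this lesson
act_dim = 6                              # action size (assumption)

shapes = {
    "observation":     obs_shape,
    "encoder output":  (B, embed_dim),
    "GRU input [z,a]": (B, z_dim + act_dim),
    "h (determin.)":   (B, h_dim),
    "z (stochastic)":  (B, z_dim),
    "latent [h,z]":    (B, h_dim + z_dim),
    "reward":          (B, 1),
}
for name, shape in shapes.items():
    print(f"{name:16s} {shape}")

pixels_per_frame = prod(obs_shape[1:])   # values a pixel predictor must output
latent_per_frame = h_dim + z_dim         # values a latent predictor must output
compression = pixels_per_frame / latent_per_frame
```

Under these assumed sizes the latent is roughly a tenth the size of a single frame, and the gap widens quickly with higher-resolution inputs.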
The encoder learns to keep decision-relevant information and discard everything else:
This is why latent world models work so well for RL: the agent doesn't need to predict the exact shade of green on a leaf to decide whether to jump over a pit. It only needs to know: "pit ahead, 3 units away."
The world model is trained on three losses simultaneously, each teaching it something different:
| Loss Term | What It Teaches | Typical Weight |
|---|---|---|
| $L_{\text{recon}} = \lVert \hat{o}_t - o_t \rVert^2$ | Encode enough to reconstruct the observation | 1.0 |
| $L_{\text{KL}} = \mathrm{KL}(\text{posterior} \parallel \text{prior})$ | Prior should match posterior (imagination becomes accurate) | β = 0.1 to 1.0 |
| $L_{\text{reward}} = \lVert \hat{r}_t - r_t \rVert^2$ | Predict rewards from latent state | 1.0 |
You have a latent world model with encoder E, dynamics model f, decoder D, and reward predictor R. The model receives sequences of $(o_t, a_t, r_t)$ from a replay buffer.
Your task: Derive why the training loss must have three terms (reconstruction + KL + reward), and why leaving any one out causes the model to fail at imagination.
Full derivation:
We want to learn a latent model where imagination produces useful training signal. This requires:
1. $L_{\text{recon}} = \lVert D(z_t) - o_t \rVert^2$: Forces the encoder to preserve the information in the observation, including whatever is decision-relevant. Without it, z can collapse to a trivial code that ignores the input.
2. $L_{\text{KL}} = \mathrm{KL}(q(z \mid h, o) \parallel p(z \mid h))$: The ELBO derivation gives this naturally. The posterior q uses real observations; the prior p uses only history. Minimizing the KL makes p generate z samples that look like those from q. Since imagination uses only p, this ensures dreamed latents are realistic.
3. $L_{\text{reward}} = \lVert R(h, z) - r \rVert^2$: The policy optimizes imagined rewards. If these are wrong, the policy learns to optimize the wrong objective. Reward prediction must be accurate for policy training to transfer to reality.
The key insight: Up to sign, this loss is the evidence lower bound (ELBO) for a sequential VAE. $L_{\text{recon}}$ is the likelihood term, $L_{\text{KL}}$ is the complexity penalty, and $L_{\text{reward}}$ is an auxiliary task that makes the latent space reward-predictive. The three losses create a triangle of constraints: encode well ↔ imagine well ↔ evaluate well.
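The three terms can be sketched numerically as follows. This is a toy calculation on hand-picked numbers, assuming a single three-class categorical latent and a three-value "observation"; the helper names (`mse`, `categorical_kl`, `world_model_loss`) are hypothetical and not from any library.

```python
import math


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def categorical_kl(q, p):
    """KL(q || p) for two discrete distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def world_model_loss(obs, obs_hat, reward, reward_hat, posterior, prior, beta=1.0):
    recon = mse(obs_hat, obs)                 # L_recon: can z rebuild the observation?
    kl = categorical_kl(posterior, prior)     # L_KL: does the prior match the posterior?
    rew = (reward_hat - reward) ** 2          # L_reward: is the imagined reward right?
    return recon + beta * kl + rew, (recon, kl, rew)


total, (recon, kl, rew) = world_model_loss(
    obs=[0.0, 1.0, 0.5], obs_hat=[0.1, 0.9, 0.5],
    reward=1.0, reward_hat=0.8,
    posterior=[0.7, 0.2, 0.1], prior=[0.4, 0.4, 0.2],
    beta=0.5,
)
```

Setting any one of the three terms to zero weight in `world_model_loss` removes the corresponding constraint, which is the failure mode the derivation describes.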
The Kalman filter and the RSSM both maintain a belief state updated in two phases: predict (propagate forward using dynamics alone), then update (correct using the new observation). The Kalman filter does this with linear Gaussians; the RSSM does it with neural networks and categorical distributions. The KL loss in the RSSM plays a role analogous to the innovation in the Kalman filter (the gap between the predicted and the actual observation): it measures how surprised the system is by the observation.
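For reference, the predict/update cycle of a scalar Kalman filter can be sketched in a few lines. Identity dynamics and the noise variances below are assumptions made for brevity.

```python
def kalman_step(mean, var, obs, process_var=0.1, obs_var=0.5):
    """One predict/update cycle of a 1-D Kalman filter with identity dynamics."""
    # Predict: propagate the belief using dynamics alone (the "prior").
    prior_mean, prior_var = mean, var + process_var
    # Update: correct the prediction with the new observation (the "posterior").
    innovation = obs - prior_mean               # how surprising the observation is
    gain = prior_var / (prior_var + obs_var)    # how far to move toward the observation
    post_mean = prior_mean + gain * innovation
    post_var = (1.0 - gain) * prior_var
    return (prior_mean, prior_var), (post_mean, post_var), gain


prior, posterior, gain = kalman_step(mean=0.0, var=0.4, obs=1.0)
```

The prior tuple corresponds to what the RSSM's prior network produces from history alone, and the posterior tuple to what the posterior network produces once the observation is seen.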
Can you identify the RSSM's "Kalman gain" — the mechanism that controls how much the posterior shifts from the prior when a new observation arrives?
During imagination, you have NO observations — only the history of imagined states. The posterior q(z|h,o) requires an observation o, so you can't use it when dreaming. You need a prior p(z|h) that generates plausible latent states from history alone. But the prior is only useful if it generates states similar to what the posterior would produce with real data — which is exactly what the KL loss enforces. Without KL training, the prior generates random, unrealistic latents, and any policy trained in that imagination will fail in reality. The prior IS the imagination engine; the KL loss IS the quality control that keeps imagination grounded in reality.
Dreamer is the most successful family of world-model agents. The architecture has three components: a world model (RSSM), an actor (policy), and a critic (value function). The key innovation: the actor and critic are trained entirely inside the world model's imagination.
The core of Dreamer is the Recurrent State-Space Model (RSSM). Unlike a simple latent vector, the RSSM splits state into two parts:
Total latent state: $h_t$ [200] + $z_t$ [1024] = 1,224 dimensions. The split matters: h carries deterministic dynamics (a ball moving along a trajectory), z captures stochastic branching (which direction the ball bounces).
| Version | Year | Key Improvement |
|---|---|---|
| Dreamer v1 | 2020 | Latent imagination + value estimation |
| Dreamer v2 | 2021 | Discrete latents (32×32 categorical), KL balancing, Atari mastery |
| DreamerV3 | 2023 | Symlog predictions, fixed hyperparams, works across domains unchanged |
The agent imagines future states from the current state. Teal = real states, orange = imagined futures, green = rewards predicted. Notice how opacity fades — the model is less certain further into the future.
Here's the exact computation for one imagination step. Starting from a real state (h3, z3) after 3 real environment steps:
```pseudocode
# Current real state
h3 = [200]                        # GRU hidden state from real experience
z3 = [1024]                       # Stochastic state (32 classes × 32 dims)

# === Imagination step 1 ===
a3 = actor(h3, z3)                # Policy outputs action [action_dim]
h4 = GRU(h3, cat(z3, a3))         # Deterministic update: [200]
z4 ~ categorical(prior_net(h4))   # Sample stochastic: [1024]
r4 = reward_net(h4, z4)           # Predicted reward: scalar

# === Imagination step 2 ===
a4 = actor(h4, z4)
h5 = GRU(h4, cat(z4, a4))
z5 ~ categorical(prior_net(h5))
r5 = reward_net(h5, z5)

# ... repeat for H=15 total steps
# Then backprop ∇θ through ALL 15 steps!
```
The RSSM has two paths to estimate $z_t$: the prior $p(z_t \mid h_t)$, which uses only the deterministic history, and the posterior $q(z_t \mid h_t, o_t)$, which also sees the current observation.
Your task: Derive from Bayes' rule why the posterior must condition on $o_t$, and explain the information-theoretic reason the posterior is always better than the prior for training.
Full derivation:
1. Generative model: $z_t \sim p(z \mid h_t)$, $o_t \sim p(o \mid z_t, h_t)$. The prior captures what the model expects before seeing reality.
2. Bayes' rule: $p(z_t \mid h_t, o_t) = \dfrac{p(o_t \mid z_t, h_t)\, p(z_t \mid h_t)}{p(o_t \mid h_t)}$. The observation $o_t$ sharpens the distribution over z by eliminating latent states that couldn't have produced this observation.
3. For training: We approximate this true posterior with `q(z|h,o) = neural_net(h, encode(o))`. The ELBO gives $\log p(o) \ge \mathbb{E}_q[\log p(o \mid z)] - \mathrm{KL}(q \parallel p)$. Sampling z from q (the posterior) keeps this bound tight. Sampling from p (the prior) would mean drawing z blindly and hoping some samples explain o, which is astronomically wasteful.
4. Information gain: Averaged over observations, $\mathrm{KL}(q \parallel p)$ estimates the conditional mutual information $I(z_t; o_t \mid h_t)$, that is, how much information the observation provides about the latent. A large KL means the observation is highly informative: the prior was surprised.
The key insight: The posterior is the "teacher" that shows the prior what z values actually correspond to real observations. Training with the prior alone would be like studying for an exam without an answer key — you'd never know which of your guesses were correct.
Dreamer trains its policy by backpropagating through H imagined steps. The actor πθ selects actions, the world model produces next states, and a critic estimates values.
Your task: Write out the objective J(θ) that the actor maximizes during dreaming, and show why gradients can flow through the imagination (unlike model-free RL where they can't).
Full derivation:
1. The objective: $J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{H} \gamma^{t-1}\,\hat{r}_t + \gamma^{H}\,V_\psi(h_H, z_H)\right]$
2. The computation graph (one step):
$a_t = \pi_\theta(h_t, z_t)$ [differentiable: policy is a neural net]
$h_{t+1} = \mathrm{GRU}(h_t, [z_t, a_t])$ [differentiable: GRU is a neural net]
$z_{t+1} \sim \mathrm{Cat}(\mathrm{prior}(h_{t+1}))$ [differentiable via straight-through]
$\hat{r}_{t+1} = \mathrm{reward\_net}(h_{t+1}, z_{t+1})$ [differentiable]
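3. Gradient for step t through step t+k: Writing $s_t = (h_t, z_t)$, the contribution of action $a_t$ to $\nabla_\theta \hat{r}_{t+k}$ is
$$\frac{\partial \hat{r}_{t+k}}{\partial s_{t+k}} \left( \prod_{i=1}^{k-1} \frac{d s_{t+i+1}}{d s_{t+i}} \right) \frac{\partial s_{t+1}}{\partial a_t} \frac{\partial a_t}{\partial \theta}, \qquad \frac{d s_{t+i+1}}{d s_{t+i}} = \frac{\partial s_{t+i+1}}{\partial s_{t+i}} + \frac{\partial s_{t+i+1}}{\partial a_{t+i}} \frac{\partial a_{t+i}}{\partial s_{t+i}}$$
with later steps multiplying on the left. The full gradient sums one such term for every action $a_t, \dots, a_{t+k-1}$.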
4. Why model-free can't do this: In model-free RL, $s_{t+1} = \mathrm{env}(s_t, a_t)$. The env function is the real world, so no Jacobian is available to differentiate through. You must use REINFORCE (high variance) or temporal-difference learning (bootstrapping). Dreamer replaces env() with a differentiable neural net, so direct backprop works.
The key insight: Dreamer makes actor training look much more like supervised learning. The actor gets a dense, low-variance gradient signal by differentiating through the world model, much as a classifier gets gradients through its loss. This is a large part of why it is more sample-efficient than policy gradient methods that rely on noisy reward-weighted log-probabilities.
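The chain rule through an imagined rollout can be checked on a toy problem. The sketch below assumes a scalar state, a linear policy $a_t = \theta s_t$, additive dynamics, and a quadratic reward, and it omits the critic's bootstrap term; none of this is Dreamer's actual model, it only shows that an analytic gradient through a differentiable model matches a finite-difference estimate.

```python
def rollout_with_grad(theta, s0=1.0, horizon=15, gamma=0.99):
    """Return J(theta) and dJ/dtheta for a toy differentiable 'world model'.

    policy:   a_t     = theta * s_t
    dynamics: s_{t+1} = s_t + a_t
    reward:   r_{t+1} = -(s_{t+1})**2
    """
    s, ds = s0, 0.0                 # state and its derivative w.r.t. theta
    J, dJ = 0.0, 0.0
    for t in range(horizon):
        a = theta * s
        da = s + theta * ds         # direct effect of theta + path through s_t
        s, ds = s + a, ds + da      # chain rule through the dynamics
        J += gamma ** t * -(s ** 2)
        dJ += gamma ** t * -2.0 * s * ds
    return J, dJ


theta = -0.3
J, analytic = rollout_with_grad(theta)
eps = 1e-6
numeric = (rollout_with_grad(theta + eps)[0]
           - rollout_with_grad(theta - eps)[0]) / (2 * eps)
```

In a real agent an autodiff framework performs the `ds`/`da` bookkeeping automatically; the point is that it is only possible because every step of the rollout is a known, differentiable function.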
For the forward pass, build a one-hot from a sampled class index (`z_onehot = F.one_hot(indices, 32)`); for gradient flow, use `z = z_onehot + probs - probs.detach()`. Then flatten the 32×32 one-hots to get `[batch, 1024]`.

```python
import torch
import torch.nn.functional as F


def rssm_step(prev_h, prev_z, action, observation,
              gru, prior_net, posterior_net, encoder):
    """One RSSM step; pass observation=None to imagine with the prior."""
    # Step 1: Deterministic state update. GRU input = concat(prev_z, action).
    gru_input = torch.cat([prev_z, action], dim=-1)        # [B, 1024+act_dim]
    h = gru(gru_input, prev_h)                             # [B, 200]; assumes an nn.GRUCell

    # Step 2: Prior (what the model expects from dynamics alone)
    prior_logits = prior_net(h).reshape(-1, 32, 32)        # [B, 32, 32]

    # Step 3: Posterior (correct the belief using the real observation)
    if observation is not None:
        embed = encoder(observation)                       # [B, embed_dim]
        post_input = torch.cat([h, embed], dim=-1)         # [B, 200+embed_dim]
        posterior_logits = posterior_net(post_input).reshape(-1, 32, 32)
        logits_to_sample = posterior_logits                # use posterior during training
    else:
        posterior_logits = None
        logits_to_sample = prior_logits                    # use prior during imagination

    # Step 4: Sample z with straight-through gradients
    probs = F.softmax(logits_to_sample, dim=-1)            # [B, 32, 32]
    # Draw one class per categorical; argmax here would make z deterministic.
    indices = torch.distributions.Categorical(logits=logits_to_sample).sample()  # [B, 32]
    z_onehot = F.one_hot(indices, 32).float()              # [B, 32, 32]
    # Straight-through: forward uses discrete, backward uses continuous
    z_st = z_onehot + probs - probs.detach()               # [B, 32, 32]
    z = z_st.reshape(-1, 32 * 32)                          # [B, 1024]
    return h, prior_logits, posterior_logits, z
```
The RSSM is a conditional VAE at each timestep, where the conditioning context is the deterministic history $h_t$. The prior p(z) in a standard VAE is fixed (usually $\mathcal{N}(0, 1)$); in the RSSM, it's learned: the prior network predicts what z should be from dynamics alone. This is what makes the "dreaming" mode possible: the learned prior replaces the observation.
In a standard VAE, KL(q||p) regularizes toward a fixed prior. In the RSSM, what does KL(posterior||prior) actually measure about the world model's understanding?
Yann LeCun proposed JEPA as an alternative to generative world models. Instead of predicting what the next observation looks like (pixel reconstruction), JEPA predicts the abstract representation of the next state. This avoids wasting capacity on irrelevant details.
The key difference from autoencoders: JEPA never reconstructs pixels. Both the target and prediction live in embedding space. A VICReg or similar loss prevents the embeddings from collapsing to trivial solutions.
JEPA has a surprisingly simple implementation with three components that must be carefully balanced:
The loss that prevents collapse has three components. Each solves a different failure mode:
| Term | What It Does | What Breaks Without It |
|---|---|---|
| Invariance | MSE between predicted $\hat{z}_{t+1}$ and target $z^{*}_{t+1}$ | Predictions wouldn't match reality |
| Variance | Forces std(z) ≥ 1 along each dimension | All embeddings collapse to a single point (constant output) |
| Covariance | Decorrelates embedding dimensions | All dimensions encode the same information (rank collapse) |
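The three terms in the table can be sketched in plain Python. This is a simplified, VICReg-style calculation that applies the variance and covariance terms to one set of embeddings only and uses tiny hand-made batches; the function name and the example data are assumptions for illustration.

```python
import math


def vicreg_terms(pred, target, eps=1e-4):
    """pred, target: lists of N embeddings, each a list of D floats."""
    n, d = len(pred), len(pred[0])
    # Invariance: mean squared error between predicted and target embeddings.
    invariance = sum((p - t) ** 2 for zp, zt in zip(pred, target)
                     for p, t in zip(zp, zt)) / (n * d)
    # Center each dimension over the batch.
    means = [sum(z[j] for z in pred) / n for j in range(d)]
    centered = [[z[j] - means[j] for j in range(d)] for z in pred]
    # Covariance matrix with the unbiased (n - 1) normalization.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Variance: hinge that penalizes any dimension whose std falls below 1.
    variance = sum(max(0.0, 1.0 - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d
    # Covariance: squared off-diagonal entries, pushing dimensions to decorrelate.
    covariance = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d
    return invariance, variance, covariance


collapsed = [[0.5, 0.5]] * 4                                    # every embedding identical
spread = [[-1.5, 1.5], [1.5, 1.5], [-1.5, -1.5], [1.5, -1.5]]   # varied, uncorrelated

inv_c, var_c, cov_c = vicreg_terms(collapsed, collapsed)
inv_s, var_s, cov_s = vicreg_terms(spread, spread)
```

The collapsed batch has a perfect invariance score yet a large variance penalty, which is exactly the trivial solution the extra terms are there to rule out.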
Compare: generative models predict pixels (expensive, noisy), JEPA predicts embeddings (cheap, abstract).
V-JEPA (Video JEPA, Meta 2024) extends the idea to video. Instead of predicting the embedding of the next frame, it predicts the embedding of a masked spacetime region:
The latest world models don't just predict latent states — they generate entire videos of what will happen next. This is world modeling at scale: train on millions of internet videos and learn a general-purpose simulator of the visual world.
Video generation models (like Sora) extend image generation to the temporal dimension. The key architectural idea: treat video as a 3D grid of spacetime patches:
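A rough sketch of the patch arithmetic follows. All sizes are illustrative assumptions, not figures from any real model, and the sketch patches raw pixels directly, whereas production systems may first compress the video into a latent grid.

```python
def spacetime_patch_grid(frames, height, width, pt, ph, pw):
    """Return (tokens along time, height, width) for non-overlapping patches."""
    assert frames % pt == 0 and height % ph == 0 and width % pw == 0
    return frames // pt, height // ph, width // pw


# Hypothetical clip: 16 frames of 256x256 RGB, patches of 2 frames x 16 x 16 pixels.
grid = spacetime_patch_grid(frames=16, height=256, width=256, pt=2, ph=16, pw=16)
num_tokens = grid[0] * grid[1] * grid[2]
values_per_token = 2 * 16 * 16 * 3       # pt * ph * pw * RGB channels
```

Each token then plays the role a word plays in a language model: the transformer attends across all of them, so the same network sees spatial and temporal structure at once.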
| Model | Key Idea | Training Data | Architecture |
|---|---|---|---|
| Genie (DeepMind) | Learn actions from unlabeled video | Internet gameplay videos | VQ-VAE + masked transformer |
| UniSim (Google) | Universal simulator of visual experience | Internet video + images | Cascaded diffusion |
| Sora (OpenAI) | Diffusion transformer, implicit physics | Internet video | DiT with spacetime patches |
| Cosmos (NVIDIA) | World foundation model for physical AI | Driving + robotics video | Autoregressive + diffusion |
Given the current frame and an action, the model predicts future frames. Watch how prediction quality degrades over longer horizons.
Genie's key insight is radical: you don't need labeled actions to build an interactive world model. From millions of unlabeled gameplay videos, Genie automatically discovers a latent action space:
Having a world model is only useful if you can use it to make decisions. Planning algorithms search through imagined futures to find the best action sequence. Major approaches:
| Method | How It Plans | Used In |
|---|---|---|
| Random Shooting | Sample many action sequences, pick best | PETS |
| CEM (Cross-Entropy) | Iteratively refine action distribution | PETS, TD-MPC |
| MCTS (Tree Search) | Build a search tree of states | MuZero, EfficientZero |
| Backprop through model | Gradient-based trajectory optimization | Dreamer |
The Cross-Entropy Method (CEM) is the workhorse planning algorithm. Here's exactly what happens at each decision step for a robot arm with 6 joints and a horizon of H=12 steps:
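A minimal CEM sketch for that setting (6 joints, H=12) is below. The additive toy dynamics, the target pose, and the population, elite, and iteration counts are assumptions; a real planner would call a learned latent model inside `rollout_cost`.

```python
import random

random.seed(0)
N_JOINTS, HORIZON = 6, 12
POP, ELITES, ITERS = 200, 20, 10        # population, elites kept, refinement rounds
TARGET = [0.5] * N_JOINTS               # desired joint angles (toy task)


def rollout_cost(plan, start):
    """Toy stand-in for a learned model: each action nudges the joint angles."""
    state, cost = list(start), 0.0
    for action in plan:                 # plan: HORIZON actions, each N_JOINTS long
        state = [s + a for s, a in zip(state, action)]
        cost += sum((s - g) ** 2 for s, g in zip(state, TARGET))
    return cost


def unflatten(flat):
    return [flat[t * N_JOINTS:(t + 1) * N_JOINTS] for t in range(HORIZON)]


def cem_plan(start):
    dims = HORIZON * N_JOINTS
    mean, std = [0.0] * dims, [0.3] * dims
    for _ in range(ITERS):
        # 1. Sample candidate action sequences from the current distribution.
        samples = [[random.gauss(m, s) for m, s in zip(mean, std)] for _ in range(POP)]
        # 2. Score each one by imagining its rollout, then keep the elites.
        samples.sort(key=lambda flat: rollout_cost(unflatten(flat), start))
        elite = samples[:ELITES]
        # 3. Refit the sampling distribution to the elites.
        mean = [sum(e[i] for e in elite) / ELITES for i in range(dims)]
        std = [max(1e-3, (sum((e[i] - mean[i]) ** 2 for e in elite) / ELITES) ** 0.5)
               for i in range(dims)]
    return unflatten(mean)


start = [0.0] * N_JOINTS
plan = cem_plan(start)
first_action = plan[0]                  # execute only this, then replan next step
cost_planned = rollout_cost(plan, start)
cost_idle = rollout_cost([[0.0] * N_JOINTS] * HORIZON, start)
```

Only `first_action` is executed; at the next real step the whole procedure runs again from the new state, which is what makes this decision-time planning.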
The agent imagines many possible futures (gray) and picks the one with the highest reward (green). Click to re-plan.
Dreamer and CEM-based methods (like TD-MPC) represent fundamentally different approaches to using a world model:
| Property | Dreamer (Backprop) | CEM (Sampling) |
|---|---|---|
| How it plans | Gradient of reward w.r.t. policy through imagined trajectory | Sample many action sequences, keep the best |
| Produces | An amortized policy (reusable) | A plan for right now (replan each step) |
| Cost at test time | One forward pass (cheap) | Thousands of forward passes (expensive) |
| Handles | Continuous + discrete actions | Continuous actions primarily |
| Weakness | Policy may be suboptimal in novel states | Expensive at decision time, action space must be small |
Real-world solutions (NVIDIA Cosmos, Wayve GAIA-1, Tesla):
Latent size: ~256-512 dims for the state. Bird's-eye-view (BEV) representation: 200×200 grid at 0.5 m resolution with 64-128 channels, compressed to a ~512-dim summary for temporal dynamics. The full BEV is used for spatial reasoning; the compressed state for temporal rollout.
Sensor fusion: Bird's Eye View (BEV) is the winning approach. Each camera is projected into a shared BEV grid using LSS (Lift-Splat-Shoot) or BEVFormer-style cross-attention. LiDAR provides direct 3D supervision. This gives a unified spatial representation regardless of camera count.
Prediction space: Industry uses BOTH. A fast latent model for planning (runs in ~20ms for 30 steps), plus a slower video prediction model for visualization and validation. The latent model predicts occupancy, velocity fields, and agent trajectories — not pixels.
Planning budget: 512 candidates × 30 steps = 15,360 forward passes. At ~2 μs per latent step on Orin, that's ~30 ms, which is feasible. But most production systems use a learned policy (like Dreamer) for roughly 90% of situations, falling back to CEM only for complex scenarios. This gives a ~5 ms typical + 50 ms worst-case split.
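The budget arithmetic can be reproduced directly. The sketch reads "512×30" as 512 candidate sequences of 30 steps each, and the blended average at the end additionally assumes that every fallback case costs the full worst-case latency; both readings are assumptions.

```python
candidates, horizon = 512, 30            # figures quoted in the text
step_latency_us = 2.0                    # ~2 microseconds per latent step (quoted)

forward_passes = candidates * horizon
cem_latency_ms = forward_passes * step_latency_us / 1000.0

# Hybrid budget: fast learned policy most of the time, CEM as the fallback.
policy_ms, cem_worst_ms, policy_share = 5.0, 50.0, 0.90
expected_ms = policy_share * policy_ms + (1 - policy_share) * cem_worst_ms

print(f"{forward_passes} forward passes, {cem_latency_ms:.2f} ms per CEM plan")
print(f"blended average latency: {expected_ms:.1f} ms")
```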
Uncertainty: Ensemble of 5 world models. When they disagree on the predicted BEV occupancy by more than a threshold, the system flags "unreliable prediction" and reverts to a conservative emergency policy. This is epistemic uncertainty — knowing what you don't know.
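The disagreement check can be sketched as follows. The per-cell standard deviation as the disagreement measure, the 0.15 threshold, and the three-cell "occupancy grids" are illustrative assumptions; a production system would operate on full BEV grids and a tuned threshold.

```python
from statistics import pstdev


def needs_fallback(ensemble_predictions, threshold=0.15):
    """ensemble_predictions: one list of per-cell occupancy probabilities per model."""
    per_cell_std = [pstdev(cell) for cell in zip(*ensemble_predictions)]
    disagreement = max(per_cell_std)     # worst-case spread across the grid
    return disagreement > threshold, disagreement


# Five models that roughly agree on every cell.
agree = [[0.90, 0.10, 0.05], [0.88, 0.12, 0.04], [0.91, 0.09, 0.06],
         [0.89, 0.11, 0.05], [0.90, 0.10, 0.05]]
# Five models that split on the first two cells (an unfamiliar scene).
disagree = [[0.90, 0.10, 0.05], [0.20, 0.12, 0.04], [0.85, 0.80, 0.06],
            [0.10, 0.11, 0.05], [0.95, 0.10, 0.05]]

flag_ok, spread_ok = needs_fallback(agree)
flag_bad, spread_bad = needs_fallback(disagree)
```

When the flag is raised, the planner would hand control to the conservative policy rather than trust any single model's rollout.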
World models are powerful but far from solved. Key challenges remain:
| Problem | Why It's Hard | Current Approaches |
|---|---|---|
| Compounding errors | Prediction errors accumulate over long rollouts | Shorter horizons, latent space, ensembles |
| Partial observability | The agent can't see everything | Recurrent state (RSSM), memory |
| Stochastic environments | Multiple futures are possible | Stochastic latents, discrete codes |
| Generalization | Transfer between environments | Foundation world models (Genie, Cosmos) |
| Computational cost | Planning is expensive at test time | Amortized policies, model distillation |
If your world model has 2% per-step prediction error, how bad does it get over long horizons?
| Horizon | Cumulative Error | Usable? |
|---|---|---|
| H = 5 | ~10% | Solid — plans are reliable |
| H = 15 | ~26% | Dreamer's sweet spot — noisy but useful |
| H = 30 | ~45% | Degraded — imagination diverges from reality |
| H = 50 | ~64% | Broken — model is hallucinating entire scenarios |
This is why Dreamer uses H=15 by default. Beyond that, the imagined world is too different from reality for the policy to transfer. The error compounds roughly as $1 - (1 - \varepsilon)^H$, so even a "good" model becomes unreliable if you dream too far ahead.
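The table above follows from that formula with a 2% per-step error. A short sketch that recomputes it (the function name is hypothetical, and the formula is the rough approximation stated in the text, not a rigorous error bound):

```python
def cumulative_error(eps, horizon):
    """Rough compounding estimate: 1 - (1 - eps)^H."""
    return 1.0 - (1.0 - eps) ** horizon


table = {H: cumulative_error(0.02, H) for H in (5, 15, 30, 50)}
for H, err in table.items():
    print(f"H = {H:2d}: {err:.1%}")
```

Changing the first argument shows how sensitive the usable horizon is to model quality: halving the per-step error roughly doubles the horizon at which a given cumulative error is reached.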
Watch how prediction error grows with each imagined step. The red band shows the uncertainty growing. Move the slider to see how model quality affects the usable horizon.
Compounding error gets dramatically worse when the agent encounters situations outside its training distribution. Consider a robot trained in a warehouse:
| Scenario | Model Quality | Why |
|---|---|---|
| Picking known boxes | Excellent | Seen thousands of times in training |
| New box shape | Good | Similar dynamics, model generalizes |
| Wet floor | Poor | Different friction — model never saw this |
| Human walks through workspace | Terrible | Completely novel dynamic object — imagination hallucinates |
World models sit at the intersection of reinforcement learning, generative modeling, and representation learning. They're the foundation for agents that can reason about consequences before acting — a capability that separates reactive systems from truly intelligent ones.
| Approach | Prediction Space | Planning Method | Data Efficiency | Best For |
|---|---|---|---|---|
| Dreamer (RSSM) | Latent (1224-dim) | Backprop through model | Very high | Continuous control, Atari, Minecraft |
| MuZero | Latent (abstract) | MCTS tree search | High | Board games, discrete Atari |
| JEPA | Embedding space | Not yet integrated | N/A (repr. learning) | Video understanding, pretraining |
| Sora / Cosmos | Pixel space (video) | Not yet used for planning | Low (needs internet-scale data) | Video generation, simulation |
| TD-MPC2 | Latent | CEM + value function | High | Robotics, multi-task control |
World models connect deeply to other topics on this site:
• Diffusion Models — Sora and Cosmos use diffusion as the generation backbone for video world models.
• VLAs (Vision-Language-Action) — World models provide the "imagination engine" for robot planning.
• RL Algorithms — Dreamer is model-based RL; compare with model-free PPO and SAC.
• Contrastive Learning (CLIP) — JEPA's VICReg loss is a non-contrastive alternative to CLIP's InfoNCE.
• MDPs — World models learn the transition function T(s'|s,a) that defines an MDP.
You now understand how AI learns to dream. The ability to simulate the future in imagination — to ask "what if?" before committing to action — may be the most important capability an intelligent agent can have.