The Complete Beginner's Path

Understand World Models

How AI learns an internal simulator of the world — predicting what happens next, imagining consequences, and planning without trial and error.

Prerequisites: Basic neural network intuition + Interest in how agents plan. That's it.
9 Chapters · 8+ Simulations · 0 RL Background Needed

Chapter 0: Imagining the Future

Before you catch a ball, your brain simulates its trajectory. You don't need to try every possible arm position — you predict where the ball will be and move accordingly. This internal simulation is a world model.

An AI world model does the same thing: given the current state and an action, it predicts what the next state will be. If the model is accurate enough, the agent can plan in imagination instead of learning through costly real-world trial and error.

The core idea: Instead of learning what to do by trying things (model-free RL), learn how the world works first (world model), then plan inside your head. Imagination is free; real-world mistakes are expensive.
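The predict-then-imagine loop above can be sketched in a few lines of Python. Here the world model is hand-coded toy physics (a made-up thrust-and-gravity rule, purely illustrative); in practice it would be a learned network:

```python
# A minimal sketch of the world-model idea for a toy 1-D ball.
# state = (position, velocity); the "world model" here is hand-coded,
# but in a real agent it would be learned from experience.

def world_model(state, action):
    """Predict the next state given the current state and an action."""
    pos, vel = state
    vel = vel + action - 0.1   # action is thrust; 0.1 is an assumed gravity pull
    pos = pos + vel
    return (pos, vel)

def imagine(state, actions):
    """Roll out an imagined trajectory without touching the real world."""
    trajectory = [state]
    for a in actions:
        state = world_model(state, a)
        trajectory.append(state)
    return trajectory

traj = imagine((0.0, 0.0), [0.3, 0.3, 0.0])
print(traj)   # three imagined steps, zero real-world interactions
```

The agent can call `imagine` as many times as it likes to compare candidate action sequences before committing to one in reality.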
Real vs Imagined Trajectories

The teal ball follows real physics. The orange trails are imagined futures from the world model. Click to launch a new ball.

Check: What does a world model predict?

Chapter 1: Model-Based Reinforcement Learning

In model-free RL, the agent learns a policy directly from experience: try things, get rewards, adjust. In model-based RL, the agent first learns a dynamics model (how the world works), then uses that model to plan or generate synthetic experience.

Model-Free
state → policy → action (learned from real experience)
vs
Model-Based
state → world model → imagined rollouts → plan → action
Sample Efficiency: Model-Free vs Model-Based

Model-based methods learn faster because they extract more from each real experience. Adjust the model accuracy to see the trade-off.

Model accuracy: 70%
The trade-off: Model-based RL is more sample-efficient (needs fewer real interactions) but the model can be wrong. A bad world model leads to plans that fail in reality. The key challenge is learning accurate enough models.
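A toy sketch of this loop, with an assumed 1-D environment whose dynamics the model estimates from observed transitions (all names and numbers are illustrative, not from any benchmark):

```python
# Model-based loop: learn dynamics from real transitions, then plan
# by imagining candidate actions inside the learned model.

def real_env(state, action):
    return state + action * 2.0        # true (unknown) dynamics

def fit_model(data):
    """Estimate the effect of an action by averaging observed deltas."""
    if not data:
        return lambda s, a: s + a      # prior guess before any data
    gains = [(s2 - s1) / a for s1, a, s2 in data if a != 0]
    gain = sum(gains) / len(gains)
    return lambda s, a: s + a * gain

def plan(model, state, goal, candidates=(-1.0, -0.5, 0.5, 1.0)):
    """Pick the action whose imagined next state lands closest to the goal."""
    return min(candidates, key=lambda a: abs(model(state, a) - goal))

data, state, goal = [], 0.0, 4.0
for _ in range(4):
    action = plan(fit_model(data), state, goal)   # plan in imagination
    next_state = real_env(state, action)          # one real interaction
    data.append((state, action, next_state))
    state = next_state
print(state)   # the agent reaches the goal in a handful of real steps
```

Note how each real transition improves the model, and the model in turn makes every subsequent real step count for more: that is the sample-efficiency argument in miniature.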
Check: What is the main advantage of model-based over model-free RL?

Chapter 2: Latent Dynamics

Predicting the next image pixel-by-pixel is expensive and wasteful — most pixels don't matter for decision-making. Instead, modern world models work in a latent space: they encode observations into compact representations and predict dynamics in that compressed space.

zₜ = encode(oₜ)  →  ẑₜ₊₁ = dynamics(zₜ, aₜ)  →  ôₜ₊₁ = decode(ẑₜ₊₁)
Pixel Space vs Latent Space

Watch how a high-dimensional observation (left) is compressed into a small latent vector (middle), and prediction happens there. Toggle between pixel and latent prediction.

Why latent? A 64×64 RGB image has 12,288 dimensions. A latent vector might have 200. Predicting in latent space is 60× cheaper, and the encoder learns to keep only decision-relevant information (object positions, velocities) while discarding noise (exact textures, shadows).
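The shapes involved can be sketched with random matrices standing in for the learned encoder, dynamics, and decoder (the dimensions follow the example above; nothing is trained here):

```python
# Dimensional sketch of latent-space prediction (shapes only; the
# random matrices are stand-ins for learned networks).
import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim = 12288, 200          # 64x64x3 image vs latent vector

encoder  = rng.normal(size=(latent_dim, obs_dim)) / obs_dim**0.5
dynamics = rng.normal(size=(latent_dim, latent_dim)) / latent_dim**0.5
decoder  = rng.normal(size=(obs_dim, latent_dim)) / latent_dim**0.5

o_t = rng.normal(size=obs_dim)            # current observation, flattened
z_t = encoder @ o_t                       # encode: 12288 -> 200
z_next = dynamics @ z_t                   # predict in latent space (cheap)
o_next = decoder @ z_next                 # decode only when pixels are needed

print(o_t.shape, z_t.shape, o_next.shape)   # (12288,) (200,) (12288,)
```

The dynamics step, which must run at every imagined timestep, touches only the 200-dimensional latent; decoding back to pixels is optional and can be skipped entirely during planning.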
Check: Why do world models predict in latent space instead of pixel space?

Chapter 3: Dreamer (v1 → v3)

Dreamer is the most successful family of world-model agents. The architecture has three components: a world model (RSSM), an actor (policy), and a critic (value function). The key innovation: the actor and critic are trained entirely inside the world model's imagination.

1. Learn World Model
Train RSSM on real experience to predict latent dynamics
2. Dream
Roll out imagined trajectories using the learned model
3. Learn Policy
Train actor-critic on imagined trajectories
Version | Year | Key Improvement
Dreamer v1 | 2020 | Latent imagination + value estimation
Dreamer v2 | 2021 | Discrete latents, KL balancing, Atari mastery
Dreamer v3 | 2023 | Symlog predictions, fixed hyperparameters across domains
Dreamer's Imagination Rollout

The agent imagines future states from the current state. Teal = real states, orange = imagined futures, green = rewards predicted.

Imagination horizon: 12
DreamerV3's breakthrough: It was the first algorithm to collect diamonds in Minecraft from scratch — a task requiring long-horizon planning, exploration, and tool crafting — all learned through imagination.
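Step 2 of the loop, the dream phase, can be sketched as rolling the learned dynamics forward under the current actor and scoring the result. All functions here are toy stand-ins, not Dreamer's actual networks:

```python
# Sketch of the "dream" phase: starting from a real latent state, roll
# the learned dynamics forward under the current actor and accumulate
# the discounted imagined reward.

def dream(z0, actor, dynamics, reward_model, horizon=12, gamma=0.99):
    """Imagine one trajectory and return its discounted return."""
    z, total, discount = z0, 0.0, 1.0
    for _ in range(horizon):
        a = actor(z)                  # policy acts on the latent state
        z = dynamics(z, a)            # imagined next latent state
        total += discount * reward_model(z)
        discount *= gamma
    return total

# Toy instantiation: 1-D latent, reward for staying near zero.
ret = dream(
    z0=1.0,
    actor=lambda z: -0.5 * z,        # proportional controller as the "actor"
    dynamics=lambda z, a: z + a,     # latent drifts by the action
    reward_model=lambda z: -abs(z),
)
print(ret)
```

In the real algorithm, gradients of such imagined returns flow back through the differentiable dynamics to update the actor, and the critic bootstraps beyond the 12-step horizon.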
Check: Where does Dreamer train its policy?

Chapter 4: JEPA — Joint Embedding Predictive Architecture

Yann LeCun proposed JEPA as an alternative to generative world models. Instead of predicting what the next observation looks like (pixel reconstruction), JEPA predicts the abstract representation of the next state. This avoids wasting capacity on irrelevant details.

predict: embed(xₜ₊₁) ≈ predictor(embed(xₜ), aₜ)

The key difference from autoencoders: JEPA never reconstructs pixels. Both the target and prediction live in embedding space. A VICReg or similar loss prevents the embeddings from collapsing to trivial solutions.
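A minimal numeric sketch of this objective, with linear maps standing in for the networks and the anti-collapse (VICReg-style) term omitted for brevity (these are assumptions, not the reference implementation):

```python
# JEPA-style objective sketch: the loss compares predicted and target
# *embeddings*; no decoder, no pixel reconstruction anywhere.
import numpy as np

rng = np.random.default_rng(0)
D, E = 32, 8                               # observation dim, embedding dim
embed = rng.normal(size=(E, D)) / D**0.5   # encoder (the target branch uses
                                           # a frozen/EMA copy in practice)
predictor = rng.normal(size=(E, E + 1)) / E**0.5

x_t, x_next, a_t = rng.normal(size=D), rng.normal(size=D), 0.7

# predictor(embed(x_t), a_t): predict the next embedding from the
# current embedding plus the action.
pred = predictor @ np.concatenate([embed @ x_t, [a_t]])
target = embed @ x_next                    # embed(x_{t+1}): no pixels involved
loss = float(np.mean((pred - target) ** 2))  # loss lives in embedding space
print(loss)
```

Contrast this with an autoencoder-style world model, where the loss would be computed between a decoded image and `x_next` in the full 32-dimensional (here; 12,288-dimensional for real images) observation space.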

JEPA vs Generative Prediction

Compare: generative models predict pixels (expensive, noisy), JEPA predicts embeddings (cheap, abstract).

LeCun's vision: JEPA is a stepping stone toward autonomous machine intelligence. By learning abstract world models that capture what matters for planning (not pixel-perfect reconstruction), agents could reason at a human-like level of abstraction.
Check: What does JEPA predict?

Chapter 5: Video Prediction — Genie & UniSim

The latest world models don't just predict latent states — they generate entire videos of what will happen next. This is world modeling at scale: train on millions of internet videos and learn a general-purpose simulator of the visual world.

Model | Key Idea | Training Data
Genie (DeepMind) | Learns actions from unlabeled video; playable worlds | Internet gameplay videos
UniSim (Google) | Universal simulator of visual experience | Internet video + images
Sora (OpenAI) | Diffusion transformer for video; implicit physics | Internet video
Cosmos (NVIDIA) | World foundation model for physical AI | Driving + robotics video
Video World Model: Frame Prediction

Given the current frame and an action, the model predicts future frames. Watch how prediction quality degrades over longer horizons.

Prediction horizon: 4
From video model to world model: If a video model can predict what happens when you push a cup off a table (it falls), it has implicitly learned gravity. Video prediction at scale may be the path to general-purpose physical understanding.
Check: What makes Genie special compared to traditional world models?

Chapter 6: Planning with World Models

Having a world model is only useful if you can use it to make decisions. Planning algorithms search through imagined futures to find the best action sequence. Major approaches:

Method | How It Plans | Used In
Random Shooting | Sample many action sequences, pick the best | PETS
CEM (Cross-Entropy) | Iteratively refine an action distribution | PETS, TD-MPC
MCTS (Tree Search) | Build a search tree of states | MuZero, EfficientZero
Backprop through model | Gradient-based trajectory optimization | Dreamer
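Random shooting, the simplest method in the table, can be sketched in a few lines (toy dynamics and reward; the task and numbers are assumptions):

```python
# Planning by random shooting: imagine many random action sequences
# inside the model and execute the first action of the best one.
import random

def random_shooting(model, reward, state, horizon=5, n_samples=30):
    """Sample action sequences, imagine each, return the best first action."""
    best_return, best_action = float("-inf"), None
    for _ in range(n_samples):
        seq = [random.uniform(-1, 1) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s = model(s, a)        # imagined step
            total += reward(s)     # imagined reward
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action

random.seed(0)
# Toy task: drive a 1-D state toward 3.0.
act = random_shooting(model=lambda s, a: s + a,
                      reward=lambda s: -abs(s - 3.0),
                      state=0.0)
print(act)
```

CEM improves on this by refitting the sampling distribution to the top-scoring sequences and sampling again, rather than drawing every sequence uniformly at random.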
Planning by Random Shooting

The agent imagines many possible futures (gray) and picks the one with the highest reward (green). Click to re-plan.

Samples: 30
MuZero's triumph: DeepMind's MuZero learns a world model and uses MCTS planning to master Go, Chess, Shogi, and Atari, all with the same algorithm and without being given the rules. It imagines game states and searches for the best move.
Check: How does CEM (Cross-Entropy Method) plan?

Chapter 7: Open Problems

World models are powerful but far from solved. Key challenges remain:

Problem | Why It's Hard | Current Approaches
Compounding errors | Prediction errors accumulate over long rollouts | Shorter horizons, latent space, ensembles
Partial observability | The agent can't see everything | Recurrent state (RSSM), memory
Stochastic environments | Multiple futures are possible | Stochastic latents, discrete codes
Generalization | Transfer between environments | Foundation world models (Genie, Cosmos)
Computational cost | Planning is expensive at test time | Amortized policies, model distillation
Compounding Error over Horizon

Watch how prediction error grows with each imagined step. The red band shows the uncertainty growing.

Model quality: 60%
The fundamental tension: Long planning horizons give better decisions but accumulate more error. Short horizons are accurate but myopic. Finding the sweet spot — or making models accurate enough for long horizons — is the central open problem.
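A quick numeric illustration of the tension: give the model a small assumed per-step bias (5% here) and watch its gap from the true trajectory grow with the horizon:

```python
# Compounding error demo: a model that is slightly wrong at every step
# drifts further and further from reality as the rollout lengthens.

def rollout(dynamics, state, steps):
    traj = [state]
    for _ in range(steps):
        state = dynamics(state)
        traj.append(state)
    return traj

real  = rollout(lambda s: 1.1 * s, 1.0, 10)          # true dynamics
model = rollout(lambda s: 1.1 * 1.05 * s, 1.0, 10)   # assumed 5% per-step bias

errors = [abs(m - r) for m, r in zip(model, real)]
print([round(e, 3) for e in errors])   # error grows monotonically with horizon
```

Even this tiny bias compounds multiplicatively: the one-step error is small, but by step 10 the imagined state is far from the real one, which is exactly why long imagined rollouts are risky.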
Check: Why do world model predictions degrade over long horizons?

Chapter 8: The Big Picture

World models sit at the intersection of reinforcement learning, generative modeling, and representation learning. They're the foundation for agents that can reason about consequences before acting — a capability that separates reactive systems from truly intelligent ones.

Generative AI
Image/video generation → visual world models
+
RL / Planning
Decision-making → acting in imagined worlds
=
World Models
Agents that imagine, plan, and act intelligently
"A world model is a predictive engine that allows an agent to imagine the consequences of actions without performing them."
— David Ha & Jürgen Schmidhuber

You now understand how AI learns to dream. The ability to simulate the future in imagination — to ask "what if?" before committing to action — may be the most important capability an intelligent agent can have.

Check: What is the unifying idea behind all world models?