The Complete Beginner's Path

Understand World Models

How AI learns an internal simulator of the world — predicting what happens next, imagining consequences, and planning without trial and error.

Prerequisites: Basic neural network intuition + Interest in how agents plan. That's it.
9 Chapters · 8+ Simulations · 0 RL Background Needed

Chapter 0: Imagining the Future

Before you catch a ball, your brain simulates its trajectory. You don't need to try every possible arm position — you predict where the ball will be and move accordingly. This internal simulation is a world model.

An AI world model does the same thing: given the current state and an action, it predicts what the next state will be. If the model is accurate enough, the agent can plan in imagination instead of learning through costly real-world trial and error.

The core idea: Instead of learning what to do by trying things (model-free RL), learn how the world works first (world model), then plan inside your head. Imagination is free; real-world mistakes are expensive.

Why This Matters: The Cost of Real Experience

A self-driving car can't learn to avoid pedestrians by hitting a few first. A surgical robot can't learn by botching operations. A warehouse robot can't learn by dropping expensive packages. Every real interaction has a cost — time, money, safety risk. A world model lets you make millions of mistakes inside your head before taking a single real action.

Model-Free RL: Learn by doing.
• 1 real step = 1 learning step
• Need millions of interactions
• Expensive, slow, dangerous
Model-Based RL: Learn by imagining.
• 1 real step = H imagined steps
• Need thousands of interactions
• Cheap, fast, safe
Real vs Imagined Trajectories

The teal ball follows real physics. The orange trails are imagined futures from the world model. Click to launch a new ball.

Check: What does a world model predict?

Chapter 1: Model-Based Reinforcement Learning

Before we dive in, let's define three RL terms we'll use throughout this lesson. A policy is a rule that tells the agent what action to take in each situation — think of it as the agent's strategy or decision-making function. A reward is a number the environment gives after each action, telling the agent how well it did. Reinforcement learning (RL) is the process of learning a good policy by trial and error — trying actions, observing rewards, and adjusting the strategy.

With those in hand: in model-free RL, the agent learns a policy directly from experience — try things, get rewards, adjust. In model-based RL, the agent first learns a dynamics model (a simulator of how the world works), then uses that model to plan or generate synthetic experience.

Model-Free
state → policy → action (learned from real experience)
vs
Model-Based
state → world model → imagined rollouts → plan → action
Sample Efficiency: Model-Free vs Model-Based

Model-based methods learn faster because they extract more from each real experience. Adjust the model accuracy to see the trade-off.

Model accuracy: 70%
The trade-off: Model-based RL is more sample-efficient (needs fewer real interactions) but the model can be wrong. A bad world model leads to plans that fail in reality. The key challenge is learning accurate enough models.

The Data Efficiency Gap

How big is the difference in practice? On the Atari 100K benchmark (only 100K environment steps allowed — about 2 hours of human play):

Method | Type | Human-Normalized Score | Env Steps
Random Agent | - | 0% | -
DQN | Model-free | 15% | 100K
PPO | Model-free | 20% | 100K
DreamerV2 | Model-based | 51% | 100K
EfficientZero | Model-based | 116% | 100K
Human | - | 100% | -
EfficientZero beats humans with just 2 hours of play. DQN, a model-free method, barely scratches the surface with the same data. The difference: EfficientZero learns a world model and imagines millions of additional game states. Every real experience gets amplified through imagination.

The World Model Function Signature

At its core, a world model is just a function with a very specific signature:

$s_{t+1}, r_t = f(s_t, a_t)$

Given the current state $s_t$ and an action $a_t$, predict the next state $s_{t+1}$ and the reward $r_t$. Everything else (RSSM, latent spaces, categorical distributions) is engineering to make this function more accurate, more efficient, or more general.
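To make the signature concrete, here is a minimal sketch of a hand-written world model for a toy 1D task. The task (a point on a line with a goal at position 5), the names `world_model`, `imagine`, and `GOAL`, and the reward definition are all illustrative assumptions. A learned model would replace the function body with a neural network.

```python
from typing import List, Tuple

GOAL = 5  # hypothetical goal position for this toy example


def world_model(state: int, action: int) -> Tuple[int, float]:
    """Toy f(s_t, a_t) -> (s_{t+1}, r_t) for a point moving on a line.

    state:  integer position
    action: -1 (step left), 0 (stay), or +1 (step right)
    """
    next_state = state + action
    # Reward is the negative distance to the goal after the move.
    reward = -float(abs(GOAL - next_state))
    return next_state, reward


def imagine(state: int, actions: List[int]) -> Tuple[int, float]:
    """Roll the model forward over a list of actions; no real environment is touched."""
    total = 0.0
    for action in actions:
        state, reward = world_model(state, action)
        total += reward
    return state, total


final_state, imagined_return = imagine(0, [1, 1, 1])
print(final_state, imagined_return)
```

The `imagine` helper is the whole idea of "planning in your head" in miniature: it evaluates a candidate action sequence using only calls to the model.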

Three Ways to Use a World Model

Once you have a dynamics model, there are three distinct strategies for exploiting it:

Strategy | How It Works | Example
Background planning | Generate synthetic data, add it to the replay buffer (a memory bank that stores past experiences so the agent can re-learn from them), and train a model-free agent on mixed real + imagined data | Dyna-Q, MBPO
Decision-time planning | At each real step, imagine many futures, pick the best first action | CEM, MCTS, MuZero
Analytic gradients | Backpropagate reward gradients through the model to optimize the policy | Dreamer, SVG
Each has trade-offs: Background planning is simple but slow to propagate information. Decision-time planning is powerful but expensive at every step. Analytic gradients are elegant but require the model to be differentiable end-to-end. Dreamer uses the third approach; MuZero uses the second.
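As a rough illustration of decision-time planning, the sketch below enumerates every short action sequence, scores each one inside a toy model, and returns the first action of the best one. The brute-force enumeration is a deliberately simple stand-in for CEM or MCTS, and the toy model, `GOAL`, and function names are hypothetical.

```python
from itertools import product
from typing import Tuple

GOAL = 5  # hypothetical goal position


def toy_model(state: int, action: int) -> Tuple[int, float]:
    """Stand-in world model: move on a line, reward = -distance to GOAL."""
    next_state = state + action
    return next_state, -float(abs(GOAL - next_state))


def plan_first_action(state: int, horizon: int = 5) -> int:
    """Imagine every action sequence of length `horizon` and return
    the first action of the sequence with the highest imagined return."""
    best_return = float("-inf")
    best_first_action = 0
    for actions in product([-1, 0, 1], repeat=horizon):
        s, total = state, 0.0
        for a in actions:
            s, r = toy_model(s, a)
            total += r
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action


print(plan_first_action(0), plan_first_action(10), plan_first_action(5))
```

Only the first action is executed in the real environment; the planner then runs again from the next real state. That replanning at every step is what makes this strategy powerful but expensive.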
Check: What is the main advantage of model-based over model-free RL?

Chapter 2: Latent Dynamics

Predicting the next image pixel-by-pixel is expensive and wasteful — most pixels don't matter for decision-making. Instead, modern world models work in a latent space: they encode observations into compact representations and predict dynamics in that compressed space.

$z_t = \text{encode}(o_t) \;\rightarrow\; \hat{z}_{t+1} = \text{dynamics}(z_t, a_t) \;\rightarrow\; \hat{o}_{t+1} = \text{decode}(\hat{z}_{t+1})$
Pixel Space vs Latent Space

Watch how a high-dimensional observation (left) is compressed into a small latent vector (middle), and prediction happens there. Toggle between pixel and latent prediction.

Why latent? A 64×64 RGB image has 12,288 dimensions. A latent vector might have 200. That is roughly 60× fewer numbers to predict at every step, and the encoder learns to keep only decision-relevant information (object positions, velocities) while discarding noise (exact textures, shadows).

Concrete Shapes: Latent World Model

Let's trace the exact tensor shapes in a typical latent dynamics model (as used in PlaNet/Dreamer):

Raw Observation
$o_t$: [64, 64, 3] = 12,288 dimensions
↓ CNN encoder (4 conv layers)
Encoder Output
conv4: [4, 4, 256] = 4,096 dims → flatten → [4096]
Linear(4096, 200) → $z_t$: [200]
↓ dynamics model $f(z_t, a_t)$
Latent Prediction
MLP: [200 + action_dim] → 200 → 200
$\hat{z}_{t+1}$: [200]
↓ decoder (optional, for visualization)
Reconstructed Observation
Linear(200, 4096) → reshape → deconv → [64, 64, 3]
The compression ratio: From 12,288 dims to 200 dims = 61× compression. And prediction in this space is just an MLP forward pass — microseconds, not milliseconds. That's why you can imagine thousands of future steps in the time it takes to render one real frame.
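The shape arithmetic above is easy to verify. The short check below uses only the shapes quoted in the trace; the variable names are illustrative.

```python
from math import prod

obs_shape = (64, 64, 3)        # raw RGB observation
conv_out_shape = (4, 4, 256)   # output of the last conv layer
latent_dim = 200               # size of the latent vector z_t

obs_dims = prod(obs_shape)
conv_dims = prod(conv_out_shape)
compression = obs_dims / latent_dim
# Weights plus biases of the Linear(4096, 200) projection.
projection_params = conv_dims * latent_dim + latent_dim

print(obs_dims, conv_dims, round(compression, 2), projection_params)
```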

What Gets Kept, What Gets Discarded?

The encoder learns to keep decision-relevant information and discard everything else:

Kept (in the latent):
• Object positions and velocities
• Agent state (health, inventory)
• Goal-relevant features
• Dynamic elements that change with actions
Discarded:
• Exact pixel colors and textures
• Shadow patterns, lighting artifacts
• Background details that never change
• Sub-pixel rendering differences

This is why latent world models work so well for RL: the agent doesn't need to predict the exact shade of green on a leaf to decide whether to jump over a pit. It only needs to know: "pit ahead, 3 units away."

Training the Latent World Model

The world model is trained on three losses simultaneously, each teaching it something different:

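$L = L_{\text{recon}} + \beta \cdot L_{\text{KL}} + L_{\text{reward}}$
Loss Term | What It Teaches | Typical Weight
$L_{\text{recon}} = \lVert \hat{o}_t - o_t \rVert^2$ | Encode enough to reconstruct the observation | 1.0
$L_{\text{KL}} = \mathrm{KL}(\text{posterior} \,\Vert\, \text{prior})$ | Prior should match posterior (imagination becomes accurate) | β = 0.1 to 1.0
$L_{\text{reward}} = \lVert \hat{r}_t - r_t \rVert^2$ | Predict rewards from latent state | 1.0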
Why KL matters most: The KL loss is what makes imagination work. Without it, the prior (used during dreaming) and posterior (used during training with real data) would diverge. The prior would generate garbage latent states, and the policy trained in imagination would fail in reality. The β coefficient balances reconstruction quality against imagination quality. DreamerV2 added "KL balancing", which weights the two sides of the KL term differently so that the prior is pulled toward the posterior faster than the posterior is regularized toward the prior.
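Here is a small sketch of how the three terms combine for a single timestep, using plain Python lists in place of tensors. It assumes one categorical latent (a list of class probabilities) rather than the full 32×32 layout, and the function names and the `eps` smoothing are illustrative choices.

```python
import math
from typing import Sequence


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared differences, i.e. ||pred - target||^2."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def categorical_kl(posterior: Sequence[float], prior: Sequence[float],
                   eps: float = 1e-8) -> float:
    """KL(posterior || prior) for two categorical distributions."""
    return sum(q * (math.log(q + eps) - math.log(p + eps))
               for q, p in zip(posterior, prior))


def world_model_loss(obs, obs_hat, reward, reward_hat,
                     posterior, prior, beta: float = 1.0) -> float:
    """L = L_recon + beta * L_KL + L_reward for one timestep."""
    l_recon = squared_error(obs_hat, obs)
    l_kl = categorical_kl(posterior, prior)
    l_reward = (reward_hat - reward) ** 2
    return l_recon + beta * l_kl + l_reward


# A prior that already agrees with the posterior adds no KL penalty...
agree = world_model_loss([1.0, 2.0], [1.0, 2.0], 0.5, 0.5, [0.5, 0.5], [0.5, 0.5])
# ...while a prior that is surprised by the observation pays for it.
surprised = world_model_loss([1.0, 2.0], [1.0, 2.0], 0.5, 0.5, [1.0, 0.0], [0.5, 0.5])
print(agree, surprised)
```

With perfect reconstruction and reward prediction, the only remaining loss in the second call is the KL term, which is the gap that training the prior is supposed to close.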
Check: Why do world models predict in latent space instead of pixel space?
🔨 Derivation: Derive the Latent Dynamics Training Objective

You have a latent world model with encoder E, dynamics model f, decoder D, and reward predictor R. The model receives sequences of (ot, at, rt) from a replay buffer.

Your task: Derive why the training loss must have three terms (reconstruction + KL + reward), and why leaving any one out causes the model to fail at imagination.

If you remove Lrecon, the encoder is free to map all observations to the same latent z. The latent space becomes meaningless because there's no pressure to preserve information. The encoder could output z = [0, 0, ..., 0] for every image and the KL loss would be trivially minimized.
During imagination, you don't have observations — you can only use the prior p(zt | ht). During training with real data, you use the posterior q(zt | ht, ot). The KL loss forces prior ≈ posterior, so the prior (used in dreams) generates latents that look like real posteriors (used in training). Without KL, the prior would generate out-of-distribution latents during dreaming.
The actor and critic train inside the dream. They need reward signals to learn. If the world model can't predict rewards from latent states, the policy has no learning signal during imagination — it would only learn from real (sparse) rewards.

Full derivation:

We want to learn a latent model where imagination produces useful training signal. This requires:

1. Lrecon = ||D(zt) - ot||² — Forces the encoder to preserve decision-relevant information. Without it, z becomes trivial.

2. LKL = KL(q(z|h,o) || p(z|h)) — The ELBO derivation gives this naturally. The posterior q uses real observations; the prior p uses only history. Minimizing KL makes p generate z samples that look like those from q. Since imagination uses only p, this ensures dreamed latents are realistic.

3. Lreward = ||R(h,z) - r||² — The policy optimizes imagined rewards. If these are wrong, the policy learns to optimize the wrong objective. Reward prediction must be accurate for policy training to transfer to reality.

The key insight: This is actually the evidence lower bound (ELBO) for a sequential VAE. Lrecon is the likelihood term, LKL is the complexity penalty, and Lreward is an auxiliary task that makes the latent space reward-predictive. The three losses create a triangle of constraints: encode well ↔ imagine well ↔ evaluate well.

🔗 Pattern Recognition
The RSSM IS a Nonlinear Kalman Filter
Kalman Filter
$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (z_t - H \hat{x}_{t|t-1})$
Prior (predict) → Posterior (update with measurement)
World Model RSSM
zt ~ q(z | ht, ot) corrects prior p(z | ht)
Prior (dynamics only) → Posterior (update with observation)

Both systems maintain a belief state updated in two phases: predict (propagate forward using dynamics alone) then update (correct using new observation). The Kalman filter does this with linear Gaussians; the RSSM does it with neural networks and categorical distributions. The KL loss in the RSSM plays a role analogous to the innovation (the measurement residual $z_t - H \hat{x}_{t|t-1}$) in the Kalman filter: it measures how surprised the system is by the observation.
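For readers who have not seen a Kalman filter in code, here is a minimal scalar version of the predict/update cycle. It assumes a static state (dynamics $x_t = x_{t-1}$ plus noise) and a direct measurement ($H = 1$); the function names are illustrative.

```python
from typing import Tuple


def kalman_predict(mean: float, var: float, process_var: float) -> Tuple[float, float]:
    """Prior: propagate the belief using dynamics alone (uncertainty grows)."""
    return mean, var + process_var


def kalman_update(prior_mean: float, prior_var: float,
                  measurement: float, meas_var: float) -> Tuple[float, float, float]:
    """Posterior: correct the prior with a measurement (uncertainty shrinks)."""
    innovation = measurement - prior_mean          # how surprised the filter is
    gain = prior_var / (prior_var + meas_var)      # Kalman gain K_t
    post_mean = prior_mean + gain * innovation
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var, gain


prior_mean, prior_var = kalman_predict(0.0, 1.0, 0.0)
post_mean, post_var, gain = kalman_update(prior_mean, prior_var,
                                          measurement=2.0, meas_var=1.0)
print(post_mean, post_var, gain)
```

The prior here corresponds to what the RSSM computes from history alone, and the posterior to what it computes once the observation arrives.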

Can you identify the RSSM's "Kalman gain" — the mechanism that controls how much the posterior shifts from the prior when a new observation arrives?

Checkpoint — Before you move on
Explain in your own words: why does imagination require BOTH the prior network AND the KL training loss? What would happen if you only had the posterior but no prior?
Model Answer

During imagination, you have NO observations — only the history of imagined states. The posterior q(z|h,o) requires an observation o, so you can't use it when dreaming. You need a prior p(z|h) that generates plausible latent states from history alone. But the prior is only useful if it generates states similar to what the posterior would produce with real data — which is exactly what the KL loss enforces. Without KL training, the prior generates random, unrealistic latents, and any policy trained in that imagination will fail in reality. The prior IS the imagination engine; the KL loss IS the quality control that keeps imagination grounded in reality.

Chapter 3: Dreamer (v1 → v3)

Dreamer is one of the most successful families of world-model agents. The architecture has three components: a world model (RSSM), an actor (policy), and a critic (value function). The key innovation: the actor and critic are trained entirely inside the world model's imagination.

The RSSM: Dreamer's Brain

The core of Dreamer is the Recurrent State-Space Model (RSSM). Unlike a simple latent vector, the RSSM splits state into two parts:

Deterministic state $h_t$
GRU hidden state: [200] dims
Carries persistent history: the "memory" of what happened. Updated every step via the GRU recurrence.
Stochastic state $z_t$
Categorical: 32 variables × 32 classes each = [1024] (flattened one-hots)
Captures uncertainty about the current state. Sampled from a learned distribution using straight-through gradients.

Total latent state: ht [200] + zt [1024] = 1,224 dimensions. The split matters: h carries deterministic dynamics (ball moving along a trajectory), z captures stochastic branching (which direction the ball bounces).

Prior (imagination)
$h_t = \mathrm{GRU}(h_{t-1}, z_{t-1}, a_{t-1})$
$\hat{z}_t \sim \text{prior\_net}(h_t)$: what the model expects
vs
Posterior (reality check)
$z_t \sim \text{posterior\_net}(h_t, \text{encode}(o_t))$: what actually happened
Why two distributions? The prior predicts zt from history alone (for imagining). The posterior updates zt using the real observation (for training). The KL divergence between them is a training loss: it forces the prior to get better at predicting, so imagination becomes more accurate over time.

The Dreamer Training Loop

1. Collect Real Experience
Run policy in environment, store (ot, at, rt) in replay buffer
2. Train World Model
Encode ot → zt, run RSSM, minimize:
reconstruction loss + KL(posterior || prior) + reward prediction
3. Imagine H=15 Steps
From real (h, z), unroll prior forward:
for t = 1..15: h' = GRU(h, z, a), z' ~ prior(h'), r' = reward_net(h', z')
Shape: [B × 15 × 1224] imagined trajectory
4. Train Actor-Critic
Backprop through entire imagined rollout!
$\nabla_\theta J = \partial(\text{return}) / \partial\theta$, computed through all 15 steps
↻ repeat
The data efficiency multiplier: Each real step generates 15 imagined steps of learning signal (H=15 default). This is a large part of why Dreamer can need roughly 50× fewer environment interactions than model-free PPO: one real experience seeds 15 imagined ones, stored experience can be replayed many times from the buffer, and backprop flows through the entire imagined trajectory.
Version | Year | Key Improvement
DreamerV1 | 2020 | Latent imagination + value estimation
DreamerV2 | 2021 | Discrete latents (32×32 categorical), KL balancing, Atari mastery
DreamerV3 | 2023 | Symlog predictions, fixed hyperparams, works across domains unchanged
Dreamer's Imagination Rollout

The agent imagines future states from the current state. Teal = real states, orange = imagined futures, green = rewards predicted. Notice how opacity fades — the model is less certain further into the future.

Imagination horizon: 12

Latent Imagination: What Actually Happens Step by Step

Here's the exact computation for one imagination step. Starting from a real state (h3, z3) after 3 real environment steps:

pseudocode
# Current real state
h3 = [200]   # GRU hidden state from real experience
z3 = [1024]  # Stochastic state (32 categoricals × 32 classes, flattened)

# === Imagination step 1 ===
a3 = actor(h3, z3)              # Policy outputs action [action_dim]
h4 = GRU(h3, cat(z3, a3))      # Deterministic update: [200]
z4 ~ categorical(prior_net(h4)) # Sample stochastic: [1024]
r4 = reward_net(h4, z4)        # Predicted reward: scalar

# === Imagination step 2 ===
a4 = actor(h4, z4)
h5 = GRU(h4, cat(z4, a4))
z5 ~ categorical(prior_net(h5))
r5 = reward_net(h5, z5)

# ... repeat for H=15 total steps
# Then backprop ∇θ through ALL 15 steps!
The key trick: Because everything is differentiable (GRU, MLP, categorical with straight-through gradients), we can backpropagate the imagined return through the entire trajectory. The actor learns: "if I take action A at step 3, through a chain of 14 more imagined steps, I get this total reward." This is fundamentally different from model-free RL, which can only learn from actually-experienced rewards.
DreamerV3's breakthrough: It was the first algorithm to collect diamonds in Minecraft from scratch — a task requiring long-horizon planning, exploration, and tool crafting — all learned through imagination. Same hyperparameters on Atari, DMC, Minecraft, and Crafter. No tuning per domain.
Check: Where does Dreamer train its policy?
🔨 Derivation: Why the Posterior Uses Observations (Not Just History)

The RSSM has two paths to estimate zt: the prior p(zt | ht) which uses only the deterministic history, and the posterior q(zt | ht, ot) which also sees the current observation.

Your task: Derive from Bayes' rule why the posterior must condition on ot, and explain the information-theoretic reason the posterior is always better than the prior for training.

The generative model factorizes as: p(o1:T, z1:T | a1:T) = ∏t p(ot | zt, ht) · p(zt | ht). The prior p(zt | ht) is the model's best guess BEFORE seeing ot. After seeing ot, Bayes gives: q(zt | ht, ot) ∝ p(ot | zt, ht) · p(zt | ht).
Training the world model requires computing ∇Lrecon. If zt is sampled from the prior (which ignores ot), the sampled z might be far from what produced ot, giving high-variance gradients. The posterior concentrates probability on z values that actually explain ot, giving low-variance, informative gradients for all downstream networks.
log p(ot) ≥ Eq[log p(ot|zt)] - KL(q || p). The tighter the posterior approximates the true posterior, the tighter this bound. Using the prior for training would give a loose bound — equivalent to maximizing a lower bound on a lower bound.

Full derivation:

1. Generative model: zt ~ p(z|ht), ot ~ p(o|zt,ht). The prior captures what the model expects before seeing reality.

2. Bayes' rule: p(zt|ht,ot) = p(ot|zt,ht) · p(zt|ht) / p(ot|ht). The observation ot sharpens the distribution over z — eliminating latent states that couldn't have produced this observation.

3. For training: We approximate this true posterior with q(z|h,o) = neural_net(h,encode(o)). The ELBO gives: log p(o) ≥ Eq[log p(o|z)] - KL(q||p). Using q (posterior) for training maximizes a tight bound. Using p (prior) would be sampling z randomly, hoping some explain o — astronomically wasteful.

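4. Information gain: $I(z_t; o_t \mid h_t) = \mathbb{E}_{o_t}\left[\mathrm{KL}(q \,\Vert\, p)\right]$ when q matches the true posterior, so the KL, averaged over observations, measures how much information the observation provides about the latent. A large KL means the observation is highly informative: the prior was surprised.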

The key insight: The posterior is the "teacher" that shows the prior what z values actually correspond to real observations. Training with the prior alone would be like studying for an exam without an answer key — you'd never know which of your guesses were correct.

🔨 Derivation: The Dreaming Objective (Backprop Through Imagination)

Dreamer trains its policy by backpropagating through H imagined steps. The actor πθ selects actions, the world model produces next states, and a critic estimates values.

Your task: Write out the objective J(θ) that the actor maximizes during dreaming, and show why gradients can flow through the imagination (unlike model-free RL where they can't).

The actor maximizes the sum of imagined rewards plus a terminal value estimate: $J(\theta) = \sum_{t=1}^{H} \gamma^{t-1} \hat{r}_t + \gamma^{H} V(\hat{s}_H)$. Here $\hat{r}_t = \text{reward\_net}(h_t, z_t)$ and $V$ is the critic's value estimate at the end of the imagination horizon.
In model-free RL, the environment is a black box — you can't differentiate through physics. But in Dreamer, the "environment" IS the world model (a differentiable neural network). The chain is: θ → at → GRU → ht+1 → prior → zt+1 → reward_net → rt+1. Every link is differentiable (categorical uses straight-through estimator).
$\nabla_\theta r_{t+1} = \frac{\partial r}{\partial z} \frac{\partial z}{\partial h} \frac{\partial h}{\partial a} \frac{\partial a}{\partial \theta}$. Each factor is a Jacobian of a neural network layer, all computable via autograd. For H steps, you just extend the chain: the action at step 1 affects h at step 2, which affects z at step 2, which affects both r and a at step 2, and a at step 2 in turn affects h at step 3, and so on.

Full derivation:

1. The objective: $J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{H} \gamma^{t-1} \hat{r}_t + \gamma^{H} V_\psi(h_H, z_H)\right]$

2. The computation graph (one step):
$a_t = \pi_\theta(h_t, z_t)$    [differentiable: policy is a neural net]
$h_{t+1} = \mathrm{GRU}(h_t, [z_t, a_t])$    [differentiable: GRU is a neural net]
$z_{t+1} \sim \mathrm{Cat}(\text{prior}(h_{t+1}))$    [differentiable via straight-through]
$r_{t+1} = \text{reward\_net}(h_{t+1}, z_{t+1})$    [differentiable]

3. Gradient for step t through step t+k:
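$\nabla_\theta r_{t+k} = \frac{\partial r}{\partial s_{t+k}} \cdot \left( \prod_{i=0}^{k-1} \frac{\partial s_{t+i+1}}{\partial a_{t+i}} \cdot \frac{\partial a_{t+i}}{\partial s_{t+i}} \right) \cdot \frac{\partial a_t}{\partial \theta}$ (schematic: this traces one path from θ through $a_t$; the full gradient also sums the direct dependence of every later action on θ)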

4. Why model-free can't do this: In model-free RL, st+1 = env(st, at). The "env" function is the real world — no Jacobian exists. You must use REINFORCE (high variance) or temporal-difference (bootstrapping). Dreamer replaces env() with a differentiable neural net, so direct backprop works.

The key insight: Dreamer makes policy learning look much more like supervised learning. The actor gets a dense, low-variance gradient signal by differentiating through the world model, much as a classifier is trained by differentiating through its loss. This is a major reason it is so much more sample-efficient than policy gradient methods that rely on noisy reward-weighted log-probabilities.
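The chain-rule argument can be checked numerically on a toy problem. The sketch below assumes a scalar linear world model $s' = A s + B a$, a linear policy $a = \theta s$, reward $-s'^2$, and no terminal value term. It accumulates $dJ/d\theta$ by hand alongside the rollout (a forward-mode stand-in for what autograd does in Dreamer) and compares the result with a finite-difference estimate. All names and constants are illustrative.

```python
from typing import Tuple


def imagined_return_and_grad(theta: float, s0: float = 1.0, horizon: int = 15,
                             gamma: float = 0.99, A: float = 1.0,
                             B: float = 0.5) -> Tuple[float, float]:
    """Return J(theta) and dJ/dtheta for an imagined rollout of a toy model."""
    s, ds = s0, 0.0            # state and its derivative with respect to theta
    J, dJ = 0.0, 0.0
    for t in range(horizon):
        a = theta * s                      # policy
        da = s + theta * ds                # product rule: theta and s both vary
        s_next = A * s + B * a             # differentiable "world model"
        ds_next = A * ds + B * da
        r = -s_next ** 2                   # reward: stay close to zero
        dr = -2.0 * s_next * ds_next
        J += gamma ** t * r
        dJ += gamma ** t * dr
        s, ds = s_next, ds_next
    return J, dJ


theta = -0.5
J, analytic = imagined_return_and_grad(theta)
eps = 1e-5
numeric = (imagined_return_and_grad(theta + eps)[0]
           - imagined_return_and_grad(theta - eps)[0]) / (2 * eps)
# One gradient-ascent step on the imagined return.
J_after, _ = imagined_return_and_grad(theta + 0.05 * analytic)
print(analytic, numeric, J, J_after)
```

If the model step were replaced by a call to a real environment, the `ds_next` line would have no counterpart, which is the obstacle described in step 4 above.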

💥 Break-It Lab: What Dies When You Remove RSSM Components?
The RSSM has three critical components working together: the observation model (encoder + decoder), the stochastic path (zt sampling), and the reconstruction loss. Each prevents a specific failure mode. Toggle them off and watch what breaks.
Remove Observation Model
Failure mode: Imagination without grounding. Without observations correcting the latent state, the prior drifts freely. After a few steps, the imagined world bears no resemblance to reality. The agent's dreams become fantasies — it might imagine getting rewards that don't exist. Equivalent to a Kalman filter that never incorporates measurements: the prediction error grows without bound.
Remove Stochastic Path (zt)
Failure mode: Deterministic collapse. With only the deterministic ht, the model can only predict one future per state. At branching points (will the enemy go left or right?), it averages all possibilities into a blurry mean prediction. Rewards become averaged too — the agent can't distinguish between "50% chance of +10" and "certain +5". It becomes blind to risk and opportunity.
Remove Reconstruction Loss
Failure mode: Latent drift / information loss. Without reconstruction pressure, the encoder has no reason to preserve visual information in z. The latent space collapses — all observations map to similar z values. The dynamics model might predict perfectly (ẑt+1 ≈ zt+1) but the latents carry no information about the actual state. The reward predictor and policy have nothing meaningful to work with.
💻 Build It: Implement the RSSM Prior/Posterior Update Step
Implement one step of the RSSM that computes both the prior (for imagination) and posterior (for training). The GRU updates the deterministic state h, then two separate MLPs produce the prior and posterior distributions over z.
signature
def rssm_step(prev_h, prev_z, action, observation,
              gru, prior_net, posterior_net, encoder):
    """
    One RSSM step: compute deterministic state, prior, and posterior.

    Args:
        prev_h: [batch, 200] - previous GRU hidden state
        prev_z: [batch, 1024] - previous stochastic state (32 categoricals x 32 classes, flattened)
        action: [batch, action_dim] - action taken
        observation: [batch, 3, 64, 64] - current observation (or None during imagination)
        gru: GRUCell(input=1024+action_dim, hidden=200)
        prior_net: Linear(200, 32*32) - produces logits for prior
        posterior_net: Linear(200+embed_dim, 32*32) - produces logits for posterior
        encoder: CNN that maps [3,64,64] -> [embed_dim]

    Returns:
        h: [batch, 200] - new deterministic state
        prior_logits: [batch, 32, 32] - prior distribution parameters
        posterior_logits: [batch, 32, 32] or None - posterior (None during imagination)
        z: [batch, 1024] - sampled stochastic state (from posterior if available, else prior)
    """
Test case
h, prior_logits, post_logits, z = rssm_step(prev_h, prev_z, action, obs, gru, prior_net, post_net, enc)
assert h.shape == (B, 200)
assert prior_logits.shape == (B, 32, 32)
assert post_logits.shape == (B, 32, 32)
assert z.shape == (B, 1024) # 32*32 flattened one-hots
Use straight-through gradients: sample a discrete one-hot from the categorical, but in the backward pass, pass gradients through the softmax probabilities. In PyTorch: z_onehot = torch.distributions.OneHotCategorical(logits=logits).sample() for the forward pass (argmax would always pick the mode instead of sampling), then z = z_onehot + probs - probs.detach() for gradient flow. Then flatten the 32×32 one-hots to get [batch, 1024].
python
import torch
import torch.nn.functional as F


def rssm_step(prev_h, prev_z, action, observation,
              gru, prior_net, posterior_net, encoder):
    # Step 1: Deterministic state update
    # GRU input = concat(prev_z, action)
    gru_input = torch.cat([prev_z, action], dim=-1)  # [B, 1024+act_dim]
    h = gru(gru_input, prev_h)  # [B, 200]

    # Step 2: Prior (what the model expects from dynamics alone)
    prior_logits = prior_net(h).reshape(-1, 32, 32)  # [B, 32, 32]

    # Step 3: Posterior (correct belief using real observation)
    if observation is not None:
        embed = encoder(observation)  # [B, embed_dim]
        post_input = torch.cat([h, embed], dim=-1)  # [B, 200+embed_dim]
        posterior_logits = posterior_net(post_input).reshape(-1, 32, 32)
        logits_to_sample = posterior_logits  # use posterior during training
    else:
        posterior_logits = None
        logits_to_sample = prior_logits  # use prior during imagination

    # Step 4: Sample z with straight-through gradients
    probs = F.softmax(logits_to_sample, dim=-1)  # [B, 32, 32]
    # Draw one class per categorical variable. The original used argmax here,
    # which always returns the mode and never samples.
    dist = torch.distributions.OneHotCategorical(logits=logits_to_sample)
    z_onehot = dist.sample()  # [B, 32, 32], no gradient
    # Straight-through: forward uses discrete, backward uses continuous
    z_st = z_onehot + probs - probs.detach()  # [B, 32, 32]
    z = z_st.reshape(-1, 32 * 32)  # [B, 1024]

    return h, prior_logits, posterior_logits, z
Bonus challenge: Extend this to support KL balancing (DreamerV2): use a mixture of free-nats clipping and KL in both directions, α·KL(sg(post)||prior) + (1-α)·KL(post||sg(prior)) where sg = stop gradient.
🔗 Pattern Recognition
Latent Dynamics = A VAE at Every Timestep
Standard VAE
Encoder: x → q(z|x)
Decoder: z → p(x|z)
Loss: recon + KL(q(z|x) || p(z))
RSSM (per timestep)
Posterior: ot, ht → q(zt|ht,ot)
Decoder: zt, ht → p(ot|zt,ht)
Loss: recon + KL(posterior || prior)

The RSSM is a conditional VAE at each timestep, where the conditioning context is the deterministic history ht. The prior p(z) in a standard VAE is fixed (usually N(0,1)); in the RSSM, it's learned — the prior network predicts what z should be from dynamics alone. This is what makes the "dreaming" mode possible: the learned prior replaces the observation.

In a standard VAE, KL(q||p) regularizes toward a fixed prior. In the RSSM, what does KL(posterior||prior) actually measure about the world model's understanding?

Chapter 4: JEPA — Joint Embedding Predictive Architecture

Yann LeCun proposed JEPA as an alternative to generative world models. Instead of predicting what the next observation looks like (pixel reconstruction), JEPA predicts the abstract representation of the next state. This avoids wasting capacity on irrelevant details.

predict: $\text{embed}(x_{t+1}) \approx \text{predictor}(\text{embed}(x_t), a_t)$

The key difference from autoencoders: JEPA never reconstructs pixels. Both the target and prediction live in embedding space. A VICReg or similar loss prevents the embeddings from collapsing to trivial solutions.

The Architecture: Three Networks

JEPA has a surprisingly simple implementation with three components that must be carefully balanced:

Context Encoder fθ
Encodes the current observation xt → zt
ViT backbone, trained with gradients
Predictor gφ
MLP or small transformer: predicts ẑt+1 from zt
Trained with gradients
match with ↓
Target Encoder f̄θ
Encodes the actual next observation xt+1 → z*t+1
EMA update only: θ̄ ← τθ̄ + (1-τ)θ, τ=0.996
Why EMA (Exponential Moving Average)? The target encoder doesn't get gradients — it's a slowly-moving copy of the context encoder. This creates a stable prediction target that changes slowly enough for the predictor to learn, but fast enough to improve over time. Without this, both encoders could agree to output zeros — perfect match, zero learning. The asymmetry breaks this collapse.
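The EMA rule is a one-liner in practice. The sketch below applies it to flat lists of floats standing in for network weights; `ema_update` is an illustrative name, and real implementations would loop over parameter tensors instead.

```python
from typing import List


def ema_update(target_params: List[float], online_params: List[float],
               tau: float = 0.996) -> List[float]:
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, element by element."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


# The target starts at 0 and slowly tracks an online encoder fixed at 1.
target = [0.0]
online = [1.0]
one_step = ema_update(target, online)
for _ in range(1000):
    target = ema_update(target, online)
print(one_step, target)
```

After one update the target has barely moved, and even after 1,000 updates it is still slightly behind the online weights. That lag is the "slowly-moving copy" described above.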

The VICReg Loss: Three Terms, Each Essential

The loss that prevents collapse has three components. Each solves a different failure mode:

Term | What It Does | What Breaks Without It
Invariance | MSE between predicted $\hat{z}_{t+1}$ and target $z^*_{t+1}$ | Predictions wouldn't match reality
Variance | Forces std(z) ≥ 1 along each dimension | All embeddings collapse to a single point (constant output)
Covariance | Decorrelates embedding dimensions | All dimensions encode the same information (rank collapse)
$L = \lambda_{\text{inv}} \cdot \lVert \hat{z} - z^* \rVert^2 + \lambda_{\text{var}} \cdot \max(0, 1 - \mathrm{std}(z)) + \lambda_{\text{cov}} \cdot \lVert \mathrm{off\_diag}(\mathrm{cov}(z)) \rVert^2$
Think of it as a force balance: Invariance pulls predicted and target embeddings together. Variance pushes embeddings apart (prevents collapse). Covariance spreads information across dimensions. All three must be active simultaneously — remove any one and the representation degrades.
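To see the three forces at work, here is a simplified sketch of the loss on small batches of embeddings stored as nested lists. It follows the formula above literally; the published VICReg loss differs in details such as normalization, and the weights used here are illustrative assumptions.

```python
import math
from typing import Dict, List

Batch = List[List[float]]  # embeddings as [batch][dim]


def covariance(z: Batch) -> List[List[float]]:
    """Sample covariance matrix of the embedding dimensions."""
    n, d = len(z), len(z[0])
    mean = [sum(row[j] for row in z) / n for j in range(d)]
    return [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in z) / (n - 1)
             for j in range(d)] for i in range(d)]


def vicreg_terms(pred: Batch, target: Batch, lam_inv: float = 25.0,
                 lam_var: float = 25.0, lam_cov: float = 1.0) -> Dict[str, float]:
    n, d = len(pred), len(pred[0])
    # Invariance: mean squared distance between prediction and target.
    inv = sum((p - t) ** 2 for pr, tr in zip(pred, target)
              for p, t in zip(pr, tr)) / n
    cov = covariance(pred)
    # Variance: hinge that is active while a dimension's std is below 1.
    var = sum(max(0.0, 1.0 - math.sqrt(cov[j][j])) for j in range(d)) / d
    # Covariance: squared off-diagonal entries penalize correlated dimensions.
    off = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    total = lam_inv * inv + lam_var * var + lam_cov * off
    return {"invariance": inv, "variance": var, "covariance": off, "total": total}


collapsed = [[1.0, 1.0]] * 4                                   # constant output
correlated = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
for name, z in [("collapsed", collapsed), ("correlated", correlated), ("spread", spread)]:
    print(name, vicreg_terms(z, z))
```

In all three cases prediction and target match perfectly, so the invariance term is zero. Only the variance and covariance terms tell the collapsed and correlated batches apart from the well-spread one.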
JEPA vs Generative Prediction

Compare: generative models predict pixels (expensive, noisy), JEPA predicts embeddings (cheap, abstract).

V-JEPA: From Images to Video Understanding

V-JEPA (Video JEPA, Meta 2024) extends the idea to video. Instead of predicting the embedding of the next frame, it predicts the embedding of a masked spacetime region:

Input Video
16 frames, split into spacetime patches
90% of patches are masked (hidden from context encoder)
Context Encoder + Predictor
Encode visible 10% patches, predict embeddings for masked 90%
match with ↓
Target Encoder (EMA)
Encode all patches (including masked ones) → target embeddings
The result: V-JEPA learns strong video representations without ever generating pixels. It understands motion, causality, and physical interactions — all from predicting abstract features. On video classification benchmarks, it matches or beats models that train on 10× more compute. This suggests that learning "what matters" is more efficient than learning "every pixel."
LeCun's vision: JEPA is a stepping stone toward autonomous machine intelligence. By learning abstract world models that capture what matters for planning (not pixel-perfect reconstruction), agents could reason at a human-like level of abstraction. The missing piece: connecting JEPA representations to action and reward for actual agent behavior.
Check: What does JEPA predict?

Chapter 5: Video Prediction — Genie & UniSim

The latest world models don't just predict latent states — they generate entire videos of what will happen next. This is world modeling at scale: train on millions of internet videos and learn a general-purpose simulator of the visual world.

How Video Tokens Work

Video generation models (like Sora) extend image generation to the temporal dimension. The key architectural idea: treat video as a 3D grid of spacetime patches:

Raw Video
T frames × H × W × 3 (e.g., 16 × 256 × 256 × 3)
↓ 3D VAE encoder (compress spatially + temporally)
Latent Video
T/4 × H/8 × W/8 × d (e.g., 4 × 32 × 32 × 16)
= 4,096 latent tokens (4 × 32 × 32), each a 16-dim vector
↓ DiT (Diffusion Transformer)
Factored Attention
Temporal attention: each spatial position attends to the same position across all frames.
Spatial attention: the tokens within each frame attend to one another, as in a standard image transformer.
↓ 3D VAE decoder
Generated Video
16 × 256 × 256 × 3 RGB frames
Why video is harder than images: A single 256×256 image has 1,024 latent tokens (32 × 32). The 16-frame clip above has 4,096 even after 4× temporal compression, and 16,384 without it. Full self-attention scales with the square of the token count: 4,096² ≈ 16.8 million pairwise scores per layer, versus about 1 million for the image, and the gap widens quickly for longer or higher-resolution clips. This is why many video models use factored attention: alternating spatial-only and temporal-only blocks instead of full 3D attention.
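To make the cost argument concrete, here is a small calculation of token counts and attention pairs. It assumes one token per latent spacetime position (the channel dimension d is the feature size of a token, not extra tokens) and the 4× temporal, 8× spatial compression from the diagram. The function names are hypothetical, and the "pairs" count ignores heads, layers, and constant factors.

```python
def latent_grid(frames, height, width, t_stride=4, s_stride=8):
    """Latent grid (T', H', W') after the 3D VAE's temporal and spatial compression."""
    return frames // t_stride, height // s_stride, width // s_stride

def attention_pairs(frames, height, width):
    """Return (tokens, full 3D attention pairs, factored attention pairs)."""
    t, h, w = latent_grid(frames, height, width)
    tokens = t * h * w
    full = tokens ** 2               # every token attends to every other token
    spatial = t * (h * w) ** 2       # within each frame, all positions attend to each other
    temporal = (h * w) * t ** 2      # each position attends to itself across frames
    return tokens, full, spatial + temporal

for shape in [(16, 256, 256), (128, 512, 512)]:
    tokens, full, factored = attention_pairs(*shape)
    print(f"{shape}: {tokens:,} tokens, full={full:,}, "
          f"factored={factored:,}, saving={full / factored:.1f}x")
```

For the short 16-frame clip the saving is modest, but it grows with clip length and resolution, because the full cost is quadratic in the total token count while the factored cost is quadratic only within a frame or within a time column.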
Model | Key Idea | Training Data | Architecture
Genie (DeepMind) | Learn actions from unlabeled video | Internet gameplay videos | VQ-VAE + masked transformer
UniSim (Google) | Universal simulator of visual experience | Internet video + images | Cascaded diffusion
Sora (OpenAI) | Diffusion transformer, implicit physics | Internet video | DiT with spacetime patches
Cosmos (NVIDIA) | World foundation model for physical AI | Driving + robotics video | Autoregressive + diffusion
Video World Model: Frame Prediction

Given the current frame and an action, the model predicts future frames. Watch how prediction quality degrades over longer horizons.

Prediction horizon: 4
From video model to world model: If a video model can predict what happens when you push a cup off a table (it falls), it has implicitly learned gravity. Video prediction at scale may be the path to general-purpose physical understanding. The question: can you act inside these models? Genie says yes — it learns an action space from pure video, making the generated world interactive.

Genie: Action Discovery from Unlabeled Video

Genie's key insight is radical: you don't need labeled actions to build an interactive world model. From millions of unlabeled gameplay videos, Genie automatically discovers a latent action space:

Two Consecutive Frames
frame_t: [256, 256, 3], frame_{t+1}: [256, 256, 3]
↓ latent action encoder
Discovered Action
a_t: one of 8 discrete latent codes (unlabeled; a user finds by trial which code behaves like left, right, jump, ...)
Learned entirely from visual differences between frames
↓ at inference: human provides discrete action
Generated Next Frame
frame_{t+1} = model(frame_t, a_t), a playable world!
From passive video to interactive world: Genie trained on 200,000 hours of 2D platformer videos with zero action labels. At inference, a user can control the character by selecting actions. The model has learned physics (gravity, collisions), game logic (enemies, platforms), and visual consistency — all from pure observation.
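The "latent action encoder" step can be illustrated with a toy vector-quantization sketch: encode a pair of frames into a small vector, then snap it to the nearest of 8 codebook entries. Everything here is an assumption made for illustration: the linear encoder, the 4-dim action space, the tiny 16×16 frames, and the random untrained weights. Genie's real encoder is a trained video transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ACTIONS, ACTION_DIM = 8, 4        # 8 discrete codes as in the text; ACTION_DIM is illustrative
FRAME_SHAPE = (16, 16, 3)             # downscaled from 256x256 to keep the toy small
FRAME_SIZE = int(np.prod(FRAME_SHAPE))

# Hypothetical untrained parameters: a linear pair encoder and an action codebook.
W_enc = rng.normal(scale=0.05, size=(2 * FRAME_SIZE, ACTION_DIM))
codebook = rng.normal(size=(NUM_ACTIONS, ACTION_DIM))

def infer_latent_action(frame_t, frame_next):
    """Encode a frame pair and return the index of the nearest codebook entry (the VQ step)."""
    pair = np.concatenate([frame_t.ravel(), frame_next.ravel()])
    z = pair @ W_enc                                   # continuous "what changed" vector
    distances = np.linalg.norm(codebook - z, axis=1)   # distance to each of the 8 codes
    return int(np.argmin(distances))

frame_t = rng.random(FRAME_SHAPE)
frame_next = rng.random(FRAME_SHAPE)
action_id = infer_latent_action(frame_t, frame_next)
print(f"discovered latent action: {action_id} (one of {NUM_ACTIONS})")
```

With random weights the code index is meaningless. During training, a model has to reconstruct the next frame from the current frame plus this one small code, and that bottleneck is what pushes the code to carry the controllable part of the change between frames.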
Check: What makes Genie special compared to traditional world models?

Chapter 6: Planning with World Models

Having a world model is only useful if you can use it to make decisions. Planning algorithms search through imagined futures to find the best action sequence. Major approaches:

Method | How It Plans | Used In
Random Shooting | Sample many action sequences, pick best | PETS
CEM (Cross-Entropy) | Iteratively refine action distribution | PETS, TD-MPC
MCTS (Tree Search) | Build a search tree of states | MuZero, EfficientZero
Backprop through model | Gradient-based trajectory optimization | Dreamer

CEM / MPPI: Concrete Shapes

The Cross-Entropy Method (CEM) is the workhorse planning algorithm. Here's exactly what happens at each decision step for a robot arm with 6 joints and a horizon of H=12 steps:

1. Sample N Action Sequences
Draw N=512 trajectories from Gaussian:
actions: [512, 12, 6] — 512 plans, each 12 steps, 6-dim action
2. Imagine Each Plan
For each of 512 plans, roll out world model 12 steps:
states: [512, 12, latent_dim], rewards: [512, 12]
Total: 6,144 imagination steps in one batch
3. Rank by Total Reward
Sum rewards per trajectory: [512] scores
Keep top K=64 (elite fraction = 12.5%)
4. Refit Distribution
Update μ, σ from the 64 best trajectories
Repeat 3-5 iterations, then execute first action
The real cost of planning: At each decision step, CEM imagines 512 × 12 = 6,144 future states per iteration. Over 3 refinement iterations, that's ~18,000 world model steps per action. This is why the world model must be fast (latent space, not pixels) and why GPU parallelism matters: the 512 trajectories advance together, so each iteration needs only 12 sequential batched forward passes rather than 6,144 separate ones.
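The four steps above fit in a short NumPy function. This is a minimal sketch, not any particular library's planner: `toy_dynamics` and `toy_reward` are hypothetical stand-ins for a learned latent model (the state simply drifts by 0.1 × action and the reward is negative distance to a goal), and the initial mean of 0 and standard deviation of 1 are arbitrary starting values.

```python
import numpy as np

def cem_plan(state, dynamics, reward_fn, horizon=12, action_dim=6,
             num_samples=512, num_elites=64, iterations=3, seed=0):
    """Cross-Entropy Method: returns (first action to execute, full mean plan)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # 1. Sample N action sequences: shape [N, H, action_dim]
        noise = rng.standard_normal((num_samples, horizon, action_dim))
        actions = mean + std * noise
        # 2. Imagine every plan, one batched model call per time step
        states = np.repeat(state[None, :], num_samples, axis=0)
        returns = np.zeros(num_samples)
        for t in range(horizon):
            states = dynamics(states, actions[:, t])
            returns += reward_fn(states)
        # 3. Rank by total reward and keep the top K
        elites = actions[np.argsort(returns)[-num_elites:]]
        # 4. Refit the Gaussian to the elites
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean[0], mean

def plan_return(state, plan, dynamics, reward_fn):
    """Total imagined reward of a single action sequence."""
    states, total = state[None, :], 0.0
    for action in plan:
        states = dynamics(states, action[None, :])
        total += float(reward_fn(states)[0])
    return total

# Hypothetical toy world model: 6-dim state, goal at 0.5 in every dimension.
goal = np.full(6, 0.5)
def toy_dynamics(states, actions):
    return states + 0.1 * actions
def toy_reward(states):
    return -np.linalg.norm(states - goal, axis=1)

start = np.zeros(6)
first_action, plan = cem_plan(start, toy_dynamics, toy_reward)
print("planned return:", plan_return(start, plan, toy_dynamics, toy_reward))
print("do-nothing return:", plan_return(start, np.zeros((12, 6)), toy_dynamics, toy_reward))
```

Only `first_action` is executed. At the next decision step the agent observes the real state and plans again, which is what keeps model errors from accumulating across the whole episode.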
Planning by Random Shooting

The agent imagines many possible futures (gray) and picks the one with the highest reward (green). Click to re-plan.

Samples: 30

Dreamer vs CEM: Different Planning Philosophies

Dreamer and CEM-based methods (like TD-MPC) represent fundamentally different approaches to using a world model:

Property | Dreamer (Backprop) | CEM (Sampling)
How it plans | Gradient of reward w.r.t. policy through imagined trajectory | Sample many action sequences, keep the best
Produces | An amortized policy (reusable) | A plan for right now (replan each step)
Cost at test time | One forward pass (cheap) | Thousands of forward passes (expensive)
Handles | Continuous + discrete actions | Continuous actions primarily
Weakness | Policy may be suboptimal in novel states | Expensive at decision time, action space must be small
Think of it this way: Dreamer is like studying for an exam. You practice (imagine) beforehand and perform quickly on test day. CEM is like having open-book access during the exam: slower, but you can reason about novel questions on the spot. In practice, TD-MPC2 combines both: an amortized policy gives an initial guess, then a sampling-based planner (MPPI, a close relative of CEM) refines it.
MuZero's triumph: DeepMind's MuZero learns a world model + uses MCTS planning to master Go, Chess, Shogi, and Atari — all with the same algorithm, no rules provided. It imagines game states and searches for the best move. The key insight: MuZero's world model doesn't predict observations at all — it only predicts rewards, values, and policy priors in latent space. This is enough for planning.
Check: How does CEM (Cross-Entropy Method) plan?
🏗 Design Challenge You're the Architect: Autonomous Driving World Model
Your team is building a world model for an autonomous vehicle. The system must predict 3 seconds into the future to enable safe planning. You have 10 cameras (surround view), LiDAR, and a 100ms planning budget per decision cycle. The model runs on an NVIDIA Orin (275 TOPS INT8, 32GB shared memory).
Input
10 cameras @ 1920×1080 + LiDAR point cloud (150K points)
Planning budget
100ms total (encode + imagine + plan)
Prediction horizon
3 seconds @ 10Hz = 30 future steps
Memory
32GB shared (also running perception, mapping, etc.)
Safety requirement
Must detect when predictions are unreliable
1. Latent size? 30 steps of rollout means compounding error. Larger latent = more info preserved but slower rollout. What dimensionality?
2. How do you fuse 10 cameras + LiDAR into a single latent state? BEV? Per-camera encoding? Token fusion?
3. Do you predict in pixel space (for visualization/debugging) or pure latent (for speed)? Or both with different models?
4. How do you handle the 30-step horizon within 100ms? Can you afford CEM with 512 samples × 30 steps?
5. How do you detect unreliable predictions for safety? Ensemble disagreement? Learned uncertainty?

Real-world solutions (NVIDIA Cosmos, Wayve GAIA-1, Tesla):

Latent size: ~256-512 dims for the state. BEV representation: 200×200 grid at 0.5m resolution with 64-128 channels — but compressed to ~512-dim summary for temporal dynamics. The full BEV is used for spatial reasoning; the compressed state for temporal rollout.

Sensor fusion: Bird's Eye View (BEV) is the winning approach. Each camera is projected into a shared BEV grid using LSS (Lift-Splat-Shoot) or BEVFormer-style cross-attention. LiDAR provides direct 3D supervision. This gives a unified spatial representation regardless of camera count.

Prediction space: Industry uses BOTH. A fast latent model for planning (runs in ~20ms for 30 steps), plus a slower video prediction model for visualization and validation. The latent model predicts occupancy, velocity fields, and agent trajectories — not pixels.

Planning budget: One CEM iteration over 512 samples × 30 steps is 15,360 latent steps. At ~2μs per step on Orin (amortized over the batch), that's ~30ms, which fits the budget. Three refinement iterations would take ~92ms and leave almost nothing for encoding. So most production systems use a learned policy (like Dreamer's) for the routine ~90% of situations, falling back to CEM only for complex scenarios. This gives a ~5ms typical + 50ms worst-case split.
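The budget arithmetic is easy to check, and doing so shows how sensitive it is to the number of CEM iterations. The sketch below takes the ~2μs per latent step figure from the text at face value and treats it as an amortized per-sample cost; `cem_latency_ms` is a hypothetical helper, not a profiling tool.

```python
def cem_latency_ms(num_samples, horizon, iterations, step_cost_us):
    """Return (total imagined steps, latency in ms) for a CEM planning call."""
    total_steps = num_samples * horizon * iterations
    return total_steps, total_steps * step_cost_us / 1000.0

BUDGET_MS = 100.0
for iterations in (1, 3):
    steps, ms = cem_latency_ms(512, 30, iterations, step_cost_us=2.0)
    print(f"{iterations} iteration(s): {steps:,} steps, {ms:.1f} ms "
          f"({ms / BUDGET_MS:.0%} of the budget)")
```

A single pass uses about a third of the 100ms budget, while three refinement passes would consume nearly all of it before any sensor encoding has run.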

Uncertainty: Ensemble of 5 world models. When they disagree on the predicted BEV occupancy by more than a threshold, the system flags "unreliable prediction" and reverts to a conservative emergency policy. This is epistemic uncertainty — knowing what you don't know.
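A minimal version of this disagreement check might look like the following. The disagreement measure (mean per-cell variance across the 5 members) and the threshold of 0.02 are assumptions chosen for the toy data, not values from any production system. The "familiar" and "novel" predictions are synthetic: five near-identical occupancy grids versus five unrelated ones.

```python
import numpy as np

def ensemble_disagreement(predictions):
    """Mean per-cell variance across ensemble members; predictions has shape [M, H, W]."""
    return float(predictions.var(axis=0).mean())

def is_reliable(predictions, threshold=0.02):
    """Flag the prediction as trustworthy only if the members roughly agree."""
    return ensemble_disagreement(predictions) <= threshold

rng = np.random.default_rng(0)
GRID = (200, 200)                      # BEV occupancy grid, as in the text

# Familiar scene: all 5 models predict nearly the same occupancy.
base = rng.random(GRID)
familiar = np.stack([
    np.clip(base + rng.normal(scale=0.02, size=GRID), 0.0, 1.0) for _ in range(5)
])

# Novel scene: the 5 models extrapolate differently and disagree everywhere.
novel = rng.random((5, *GRID))

for name, preds in [("familiar", familiar), ("novel", novel)]:
    print(f"{name}: disagreement={ensemble_disagreement(preds):.4f}, "
          f"reliable={is_reliable(preds)}")
```

When `is_reliable` returns False, the planner would hand control to the conservative fallback policy instead of acting on the imagined rollout.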

⚔ Adversarial: Your CEM planner consistently chooses actions that look great in imagination but fail in reality. The world model's training loss is low. What's the most likely cause?
You've trained a world model with low reconstruction loss and low KL. CEM planning looks beautiful in imagined rollouts — predicted rewards are high. But when executed in the real environment, the plan fails at step 3-4. The model's one-step predictions are accurate.

Chapter 7: Open Problems

World models are powerful but far from solved. Key challenges remain:

Problem | Why It's Hard | Current Approaches
Compounding errors | Prediction errors accumulate over long rollouts | Shorter horizons, latent space, ensembles
Partial observability | The agent can't see everything | Recurrent state (RSSM), memory
Stochastic environments | Multiple futures are possible | Stochastic latents, discrete codes
Generalization | Transfer between environments | Foundation world models (Genie, Cosmos)
Computational cost | Planning is expensive at test time | Amortized policies, model distillation

Compounding Errors: The Numbers

If your world model has 2% per-step prediction error, how bad does it get over long horizons?

Horizon | Cumulative Error | Usable?
H = 5 | ~10% | Solid: plans are reliable
H = 15 | ~26% | Dreamer's sweet spot: noisy but useful
H = 30 | ~45% | Degraded: imagination diverges from reality
H = 50 | ~64% | Broken: model is hallucinating entire scenarios

This is why Dreamer uses H=15 by default. Beyond that, the imagined world is too different from reality for the policy to transfer. The error compounds roughly as 1 - (1 - ε)^H, so even a "good" model becomes unreliable if you dream too far ahead.
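The table values follow directly from the compounding formula with ε = 0.02. The sketch below recomputes them and also inverts the formula to find the longest horizon that stays under a chosen error budget. The 30% budget is an arbitrary example, and the formula itself is the rough approximation used in the text, which treats per-step errors as independent.

```python
import math

def cumulative_error(per_step_error, horizon):
    """Approximate accumulated error after `horizon` imagined steps: 1 - (1 - eps)^H."""
    return 1.0 - (1.0 - per_step_error) ** horizon

def max_horizon(per_step_error, error_budget):
    """Longest horizon whose cumulative error stays at or below the budget."""
    return math.floor(math.log(1.0 - error_budget) / math.log(1.0 - per_step_error))

for horizon in (5, 15, 30, 50):
    print(f"H = {horizon:>2}: {cumulative_error(0.02, horizon):.1%}")

print("longest horizon under a 30% budget:", max_horizon(0.02, 0.30))
```

Halving the per-step error roughly doubles the usable horizon, which is why small gains in one-step accuracy matter so much for long rollouts.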

The open-world failure: Compounding error gets much worse when the agent enters situations the model hasn't seen. A Dreamer agent trained in one area of Minecraft will imagine reasonable physics for familiar terrain. Enter a new biome with different dynamics (water, lava, ice) and the imagination produces nonsense — leading to dangerous plans based on wrong predictions.
Compounding Error over Horizon

Watch how prediction error grows with each imagined step. The red band shows the uncertainty growing. Move the slider to see how model quality affects the usable horizon.

Model quality: 60%

The Open-World Problem

Compounding error gets dramatically worse when the agent encounters situations outside its training distribution. Consider a robot trained in a warehouse:

Scenario | Model Quality | Why
Picking known boxes | Excellent | Seen thousands of times in training
New box shape | Good | Similar dynamics, model generalizes
Wet floor | Poor | Different friction; model never saw this
Human walks through workspace | Terrible | Completely novel dynamic object; imagination hallucinates
The core challenge: In novel situations, the world model's predictions become unreliable, but it doesn't know they're unreliable. The agent confidently plans based on wrong predictions. Modeling epistemic uncertainty (knowing what you don't know) is the key missing piece. Some approaches: train an ensemble of models and treat their disagreement as an uncertainty signal, or use Bayesian neural networks to estimate prediction confidence.
The fundamental tension: Long planning horizons give better decisions but accumulate more error. Short horizons are accurate but myopic. Dreamer's H=15 works. H=50 degrades. Making models accurate enough for H=100+ — enabling long-term strategic planning — is the central open problem in world modeling.
Check: Why do world model predictions degrade over long horizons?
⚔ Adversarial: Your world model predicts perfectly for 5 steps then diverges catastrophically at step 6+. Training loss is low. What's failing?
You're training an RSSM world model on a robot manipulation task. During evaluation, you roll out the prior for imagination. Steps 1-5 are eerily accurate — predicted states match reality within 2% error. At step 6, the model suddenly produces nonsensical states that don't correspond to any physical reality. The training loss (reconstruction + KL + reward) converges to a low value.

Chapter 8: The Big Picture

World models sit at the intersection of reinforcement learning, generative modeling, and representation learning. They're the foundation for agents that can reason about consequences before acting — a capability that separates reactive systems from truly intelligent ones.

Method Comparison

Approach | Prediction Space | Planning Method | Data Efficiency | Best For
Dreamer (RSSM) | Latent (1224-dim) | Backprop through model | Very high | Continuous control, Atari, Minecraft
MuZero | Latent (abstract) | MCTS tree search | High | Board games, discrete Atari
JEPA | Embedding space | Not yet integrated | N/A (repr. learning) | Video understanding, pretraining
Sora / Cosmos | Pixel space (video) | Not yet used for planning | Low (needs internet-scale data) | Video generation, simulation
TD-MPC2 | Latent | CEM + value function | High | Robotics, multi-task control
Generative AI
Image/video generation → visual world models
+
RL / Planning
Decision-making → acting in imagined worlds
=
World Models
Agents that imagine, plan, and act intelligently

Connections

World models connect deeply to other topics on this site:

Diffusion Models — Sora and Cosmos use diffusion as the generation backbone for video world models.

VLAs (Vision-Language-Action) — World models provide the "imagination engine" for robot planning.

RL Algorithms — Dreamer is model-based RL; compare with model-free PPO and SAC.

Contrastive Learning (CLIP) — JEPA's VICReg loss is a non-contrastive alternative to CLIP's InfoNCE.

MDPs — World models learn the transition function T(s'|s,a) that defines an MDP.

"A world model is a predictive engine that allows an agent to imagine the consequences of actions without performing them."
— David Ha & Jürgen Schmidhuber

You now understand how AI learns to dream. The ability to simulate the future in imagination — to ask "what if?" before committing to action — may be the most important capability an intelligent agent can have.

Check: What is the unifying idea behind all world models?