World Models: Learning to Imagine the Future

Chapter 1: Model-Based Reinforcement Learning

Before we dive in, let's define three RL terms we'll use throughout this lesson. A policy is a rule that tells the agent what action to take in each situation — think of it as the agent's strategy or decision-making function. A reward is a number the environment gives after each action, telling the agent how well it did. Reinforcement learning (RL) is the process of learning a good policy by trial and error — trying actions, observing rewards, and adjusting the strategy.

With those in hand: in model-free RL, the agent learns a policy directly from experience — try things, get rewards, adjust. In model-based RL, the agent first learns a dynamics model (a simulator of how the world works), then uses that model to plan or generate synthetic experience.

Model-Free

state → policy → action (learned from real experience)

Model-Based

state → world model → imagined rollouts → plan → action

Sample Efficiency: Model-Free vs Model-Based

Model-based methods learn faster because they extract more from each real experience. Adjust the model accuracy to see the trade-off.


The trade-off: Model-based RL is more sample-efficient (needs fewer real interactions) but the model can be wrong. A bad world model leads to plans that fail in reality. The key challenge is learning accurate enough models.

The Data Efficiency Gap

How big is the difference in practice? On the Atari 100K benchmark (only 100K environment steps allowed — about 2 hours of human play):

Method	Type	Human-Normalized Score	Env Steps
Random Agent	—	0%	—
DQN	Model-free	15%	100K
PPO	Model-free	20%	100K
DreamerV2	Model-based	51%	100K
EfficientZero	Model-based	116%	100K
Human	—	100%	—

EfficientZero beats humans with just 2 hours of play. DQN, a model-free method, barely scratches the surface with the same data. The difference: EfficientZero learns a world model and imagines millions of additional game states. Every real experience gets amplified through imagination.

The World Model Function Signature

At its core, a world model is just a function with a very specific signature:

s_t+1, r_t = f(s_t, a_t)

Given the current state s_t and an action a_t, predict the next state s_t+1 and the reward r_t. Everything else — RSSM, latent spaces, categorical distributions — is engineering to make this function more accurate, more efficient, or more general.
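To make the signature concrete, here is a minimal sketch in Python. It assumes a hypothetical toy 1-D world (a position on a line, with reward for being near a goal at 3.0) in place of a learned network. The names `toy_world_model` and `imagine_rollout` are illustrative and do not come from any library.

```python
from typing import Callable, List, Tuple

State = float
Action = float
Model = Callable[[State, Action], Tuple[State, float]]


def toy_world_model(state: State, action: Action) -> Tuple[State, float]:
    """Hypothetical stand-in for a learned f(s_t, a_t) -> (s_{t+1}, r_t)."""
    next_state = state + action        # assumed dynamics: the action shifts the position
    reward = -abs(next_state - 3.0)    # assumed reward: closer to 3.0 is better
    return next_state, reward


def imagine_rollout(model: Model, state: State,
                    actions: List[Action]) -> Tuple[List[State], float]:
    """Apply the model repeatedly; no real environment is touched."""
    states, total_reward = [state], 0.0
    for action in actions:
        state, reward = model(state, action)
        states.append(state)
        total_reward += reward
    return states, total_reward


states, total = imagine_rollout(toy_world_model, 0.0, [1.0, 1.0, 1.0])
print(states, total)
```

Any function with this shape can be swapped in for `toy_world_model`, which is the point of treating the world model as a signature rather than a specific architecture.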

Three Ways to Use a World Model

Once you have a dynamics model, there are three distinct strategies for exploiting it:

Strategy	How It Works	Example
Background planning	Generate synthetic data, add to replay buffer (a memory bank that stores past experiences so the agent can re-learn from them), train model-free agent on mixed real + imagined data	Dyna-Q, MBPO
Decision-time planning	At each real step, imagine many futures, pick the best first action	CEM, MCTS, MuZero
Analytic gradients	Backpropagate reward gradients through the model to optimize the policy	Dreamer, SVG

Each has trade-offs: Background planning is simple but slow to propagate information. Decision-time planning is powerful but expensive at every step. Analytic gradients are elegant but require the model to be differentiable end-to-end. Dreamer uses the third approach; MuZero uses the second.
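As an illustration of the first strategy, the sketch below shows the planning half of a Dyna-Q-style update in tabular form. It assumes the model is simply a dictionary of remembered transitions for a hypothetical three-state chain; the function name `dyna_planning` and all constants are chosen for this example only.

```python
import random


def dyna_planning(q, model, actions, n_updates, alpha=0.5, gamma=0.9, seed=0):
    """Replay transitions stored in the model to update Q without new real steps."""
    rng = random.Random(seed)
    visited = list(model.keys())
    for _ in range(n_updates):
        state, action = rng.choice(visited)            # pick a remembered (s, a)
        reward, next_state, done = model[(state, action)]
        best_next = 0.0 if done else max(q.get((next_state, a), 0.0) for a in actions)
        target = reward + gamma * best_next            # one-step Q-learning target
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (target - old)
    return q


# Hypothetical learned model of a chain: 0 -> 1 -> 2 (terminal, reward 1).
model = {
    (0, "right"): (0.0, 1, False),
    (1, "right"): (1.0, 2, True),
}
q = dyna_planning({}, model, actions=["right"], n_updates=200)
print(q)
```

The reward at the end of the chain reaches state 0 only after repeated replays, which is the "slow to propagate information" cost mentioned above.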

Check: What is the main advantage of model-based over model-free RL?

• It's more sample-efficient (learns from fewer real interactions)
• It always finds the optimal policy
• It doesn't need a neural network

Chapter 2: Latent Dynamics

Predicting the next image pixel-by-pixel is expensive and wasteful — most pixels don't matter for decision-making. Instead, modern world models work in a latent space: they encode observations into compact representations and predict dynamics in that compressed space.

z_t = encode(o_t) → ẑ_t+1 = dynamics(z_t, a_t) → ô_t+1 = decode(ẑ_t+1)

Pixel Space vs Latent Space

Watch how a high-dimensional observation (left) is compressed into a small latent vector (middle), and prediction happens there. Toggle between pixel and latent prediction.

Why latent? A 64×64 RGB image has 12,288 dimensions. A latent vector might have 200, so the model predicts about 61× fewer numbers per step. The encoder also learns to keep only decision-relevant information (object positions, velocities) while discarding noise (exact textures, shadows).

Concrete Shapes: Latent World Model

Let's trace the exact tensor shapes in a typical latent dynamics model (as used in PlaNet/Dreamer):

Raw Observation

o_t: [64, 64, 3] = 12,288 dimensions

↓ CNN encoder (4 conv layers)

Encoder Output

conv4: [4, 4, 256] = 4,096 dims → flatten → [4096]
Linear(4096, 200) → z_t: [200]

↓ dynamics model f(z_t, a_t)

Latent Prediction

MLP: [200 + action_dim] → 200 → 200
ẑ_t+1: [200]

↓ decoder (optional, for visualization)

Reconstructed Observation

Linear(200, 4096) → reshape → deconv → [64, 64, 3]

The compression ratio: From 12,288 dims to 200 dims = 61× compression. And prediction in this space is just an MLP forward pass — microseconds, not milliseconds. That's why you can imagine thousands of future steps in the time it takes to render one real frame.
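The arithmetic behind these shapes can be checked in a few lines. This is only a back-of-the-envelope sketch: `action_dim = 6` is an assumed value, and the multiply-add count covers just the two linear layers of the dynamics MLP traced above.

```python
# Shapes taken from the trace above; action_dim is an assumption for illustration.
obs_shape = (64, 64, 3)
latent_dim = 200
action_dim = 6

obs_dims = obs_shape[0] * obs_shape[1] * obs_shape[2]
compression = obs_dims / latent_dim

# Multiply-adds for the dynamics MLP: (200 + action_dim) -> 200 -> 200
mlp_macs = (latent_dim + action_dim) * latent_dim + latent_dim * latent_dim

print(obs_dims)               # 12288
print(round(compression, 2))  # 61.44
print(mlp_macs)               # 81200
```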

What Gets Kept, What Gets Discarded?

The encoder learns to keep decision-relevant information and discard everything else:

Kept (in the latent):
• Object positions and velocities
• Agent state (health, inventory)
• Goal-relevant features
• Dynamic elements that change with actions

Discarded:
• Exact pixel colors and textures
• Shadow patterns, lighting artifacts
• Background details that never change
• Sub-pixel rendering differences

This is why latent world models work so well for RL: the agent doesn't need to predict the exact shade of green on a leaf to decide whether to jump over a pit. It only needs to know: "pit ahead, 3 units away."

Training the Latent World Model

The world model is trained on three losses simultaneously, each teaching it something different:

L = L_recon + β · L_KL + L_reward

Loss Term	What It Teaches	Typical Weight
L_recon = ||ô_t - o_t||²	Encode enough to reconstruct the observation	1.0
L_KL = KL(posterior || prior)	Prior should match posterior (imagination becomes accurate)	β = 0.1 to 1.0
L_reward = ||r̂_t - r_t||²	Predict rewards from latent state	1.0

Why KL matters most: The KL loss is what makes imagination work. Without it, the prior (used during dreaming) and posterior (used during training with real data) would diverge. The prior would generate garbage latent states, and the policy trained in imagination would fail in reality. The β coefficient balances reconstruction quality against imagination quality. DreamerV2 added "KL balancing", which does not tune β itself but weights the two sides of the KL differently, so the prior is pulled toward the posterior faster than the posterior is regularized toward the prior.
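A small numeric sketch of the KL term and the combined loss may help. It assumes a single categorical variable with three classes and made-up scalar values for the reconstruction and reward errors; `kl_categorical` and `world_model_loss` are hypothetical helper names, not part of any Dreamer codebase.

```python
import math


def kl_categorical(q, p, eps=1e-8):
    """KL(q || p) for two discrete distributions given as probability lists."""
    return sum(qi * math.log((qi + eps) / (pi + eps)) for qi, pi in zip(q, p) if qi > 0)


def world_model_loss(recon_err, kl, reward_err, beta=1.0):
    """L = L_recon + beta * L_KL + L_reward, with scalar inputs."""
    return recon_err + beta * kl + reward_err


posterior = [0.7, 0.2, 0.1]   # hypothetical q(z | h, o)
prior = [0.4, 0.4, 0.2]       # hypothetical p(z | h)

kl = kl_categorical(posterior, prior)
loss = world_model_loss(recon_err=0.5, kl=kl, reward_err=0.1, beta=0.1)
print(round(kl, 4), round(loss, 4))
```

The KL is zero only when the prior already predicts what the posterior sees, so driving it down is what teaches the prior to imagine plausibly.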

Check: Why do world models predict in latent space instead of pixel space?

• It's cheaper and focuses on decision-relevant features
• Pixels can't be predicted by neural networks
• Latent spaces are always more accurate

🔨 Derivation: Derive the Latent Dynamics Training Objective

You have a latent world model with encoder E, dynamics model f, decoder D, and reward predictor R. The model receives sequences of (o_t, a_t, r_t) from a replay buffer.

Your task: Derive why the training loss must have three terms (reconstruction + KL + reward), and why leaving any one out causes the model to fail at imagination.

If you remove L_recon, the encoder is free to map all observations to the same latent z. The latent space becomes meaningless because there's no pressure to preserve information. The encoder could output z = [0, 0, ..., 0] for every image and the KL loss would be trivially minimized.

During imagination, you don't have observations — you can only use the prior p(z_t | h_t). During training with real data, you use the posterior q(z_t | h_t, o_t). The KL loss forces prior ≈ posterior, so the prior (used in dreams) generates latents that look like real posteriors (used in training). Without KL, the prior would generate out-of-distribution latents during dreaming.

The actor and critic train inside the dream. They need reward signals to learn. If the world model can't predict rewards from latent states, the policy has no learning signal during imagination — it would only learn from real (sparse) rewards.

Full derivation:

We want to learn a latent model where imagination produces useful training signal. This requires:

1. L_recon = ||D(z_t) - o_t||² — Forces the encoder to preserve decision-relevant information. Without it, z becomes trivial.

2. L_KL = KL(q(z|h,o) || p(z|h)) — The ELBO derivation gives this naturally. The posterior q uses real observations; the prior p uses only history. Minimizing KL makes p generate z samples that look like those from q. Since imagination uses only p, this ensures dreamed latents are realistic.

3. L_reward = ||R(h,z) - r||² — The policy optimizes imagined rewards. If these are wrong, the policy learns to optimize the wrong objective. Reward prediction must be accurate for policy training to transfer to reality.

The key insight: Up to sign, this is the evidence lower bound (ELBO) for a sequential VAE: minimizing the loss maximizes the ELBO. L_recon is the likelihood term, L_KL is the complexity penalty, and L_reward is an auxiliary task that makes the latent space reward-predictive. The three losses create a triangle of constraints: encode well ↔ imagine well ↔ evaluate well.

🔗 Pattern Recognition

The RSSM IS a Nonlinear Kalman Filter

Kalman Filter

x̂_{t|t} = x̂_{t|t-1} + K_t (z_t - H x̂_{t|t-1})   (here z_t is the measurement, not the RSSM latent)
Prior (predict) → Posterior (update with measurement)

World Model RSSM

z_t ~ q(z | h_t, o_t) corrects prior p(z | h_t)
Prior (dynamics only) → Posterior (update with observation)

Both systems maintain a belief state updated in two phases: predict (propagate forward using dynamics alone) then update (correct using new observation). The Kalman filter does this with linear Gaussians; the RSSM does it with neural networks and categorical distributions. The KL loss in the RSSM is loosely analogous to the Kalman innovation (the measurement residual inside the parentheses above): both measure how surprised the system is by the observation. The analogy is not exact, since the KL compares two distributions while the innovation is a residual in measurement space.
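For readers who have not seen the Kalman side of the comparison, here is a minimal scalar sketch of the predict and update phases. It assumes identity dynamics and H = 1; the function names are illustrative.

```python
def kalman_predict(mean, var, process_var):
    """Predict step: propagate the belief with identity dynamics plus process noise."""
    return mean, var + process_var


def kalman_update(prior_mean, prior_var, measurement, meas_var):
    """Update step: correct the prior with a measurement (H = 1)."""
    innovation = measurement - prior_mean           # how surprising the measurement is
    gain = prior_var / (prior_var + meas_var)       # Kalman gain K
    post_mean = prior_mean + gain * innovation
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var, gain


prior_mean, prior_var = kalman_predict(0.0, 0.5, process_var=0.5)
post_mean, post_var, gain = kalman_update(prior_mean, prior_var,
                                          measurement=2.0, meas_var=1.0)
print(post_mean, post_var, gain)  # 1.0 0.5 0.5
```

The gain decides how far the posterior moves from the prior. In the RSSM that role is not a single number but is learned inside the posterior network.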

Can you identify the RSSM's "Kalman gain" — the mechanism that controls how much the posterior shifts from the prior when a new observation arrives?

Checkpoint — Before you move on

Explain in your own words: why does imagination require BOTH the prior network AND the KL training loss? What would happen if you only had the posterior but no prior?


Model Answer

During imagination, you have NO observations — only the history of imagined states. The posterior q(z|h,o) requires an observation o, so you can't use it when dreaming. You need a prior p(z|h) that generates plausible latent states from history alone. But the prior is only useful if it generates states similar to what the posterior would produce with real data — which is exactly what the KL loss enforces. Without KL training, the prior generates random, unrealistic latents, and any policy trained in that imagination will fail in reality. The prior IS the imagination engine; the KL loss IS the quality control that keeps imagination grounded in reality.

Chapter 3: Dreamer (v1 → v3)

Dreamer is one of the most successful families of world-model agents. The architecture has three components: a world model (RSSM), an actor (policy), and a critic (value function). The key innovation: the actor and critic are trained entirely inside the world model's imagination.

The RSSM: Dreamer's Brain

The core of Dreamer is the Recurrent State-Space Model (RSSM). Unlike a simple latent vector, the RSSM splits state into two parts:

Deterministic state h_t
GRU hidden state: [200] dims
Carries persistent history — the "memory" of what happened. Updated every step via the GRU recurrence.

Stochastic state z_t
Categorical: 32 variables × 32 classes each = [1024]
Captures uncertainty about the current state. Sampled from a learned distribution using straight-through gradients.

Total latent state: h_t [200] + z_t [1024] = 1,224 dimensions. The split matters: h carries deterministic dynamics (ball moving along a trajectory), z captures stochastic branching (which direction the ball bounces).

Prior (imagination)

h_t = GRU(h_t-1, z_t-1, a_t-1)
ẑ_t ~ prior_net(h_t) — what the model expects

Posterior (reality check)

z_t ~ posterior_net(h_t, encode(o_t)) — what actually happened

Why two distributions? The prior predicts z_t from history alone (for imagining). The posterior updates z_t using the real observation (for training). The KL divergence between them is a training loss: it forces the prior to get better at predicting, so imagination becomes more accurate over time.

The Dreamer Training Loop

1. Collect Real Experience

Run policy in environment, store (o_t, a_t, r_t) in replay buffer

↓

2. Train World Model

Encode o_t → z_t, run RSSM, minimize:
reconstruction loss + KL(posterior || prior) + reward prediction

↓

3. Imagine H=15 Steps

From real (h, z), unroll prior forward:
for t = 1..15: h' = GRU(h, z, a), z' ~ prior(h'), r' = reward_net(h', z')
Shape: [B × 15 × 1224] imagined trajectory

↓

4. Train Actor-Critic

Backprop through entire imagined rollout!
∇θ = d(return) / d(θ) through all 15 steps

↻ repeat

The data efficiency multiplier: Each real step generates 15 imagined steps of learning signal (H=15 default), and backprop flows through the entire imagined trajectory. This is a major reason Dreamer needs far fewer environment interactions than model-free PPO. The 50× figure sometimes quoted is a rough guide rather than a consequence of H=15 alone; the actual factor depends on the task.

Version	Year	Key Improvement
Dreamer v1	2020	Latent imagination + value estimation
Dreamer v2	2021	Discrete latents (32×32 categorical), KL balancing, Atari mastery
DreamerV3	2023	Symlog predictions, fixed hyperparams, works across domains unchanged

Dreamer's Imagination Rollout

The agent imagines future states from the current state. Teal = real states, orange = imagined futures, green = rewards predicted. Notice how opacity fades — the model is less certain further into the future.


Latent Imagination: What Actually Happens Step by Step

Here's the exact computation for one imagination step. Starting from a real state (h₃, z₃) after 3 real environment steps:

pseudocode
# Current real state
h3 = [200]   # GRU hidden state from real experience
z3 = [1024]  # Stochastic state (32 categorical variables × 32 classes, one-hot, flattened)

# === Imagination step 1 ===
a3 = actor(h3, z3)              # Policy outputs action [action_dim]
h4 = GRU(h3, cat(z3, a3))      # Deterministic update: [200]
z4 ~ categorical(prior_net(h4)) # Sample stochastic: [1024]
r4 = reward_net(h4, z4)        # Predicted reward: scalar

# === Imagination step 2 ===
a4 = actor(h4, z4)
h5 = GRU(h4, cat(z4, a4))
z5 ~ categorical(prior_net(h5))
r5 = reward_net(h5, z5)

# ... repeat for H=15 total steps
# Then backprop ∇θ through ALL 15 steps!

The key trick: Because everything is differentiable (GRU, MLP, categorical with straight-through gradients), we can backpropagate the gradient of the return through the entire imagined trajectory. The actor learns: "if I take action A at step 3, through a chain of 14 more imagined steps, I get this total reward." This is fundamentally different from model-free RL, which can only learn from actually-experienced rewards. (This describes the DreamerV1-style update; later versions also use REINFORCE-style gradients inside imagination, especially for discrete actions.)

DreamerV3's breakthrough: It was the first algorithm to collect diamonds in Minecraft from scratch, without human demonstrations or curricula. The task requires long-horizon planning, exploration, and tool crafting, all learned through imagination. The same hyperparameters were used on Atari, DMC, Minecraft, and Crafter, with no tuning per domain.

Check: Where does Dreamer train its policy?

• In the real environment
• In imagined trajectories generated by the world model
• From a fixed dataset of demonstrations

🔨 Derivation: Why the Posterior Uses Observations (Not Just History)

The RSSM has two paths to estimate z_t: the prior p(z_t | h_t) which uses only the deterministic history, and the posterior q(z_t | h_t, o_t) which also sees the current observation.

Your task: Derive from Bayes' rule why the posterior must condition on o_t, and explain the information-theoretic reason the posterior is always better than the prior for training.

Training the world model requires computing ∇L_recon. If z_t is sampled from the prior (which ignores o_t), the sampled z might be far from what produced o_t, giving high-variance gradients. The posterior concentrates probability on z values that actually explain o_t, giving low-variance, informative gradients for all downstream networks.

log p(o_t) ≥ E_q[log p(o_t|z_t)] - KL(q || p). The tighter the posterior approximates the true posterior, the tighter this bound. Using the prior for training would give a loose bound — equivalent to maximizing a lower bound on a lower bound.

Full derivation:

1. Generative model: z_t ~ p(z|h_t), o_t ~ p(o|z_t,h_t). The prior captures what the model expects before seeing reality.

2. Bayes' rule: p(z_t|h_t,o_t) = p(o_t|z_t,h_t) · p(z_t|h_t) / p(o_t|h_t). The observation o_t sharpens the distribution over z — eliminating latent states that couldn't have produced this observation.

3. For training: We approximate this true posterior with q(z|h,o) = neural_net(h,encode(o)). The ELBO gives: log p(o) ≥ E_q[log p(o|z)] - KL(q||p). Using q (posterior) for training maximizes a tight bound. Using p (prior) would be sampling z randomly, hoping some explain o — astronomically wasteful.

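4. Information gain: KL(q || p) measures how much information this particular observation provides about the latent. Its expectation over observations is the conditional mutual information, I(z_t; o_t | h_t) = E_o[KL(q || p)] (exact when q is the true posterior). A large KL means the observation is highly informative: the prior was surprised.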

The key insight: The posterior is the "teacher" that shows the prior what z values actually correspond to real observations. Training with the prior alone would be like studying for an exam without an answer key — you'd never know which of your guesses were correct.

🔨 Derivation: The Dreaming Objective (Backprop Through Imagination)

Dreamer trains its policy by backpropagating through H imagined steps. The actor π_θ selects actions, the world model produces next states, and a critic estimates values.

Your task: Write out the objective J(θ) that the actor maximizes during dreaming, and show why gradients can flow through the imagination (unlike model-free RL where they can't).

The actor maximizes the sum of imagined rewards plus a terminal value estimate: J(θ) = ∑_{t=1}^{H} γ^{t-1} r̂_t + γ^H V(ŝ_H). Here r̂_t = reward_net(h_t, z_t) and V is the critic's value estimate at the end of the imagination horizon.

In model-free RL, the environment is a black box — you can't differentiate through physics. But in Dreamer, the "environment" IS the world model (a differentiable neural network). The chain is: θ → a_t → GRU → h_t+1 → prior → z_t+1 → reward_net → r_t+1. Every link is differentiable (categorical uses straight-through estimator).

∇_θ r_{t+1} = (dr/dz)(dz/dh)(dh/da)(da/dθ), plus a parallel term (dr/dh)(dh/da)(da/dθ) because the reward network also reads h directly. Each factor is a Jacobian of a neural network layer, all computable via autograd. For H steps, you just extend the chain: da/dθ at step 1 affects h at step 2, which affects z at step 2; the state (h, z) at step 2 determines both r at step 2 and a at step 2, and a at step 2 affects h at step 3...

Full derivation:

1. The objective: J(θ) = E_{π_θ}[∑_{t=1}^{H} γ^{t-1} r̂_t + γ^H V_ψ(h_H, z_H)]

2. The computation graph (one step):
a_t = π_θ(h_t, z_t)    [differentiable — policy is a neural net]
h_t+1 = GRU(h_t, [z_t, a_t])    [differentiable — GRU is a neural net]
z_t+1 ~ Cat(prior(h_t+1))    [differentiable via straight-through]
r_t+1 = reward_net(h_t+1, z_t+1)    [differentiable]

3. Gradient for step t through step t+k:
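Contribution of a_t to ∇_θ r_{t+k}: (∂r_{t+k}/∂s_{t+k}) · (∏_{i=1}^{k-1} ds_{t+i+1}/ds_{t+i}) · (∂s_{t+1}/∂a_t) · (∂a_t/∂θ)
where s = (h, z), ds_{t+i+1}/ds_{t+i} = ∂s_{t+i+1}/∂s_{t+i} + (∂s_{t+i+1}/∂a_{t+i}) · (∂a_{t+i}/∂s_{t+i}), and the product is ordered from the latest step to the earliest. The full gradient sums the analogous expression over every action from step t to step t+k-1.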

4. Why model-free can't do this: In model-free RL, s_t+1 = env(s_t, a_t). The "env" function is the real world — no Jacobian exists. You must use REINFORCE (high variance) or temporal-difference (bootstrapping). Dreamer replaces env() with a differentiable neural net, so direct backprop works.

The key insight: Dreamer makes the actor update look much more like supervised learning. The actor gets a dense, low-variance gradient signal by differentiating through the world model, much as a classifier is trained by differentiating through its loss. This is a large part of why it is more sample-efficient than policy gradient methods that rely on noisy reward-weighted log-probabilities.
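The objective J(θ) from this derivation can be evaluated numerically once the imagined rewards and the critic's terminal value are known. The sketch below assumes they are already given as plain numbers, so it shows only the discounting and bootstrapping, not the gradient; `imagined_return` is a hypothetical helper name.

```python
def imagined_return(rewards, terminal_value, gamma=0.99):
    """J = sum_{t=1}^{H} gamma^(t-1) * r_t + gamma^H * V(s_H)."""
    horizon = len(rewards)
    # enumerate starts at 0, which matches the exponent t-1 for t = 1..H
    discounted = sum(gamma ** t * r for t, r in enumerate(rewards))
    return discounted + gamma ** horizon * terminal_value


value = imagined_return([1.0, 1.0, 1.0], terminal_value=4.0, gamma=0.5)
print(value)  # 2.25
```

In Dreamer the rewards and the value are outputs of differentiable networks, so an autograd framework can differentiate this same sum with respect to the actor parameters.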

💥 Break-It Lab: What Dies When You Remove RSSM Components?

The RSSM has three critical components working together: the observation model (encoder + decoder), the stochastic path (z_t sampling), and the reconstruction loss. Each prevents a specific failure mode. Toggle them off and watch what breaks.

Remove Observation Model

Failure mode: Imagination without grounding. Without observations correcting the latent state, the prior drifts freely. After a few steps, the imagined world bears no resemblance to reality. The agent's dreams become fantasies — it might imagine getting rewards that don't exist. Equivalent to a Kalman filter that never incorporates measurements: the prediction error grows without bound.

Remove Stochastic Path (z_t)

Failure mode: Deterministic collapse. With only the deterministic h_t, the model can only predict one future per state. At branching points (will the enemy go left or right?), it averages all possibilities into a blurry mean prediction. Rewards become averaged too — the agent can't distinguish between "50% chance of +10" and "certain +5". It becomes blind to risk and opportunity.

Remove Reconstruction Loss

Failure mode: Latent drift / information loss. Without reconstruction pressure, the encoder has no reason to preserve visual information in z. The latent space collapses — all observations map to similar z values. The dynamics model might predict perfectly (ẑ_t+1 ≈ z_t+1) but the latents carry no information about the actual state. The reward predictor and policy have nothing meaningful to work with.

💻 Build It: Implement the RSSM Prior/Posterior Update Step

Implement one step of the RSSM that computes both the prior (for imagination) and posterior (for training). The GRU updates the deterministic state h, then two separate MLPs produce the prior and posterior distributions over z.

signature
def rssm_step(prev_h, prev_z, action, observation,
              gru, prior_net, posterior_net, encoder):
    """
    One RSSM step: compute deterministic state, prior, and posterior.

    Args:
        prev_h: [batch, 200] - previous GRU hidden state
        prev_z: [batch, 1024] - previous stochastic state (32 variables x 32 classes)
        action: [batch, action_dim] - action taken
        observation: [batch, 3, 64, 64] - current observation (or None during imagination)
        gru: GRUCell(input=1024+action_dim, hidden=200)
        prior_net: Linear(200, 32*32) - produces logits for prior
        posterior_net: Linear(200+embed_dim, 32*32) - produces logits for posterior
        encoder: CNN that maps [3,64,64] -> [embed_dim]

    Returns:
        h: [batch, 200] - new deterministic state
        prior_logits: [batch, 32, 32] - prior distribution parameters
        posterior_logits: [batch, 32, 32] or None - posterior (None during imagination)
        z: [batch, 1024] - sampled stochastic state (from posterior if available, else prior)
    """

Test case

h, prior_logits, post_logits, z = rssm_step(prev_h, prev_z, action, obs, gru, prior_net, post_net, enc)
assert h.shape == (B, 200)
assert prior_logits.shape == (B, 32, 32)
assert post_logits.shape == (B, 32, 32)
assert z.shape == (B, 1024) # 32*32 flattened one-hots

Use straight-through gradients: sample a discrete one-hot from the categorical, but in the backward pass, pass gradients through the softmax probabilities. In PyTorch: indices = torch.distributions.Categorical(logits=logits).sample(), then z_onehot = F.one_hot(indices, 32).float() for the forward pass and z = z_onehot + probs - probs.detach() for gradient flow. (Using logits.argmax(-1) instead would always pick the mode rather than sample.) Then flatten the 32×32 one-hots to get [batch, 1024].

python
import torch
import torch.nn.functional as F

def rssm_step(prev_h, prev_z, action, observation,
              gru, prior_net, posterior_net, encoder):
    # Step 1: Deterministic state update
    # GRU input = concat(prev_z, action)
    gru_input = torch.cat([prev_z, action], dim=-1)  # [B, 1024+act_dim]
    h = gru(gru_input, prev_h)  # [B, 200]

    # Step 2: Prior (what the model expects from dynamics alone)
    prior_logits = prior_net(h).reshape(-1, 32, 32)  # [B, 32, 32]

    # Step 3: Posterior (correct belief using real observation)
    if observation is not None:
        embed = encoder(observation)  # [B, embed_dim]
        post_input = torch.cat([h, embed], dim=-1)  # [B, 200+embed_dim]
        posterior_logits = posterior_net(post_input).reshape(-1, 32, 32)
        logits_to_sample = posterior_logits  # use posterior during training
    else:
        posterior_logits = None
        logits_to_sample = prior_logits  # use prior during imagination

    # Step 4: Sample z with straight-through gradients
    probs = F.softmax(logits_to_sample, dim=-1)  # [B, 32, 32]
    # Draw one class per categorical variable (argmax would only return the mode)
    indices = torch.distributions.Categorical(logits=logits_to_sample).sample()  # [B, 32]
    z_onehot = F.one_hot(indices, 32).float()  # [B, 32, 32]
    # Straight-through: forward uses discrete, backward uses continuous
    z_st = z_onehot + probs - probs.detach()  # [B, 32, 32]
    z = z_st.reshape(-1, 32 * 32)  # [B, 1024]

    return h, prior_logits, posterior_logits, z

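Bonus challenge: Extend this to support KL balancing (DreamerV2): compute the same KL twice with stop-gradients on opposite arguments, α·KL(sg(post)||prior) + (1-α)·KL(post||sg(prior)), where sg = stop gradient and DreamerV2 uses α = 0.8. Free-nats clipping, which ignores the KL below a small threshold, can be combined with it.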

🔗 Pattern Recognition

Latent Dynamics = A VAE at Every Timestep

Standard VAE

Encoder: x → q(z|x)
Decoder: z → p(x|z)
Loss: recon + KL(q(z|x) || p(z))

RSSM (per timestep)

Posterior: o_t, h_t → q(z_t|h_t,o_t)
Decoder: z_t, h_t → p(o_t|z_t,h_t)
Loss: recon + KL(posterior || prior)

The RSSM is a conditional VAE at each timestep, where the conditioning context is the deterministic history h_t. The prior p(z) in a standard VAE is fixed (usually N(0,1)); in the RSSM, it's learned — the prior network predicts what z should be from dynamics alone. This is what makes the "dreaming" mode possible: the learned prior replaces the observation.

In a standard VAE, KL(q||p) regularizes toward a fixed prior. In the RSSM, what does KL(posterior||prior) actually measure about the world model's understanding?

Chapter 4: JEPA — Joint Embedding Predictive Architecture

Yann LeCun proposed JEPA as an alternative to generative world models. Instead of predicting what the next observation looks like (pixel reconstruction), JEPA predicts the abstract representation of the next state. This avoids wasting capacity on irrelevant details.

predict: embed(x_t+1) ≈ predictor(embed(x_t), a_t)

The key difference from autoencoders: JEPA never reconstructs pixels. Both the target and prediction live in embedding space. A VICReg or similar loss prevents the embeddings from collapsing to trivial solutions.

The Architecture: Three Networks

JEPA has a surprisingly simple implementation with three components that must be carefully balanced:

Context Encoder f_θ

Encodes the current observation x_t → z_t
ViT backbone, trained with gradients

↓

Predictor g_φ

MLP or small transformer: predicts ẑ_t+1 from z_t
Trained with gradients

match with ↓

Target Encoder f̄_θ

Encodes the actual next observation x_t+1 → z*_t+1
EMA update only: θ̄ ← τθ̄ + (1-τ)θ, τ=0.996

Why EMA (Exponential Moving Average)? The target encoder doesn't get gradients — it's a slowly-moving copy of the context encoder. This creates a stable prediction target that changes slowly enough for the predictor to learn, but fast enough to improve over time. Without this, both encoders could agree to output zeros — perfect match, zero learning. The asymmetry breaks this collapse.
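The EMA rule above is a one-line update per parameter. The sketch below assumes parameters are stored as flat Python lists purely for illustration; in a real framework the same update would be applied to each weight tensor.

```python
def ema_update(target_params, online_params, tau=0.996):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, elementwise."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


target = [0.0, 1.0]   # hypothetical target-encoder weights
online = [1.0, 1.0]   # hypothetical context-encoder weights
for _ in range(3):
    target = ema_update(target, online)
print(target)
```

With τ = 0.996 the first weight has moved only about 1.2% of the way toward the online value after three updates, which is what keeps the prediction target stable.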

The VICReg Loss: Three Terms, Each Essential

The loss that prevents collapse has three components. Each solves a different failure mode:

Term	What It Does	What Breaks Without It
Invariance	MSE between predicted ẑ_t+1 and target z*_t+1	Predictions wouldn't match reality
Variance	Forces std(z) ≥ 1 along each dimension	All embeddings collapse to a single point (constant output)
Covariance	Decorrelates embedding dimensions	All dimensions encode the same information (rank collapse)

L = λ_inv · ||ẑ - z*||² + λ_var · max(0, 1 - std(z)) + λ_cov · ||off_diag(cov(z))||²

Think of it as a force balance: Invariance pulls predicted and target embeddings together. Variance pushes embeddings apart (prevents collapse). Covariance spreads information across dimensions. All three must be active simultaneously — remove any one and the representation degrades.
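The three terms can be written out directly. The following is a simplified sketch in plain Python rather than a reference implementation: it applies the variance and covariance terms to the predicted embeddings only (VICReg applies them to both branches), and the default weights 25, 25, 1 and the small `eps` are assumptions for illustration.

```python
import math


def vicreg_loss(pred, target, lam_inv=25.0, lam_var=25.0, lam_cov=1.0, eps=1e-4):
    """pred, target: lists of N embeddings, each a list of D floats."""
    n, d = len(pred), len(pred[0])

    # Invariance: mean squared error between predictions and targets.
    inv = sum((p - t) ** 2 for pv, tv in zip(pred, target)
              for p, t in zip(pv, tv)) / (n * d)

    # Center each embedding dimension over the batch.
    means = [sum(row[j] for row in pred) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in pred]

    # D x D covariance matrix with the 1/(N-1) estimator.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Variance: hinge that is active while a dimension's std is below 1.
    var = sum(max(0.0, 1.0 - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d

    # Covariance: squared off-diagonal entries, normalized by D.
    cov_pen = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d

    total = lam_inv * inv + lam_var * var + lam_cov * cov_pen
    return total, (inv, var, cov_pen)


collapsed = [[1.0, 1.0]] * 4                                   # every embedding identical
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]    # varied and decorrelated

collapsed_total, collapsed_terms = vicreg_loss(collapsed, collapsed)
spread_total, spread_terms = vicreg_loss(spread, spread)
print(collapsed_total, collapsed_terms)
print(spread_total, spread_terms)
```

The collapsed batch matches its target perfectly, so the invariance term is zero, yet the variance hinge still penalizes it. That is the failure the invariance term alone cannot see.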

JEPA vs Generative Prediction

Compare: generative models predict pixels (expensive, noisy), JEPA predicts embeddings (cheap, abstract).

V-JEPA: From Images to Video Understanding

V-JEPA (Video JEPA, Meta 2024) extends the idea to video. Instead of predicting the embedding of the next frame, it predicts the embedding of a masked spacetime region:

Input Video

16 frames, split into spacetime patches
90% of patches are masked (hidden from context encoder)

↓

Context Encoder + Predictor

Encode visible 10% patches, predict embeddings for masked 90%

match with ↓

Target Encoder (EMA)

Encode all patches (including masked ones) → target embeddings
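The split between visible and masked patches can be sketched as follows. This is a simplification: it masks patches uniformly at random, whereas V-JEPA masks contiguous spacetime blocks, and the 8 × 14 × 14 patch grid is an assumed size for illustration.

```python
import random


def split_patches(num_patches, mask_ratio=0.9, seed=0):
    """Randomly split patch indices into visible (context) and masked (to predict)."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(round(num_patches * mask_ratio))
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Assumed patch grid: 8 temporal x 14 x 14 spatial positions.
num_patches = 8 * 14 * 14
visible, masked = split_patches(num_patches)
print(len(visible), len(masked))
```

The context encoder would see only `visible`, the predictor would output one embedding per index in `masked`, and the EMA target encoder would supply the embeddings to match.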

The result: V-JEPA learns strong video representations without ever generating pixels. Its features capture motion and physical interactions, all learned by predicting abstract features. On video classification benchmarks, it is reported to match or beat models trained with about 10× more compute. This suggests that learning "what matters" is more efficient than learning "every pixel."

LeCun's vision: JEPA is a stepping stone toward autonomous machine intelligence. By learning abstract world models that capture what matters for planning (not pixel-perfect reconstruction), agents could reason at a human-like level of abstraction. The missing piece: connecting JEPA representations to action and reward for actual agent behavior.

Check: What does JEPA predict?

• The exact next image pixel-by-pixel
• The reward for the next state
• The embedding (abstract representation) of the next state

Chapter 5: Video Prediction — Genie & UniSim

The latest world models don't just predict latent states — they generate entire videos of what will happen next. This is world modeling at scale: train on millions of internet videos and learn a general-purpose simulator of the visual world.

How Video Tokens Work

Video generation models (like Sora) extend image generation to the temporal dimension. The key architectural idea: treat video as a 3D grid of spacetime patches:

Raw Video

T frames × H × W × 3 (e.g., 16 × 256 × 256 × 3)

↓ 3D VAE encoder (compress spatially + temporally)

Latent Video

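T/4 × H/8 × W/8 × d (e.g., 4 × 32 × 32 × 16)
= 4,096 latent positions (tokens), each with 16 channels, or 65,536 latent values in total

↓ DiT (Diffusion Transformer)

Temporal Attention

Temporal attention: each spatial position attends to the same position across all frames.
Spatial attention: the tokens within each frame attend to one another, as in a standard image transformer.

↓ 3D VAE decoder

Generated Video

16 × 256 × 256 × 3 RGB frames

Why video is harder than images: A single 256×256 image has ~1,024 latent tokens (32 × 32). The 16-frame video above still has 4,096 tokens after 4× temporal compression, and 16,384 without it. Full self-attention scores every pair of tokens, so 4,096 tokens cost 4,096² ≈ 16.8 million pairs per layer versus about 1 million for one image, and the cost grows quadratically with clip length and resolution. This is why many video models use factored attention: separate spatial-only and temporal-only attention blocks instead of full 3D attention.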
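The token counts and attention costs can be worked through explicitly. This sketch counts one token per latent position (the 16 channels are features of a token, not extra tokens) and counts attention cost as the number of query-key pairs per layer, ignoring heads and feature width.

```python
frames, height, width = 16, 256, 256
t_down, s_down, channels = 4, 8, 16   # compression factors from the trace above

lat_t, lat_h, lat_w = frames // t_down, height // s_down, width // s_down
tokens_per_frame = lat_h * lat_w
video_tokens = lat_t * tokens_per_frame
latent_values = video_tokens * channels

# Full 3D attention: every token attends to every other token.
full_pairs = video_tokens ** 2

# Factored attention: spatial attention inside each latent frame, then
# temporal attention across frames at each spatial position.
spatial_pairs = lat_t * tokens_per_frame ** 2
temporal_pairs = tokens_per_frame * lat_t ** 2
factored_pairs = spatial_pairs + temporal_pairs

print(tokens_per_frame, video_tokens, latent_values)  # 1024 4096 65536
print(full_pairs, factored_pairs)                     # 16777216 4210688
```

For this small clip the factored scheme is about 4× cheaper. The gap widens with more latent frames, because the spatial part grows linearly in the number of frames while full attention grows quadratically.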

Model	Key Idea	Training Data	Architecture
Genie (DeepMind)	Learn actions from unlabeled video	Internet gameplay videos	VQ-VAE + masked transformer
UniSim (Google)	Universal simulator of visual experience	Internet video + images	Cascaded diffusion
Sora (OpenAI)	Diffusion transformer, implicit physics	Internet video	DiT with spacetime patches
Cosmos (NVIDIA)	World foundation model for physical AI	Driving + robotics video	Autoregressive + diffusion

Video World Model: Frame Prediction

Given the current frame and an action, the model predicts future frames. Watch how prediction quality degrades over longer horizons.


From video model to world model: If a video model can predict what happens when you push a cup off a table (it falls), it has implicitly learned gravity. Video prediction at scale may be the path to general-purpose physical understanding. The question: can you act inside these models? Genie says yes — it learns an action space from pure video, making the generated world interactive.

Genie: Action Discovery from Unlabeled Video

Genie's key insight is radical: you don't need labeled actions to build an interactive world model. From millions of unlabeled gameplay videos, Genie automatically discovers a latent action space:

Two Consecutive Frames

frame_t: [256, 256, 3], frame_t+1: [256, 256, 3]

↓ latent action encoder

Discovered Action

a_t: one of 8 discrete actions (up/down/left/right/jump/...)
Learned entirely from visual differences between frames

↓ at inference: human provides discrete action

Generated Next Frame

frame_t+1 = model(frame_t, a_t) — playable world!

From passive video to interactive world: Genie trained on 200,000 hours of 2D platformer videos with zero action labels. At inference, a user can control the character by selecting actions. The model has learned physics (gravity, collisions), game logic (enemies, platforms), and visual consistency — all from pure observation.
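The quantization step at the heart of this idea can be sketched in a few lines. This is a heavily simplified illustration, not Genie's actual architecture: the linear `encoder_weights` and the random `codebook` are hypothetical stand-ins for trained networks, the encoder sees only the frame difference, and frames are shrunk to 64 × 64 to keep the example light.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, CODE_DIM = 8, 16        # 8 discrete latent actions, as in the diagram
FRAME_SHAPE = (64, 64, 3)            # smaller than 256 x 256 for speed (assumption)

# Hypothetical stand-ins for learned parameters (random here, trained in practice).
encoder_weights = rng.normal(size=(int(np.prod(FRAME_SHAPE)), CODE_DIM)) * 1e-2
codebook = rng.normal(size=(NUM_ACTIONS, CODE_DIM))

def infer_latent_action(frame_t, frame_t1):
    """Map two consecutive frames to the index of the nearest codebook entry."""
    diff = (frame_t1 - frame_t).reshape(-1)            # visual change between frames
    z = diff @ encoder_weights                          # continuous action embedding
    distances = np.linalg.norm(codebook - z, axis=1)   # distance to each of the 8 codes
    return int(np.argmin(distances))                    # discrete action id in [0, 8)

frame_t = rng.random(FRAME_SHAPE)
frame_t1 = np.roll(frame_t, shift=2, axis=1)           # same scene shifted 2 pixels
a_t = infer_latent_action(frame_t, frame_t1)
print("discovered action id:", a_t)
```

Because every frame pair must pass through this 8-way bottleneck, training pushes the codes to capture whatever change between frames is most useful for predicting the next frame, which in gameplay video tends to be the player's input.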

Check: What makes Genie special compared to traditional world models?

It learns action representations from unlabeled video
It uses a simpler architecture
It only works in simulation

Chapter 6: Planning with World Models

Having a world model is only useful if you can use it to make decisions. Planning algorithms search through imagined futures to find the best action sequence. Major approaches:

Method	How It Plans	Used In
Random Shooting	Sample many action sequences, pick best	PETS
CEM (Cross-Entropy)	Iteratively refine action distribution	PETS, TD-MPC
MCTS (Tree Search)	Build a search tree of states	MuZero, EfficientZero
Backprop through model	Gradient-based trajectory optimization	Dreamer

CEM / MPPI: Concrete Shapes

The Cross-Entropy Method (CEM) is the workhorse planning algorithm. Here's exactly what happens at each decision step for a robot arm with 6 joints and a horizon of H=12 steps:

1. Sample N Action Sequences

Draw N=512 trajectories from Gaussian:
actions: [512, 12, 6] — 512 plans, each 12 steps, 6-dim action

↓

2. Imagine Each Plan

For each of 512 plans, roll out world model 12 steps:
states: [512, 12, latent_dim], rewards: [512, 12]
Total: 6,144 imagination steps in one batch

↓

3. Rank by Total Reward

Sum rewards per trajectory: [512] scores
Keep top K=64 (elite fraction = 12.5%)

↓

4. Refit Distribution

Update μ, σ from the 64 best trajectories
Repeat 3-5 iterations, then execute first action

The real cost of planning: At each decision step, one CEM iteration imagines 512 × 12 = 6,144 future states. Over 3 refinement iterations, that's 18,432 (~18,000) world model step evaluations per action. This is why the world model must be fast (latent space, not pixels) and why GPU parallelism matters: all 512 trajectories advance together in one batched forward pass per time step, so those 18,432 evaluations need only 12 × 3 = 36 sequential batched calls.
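The four steps above can be sketched with the same shapes in NumPy. The `imagine` function is a toy stand-in for a learned world model (each action nudges the state, and reward is the negative squared distance to a hypothetical `GOAL`); only the sample, rank, and refit loop is the point here.

```python
import numpy as np

N, H, A = 512, 12, 6      # plans, horizon, action dimensions (from the text)
K, ITERS = 64, 3          # elites kept, refinement iterations
GOAL = np.ones(A)         # hypothetical target joint configuration

def imagine(state, actions):
    """Toy stand-in for a learned world model (an assumption, not a real model).
    state: [A], actions: [n, H, A] -> rewards: [n, H]."""
    states = state + 0.1 * np.cumsum(actions, axis=1)    # [n, H, A]
    return -np.sum((states - GOAL) ** 2, axis=-1)         # [n, H]

def cem_plan(state, rng):
    mu, sigma = np.zeros((H, A)), np.ones((H, A))
    for _ in range(ITERS):
        noise = rng.normal(size=(N, H, A))
        actions = np.clip(mu + sigma * noise, -1.0, 1.0)  # 1. sample N plans
        scores = imagine(state, actions).sum(axis=1)      # 2-3. imagine, sum rewards -> [N]
        elite = actions[np.argsort(scores)[-K:]]          # keep the top K -> [K, H, A]
        mu = elite.mean(axis=0)                           # 4. refit the Gaussian
        sigma = elite.std(axis=0) + 1e-6
    return mu                                             # refined mean plan [H, A]

rng = np.random.default_rng(0)
start = np.zeros(A)
plan = cem_plan(start, rng)
first_action = plan[0]                                    # execute this, then replan
plan_return = imagine(start, plan[None]).sum()
idle_return = imagine(start, np.zeros((1, H, A))).sum()
print(first_action.round(2), plan_return, idle_return)
```

Only `first_action` is executed; at the next decision step the whole procedure runs again from the new state, which is what makes this model-predictive control rather than open-loop planning.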

Planning by Random Shooting

The agent imagines many possible futures (gray) and picks the one with the highest reward (green). Click to re-plan.

Samples: 30
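Random shooting is CEM without the refinement loop: sample once, score every candidate, keep the best. A minimal sketch, again using a toy point-mass model and a hypothetical goal in place of a learned world model:

```python
import numpy as np

def random_shooting(state, goal, num_samples=30, horizon=12, rng=None):
    """Sample action sequences, score each with a toy model, return the best."""
    if rng is None:
        rng = np.random.default_rng()
    actions = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, state.shape[0]))
    states = state + 0.1 * np.cumsum(actions, axis=1)       # imagined trajectories
    returns = -np.sum((states - goal) ** 2, axis=(1, 2))    # total reward per plan
    best = int(np.argmax(returns))
    return actions[best], returns

best_plan, returns = random_shooting(
    np.zeros(2), np.ones(2), rng=np.random.default_rng(0)
)
print(best_plan.shape, returns.max())
```

With only 30 samples the best plan is usually mediocre, and the number of samples needed grows quickly with the horizon and action dimension, which is the motivation for refining the sampling distribution as CEM does.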

Dreamer vs CEM: Different Planning Philosophies

Dreamer and CEM-based methods (like TD-MPC) represent fundamentally different approaches to using a world model:

Property	Dreamer (Backprop)	CEM (Sampling)
How it plans	Gradient of reward w.r.t. policy through imagined trajectory	Sample many action sequences, keep the best
Produces	An amortized policy (reusable)	A plan for right now (replan each step)
Cost at test time	One forward pass (cheap)	Thousands of forward passes (expensive)
Handles	Continuous + discrete actions	Continuous actions primarily
Weakness	Policy may be suboptimal in novel states	Expensive at decision time, action space must be small

Think of it this way: Dreamer is like studying for an exam, where you practice (imagine) beforehand and perform quickly on test day. CEM is like having open-book access during the exam, which is slower but lets you reason about novel questions on the spot. In practice, TD-MPC2 combines both: an amortized policy gives an initial guess, then a sampling-based planner (MPPI, a close relative of CEM) refines it.

MuZero's triumph: DeepMind's MuZero learns a world model + uses MCTS planning to master Go, Chess, Shogi, and Atari — all with the same algorithm, no rules provided. It imagines game states and searches for the best move. The key insight: MuZero's world model doesn't predict observations at all — it only predicts rewards, values, and policy priors in latent space. This is enough for planning.
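The structure described here can be sketched as three functions: a representation function, a dynamics function, and a prediction function. The random weight matrices below are hypothetical stand-ins for trained networks, and the tree search itself is omitted; the point is that after the first encoding step, no observation is ever produced or consumed.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, NUM_ACTIONS = 32, 8, 4   # toy sizes (assumptions)

# Random matrices stand in for trained networks.
W_repr = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1
W_dyn = rng.normal(size=(LATENT_DIM + NUM_ACTIONS, LATENT_DIM)) * 0.1
w_reward = rng.normal(size=LATENT_DIM + NUM_ACTIONS) * 0.1
W_policy = rng.normal(size=(LATENT_DIM, NUM_ACTIONS)) * 0.1
w_value = rng.normal(size=LATENT_DIM) * 0.1

def representation(obs):
    """Observation -> latent state. The only place an observation is used."""
    return np.tanh(obs @ W_repr)

def dynamics(latent, action):
    """(latent, action) -> (next latent, predicted reward)."""
    x = np.concatenate([latent, np.eye(NUM_ACTIONS)[action]])
    return np.tanh(x @ W_dyn), float(x @ w_reward)

def prediction(latent):
    """Latent -> (policy prior over actions, value estimate)."""
    logits = latent @ W_policy
    p = np.exp(logits - logits.max())
    return p / p.sum(), float(latent @ w_value)

latent = representation(rng.normal(size=OBS_DIM))
total_reward = 0.0
for action in [0, 2, 1]:                       # one imagined branch of a search tree
    latent, reward = dynamics(latent, action)
    total_reward += reward
policy_prior, value = prediction(latent)
print(policy_prior.round(3), value, total_reward)
```

There is no decoder anywhere in this loop, so the latent state is free to discard any detail of the screen or board that does not help predict reward, value, or good moves.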

Check: How does CEM (Cross-Entropy Method) plan?

Iteratively refines a distribution over action sequences toward high-reward ones
Builds a full game tree
Uses gradient descent on actions

🏗 Design Challenge. You're the Architect: Autonomous Driving World Model

Your team is building a world model for an autonomous vehicle. The system must predict 3 seconds into the future to enable safe planning. You have 10 cameras (surround view), LiDAR, and a 100ms planning budget per decision cycle. The model runs on an NVIDIA Orin (275 TOPS INT8, 32GB shared memory).

Input

10 cameras @ 1920×1080 + LiDAR point cloud (150K points)

Planning budget

100ms total (encode + imagine + plan)

Prediction horizon

3 seconds @ 10Hz = 30 future steps

Memory

32GB shared (also running perception, mapping, etc.)

Safety requirement

Must detect when predictions are unreliable

1. Latent size? 30 steps of rollout means compounding error. Larger latent = more info preserved but slower rollout. What dimensionality?

2. How do you fuse 10 cameras + LiDAR into a single latent state? BEV? Per-camera encoding? Token fusion?

3. Do you predict in pixel space (for visualization/debugging) or pure latent (for speed)? Or both with different models?

4. How do you handle the 30-step horizon within 100ms? Can you afford CEM with 512 samples × 30 steps?

5. How do you detect unreliable predictions for safety? Ensemble disagreement? Learned uncertainty?

Real-world solutions (NVIDIA Cosmos, Wayve GAIA-1, Tesla):

Latent size: ~256-512 dims for the state. BEV representation: 200×200 grid at 0.5m resolution with 64-128 channels — but compressed to ~512-dim summary for temporal dynamics. The full BEV is used for spatial reasoning; the compressed state for temporal rollout.

Sensor fusion: Bird's Eye View (BEV) is the winning approach. Each camera is projected into a shared BEV grid using LSS (Lift-Splat-Shoot) or BEVFormer-style cross-attention. LiDAR provides direct 3D supervision. This gives a unified spatial representation regardless of camera count.

Prediction space: Industry uses BOTH. A fast latent model for planning (runs in ~20ms for 30 steps), plus a slower video prediction model for visualization and validation. The latent model predicts occupancy, velocity fields, and agent trajectories — not pixels.

Planning budget: One CEM iteration with 512 samples × 30 steps = 15,360 latent steps. At ~2μs per latent step on Orin, that's ~30ms, which fits the budget, but three refinement iterations would take ~92ms and leave almost nothing for encoding. Most production systems therefore use a learned policy (like Dreamer) for roughly 90% of situations, falling back to CEM only for complex scenarios. This gives a split of ~5ms in the typical case and ~50ms in the worst case.

Uncertainty: Ensemble of 5 world models. When they disagree on the predicted BEV occupancy by more than a threshold, the system flags "unreliable prediction" and reverts to a conservative emergency policy. This is epistemic uncertainty — knowing what you don't know.
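A minimal sketch of the ensemble check described in the last answer, assuming each of the 5 models outputs a 200 × 200 grid of occupancy probabilities. The variance-based disagreement measure and the `threshold` value of 0.05 are illustrative assumptions; a real system would calibrate both on held-out data.

```python
import numpy as np

def is_unreliable(occupancy_preds, threshold=0.05):
    """occupancy_preds: [num_models, H, W] predicted occupancy probabilities.
    Returns (flag, disagreement), where disagreement is the variance across
    models averaged over all grid cells."""
    disagreement = float(occupancy_preds.var(axis=0).mean())
    return disagreement > threshold, disagreement

rng = np.random.default_rng(0)
scene = rng.random((200, 200))                         # shared "true" occupancy

# Familiar scene: the 5 models differ only by small noise.
agree = np.clip(scene + 0.01 * rng.normal(size=(5, 200, 200)), 0.0, 1.0)
# Novel scene: each model predicts something unrelated.
disagree = rng.random((5, 200, 200))

flag_familiar, d_familiar = is_unreliable(agree)
flag_novel, d_novel = is_unreliable(disagree)
print(flag_familiar, round(d_familiar, 4))
print(flag_novel, round(d_novel, 4))
```

When the flag is raised, the planner would ignore the imagined rollout and hand control to the conservative fallback policy.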

⚔ Adversarial: Your CEM planner consistently chooses actions that look great in imagination but fail in reality. The world model's training loss is low. What's the most likely cause?

You've trained a world model with low reconstruction loss and low KL. CEM planning looks beautiful in imagined rollouts — predicted rewards are high. But when executed in the real environment, the plan fails at step 3-4. The model's one-step predictions are accurate.

The model needs more training data
The planner is exploiting model inaccuracies — finding action sequences that get high reward in the model's errors, not in reality
The action space is too small

Chapter 7: Open Problems

World models are powerful but far from solved. Key challenges remain:

Problem	Why It's Hard	Current Approaches
Compounding errors	Prediction errors accumulate over long rollouts	Shorter horizons, latent space, ensembles
Partial observability	The agent can't see everything	Recurrent state (RSSM), memory
Stochastic environments	Multiple futures are possible	Stochastic latents, discrete codes
Generalization	Transfer between environments	Foundation world models (Genie, Cosmos)
Computational cost	Planning is expensive at test time	Amortized policies, model distillation

Compounding Errors: The Numbers

If your world model has 2% per-step prediction error, how bad does it get over long horizons?

Horizon	Cumulative Error	Usable?
H = 5	~10%	Solid — plans are reliable
H = 15	~26%	Dreamer's sweet spot — noisy but useful
H = 30	~45%	Degraded — imagination diverges from reality
H = 50	~64%	Broken — model is hallucinating entire scenarios

This is why Dreamer uses H=15 by default. Beyond that, the imagined world is too different from reality for the policy to transfer. The error compounds roughly as 1 - (1 - ε)^H, so even a "good" model becomes unreliable if you dream too far ahead.
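The table can be reproduced directly from the formula 1 - (1 - ε)^H. The sketch below also inverts it to find the longest horizon that stays within a given error budget; the 30% budget is an arbitrary illustrative choice.

```python
def cumulative_error(eps, horizon):
    """Approximate accumulated error after `horizon` steps at per-step error `eps`."""
    return 1.0 - (1.0 - eps) ** horizon

def max_horizon(eps, budget):
    """Largest horizon whose cumulative error stays at or below `budget`."""
    h = 0
    while cumulative_error(eps, h + 1) <= budget:
        h += 1
    return h

for h in (5, 15, 30, 50):
    print(f"H = {h:>2}: {cumulative_error(0.02, h):.1%}")
# H =  5: 9.6%
# H = 15: 26.1%
# H = 30: 45.5%
# H = 50: 63.6%

print(max_horizon(0.02, 0.30))   # 17
```

Halving the per-step error to 1% roughly doubles the usable horizon for the same budget, which is why small gains in one-step accuracy matter so much for long rollouts.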

The open-world failure: Compounding error gets much worse when the agent enters situations the model hasn't seen. A Dreamer agent trained in one area of Minecraft will imagine reasonable physics for familiar terrain. Enter a new biome with different dynamics (water, lava, ice) and the imagination produces nonsense — leading to dangerous plans based on wrong predictions.

Compounding Error over Horizon

Watch how prediction error grows with each imagined step. The red band shows the uncertainty growing. Move the slider to see how model quality affects the usable horizon.

Model quality: 60%

The Open-World Problem

Compounding error gets dramatically worse when the agent encounters situations outside its training distribution. Consider a robot trained in a warehouse:

Scenario	Model Quality	Why
Picking known boxes	Excellent	Seen thousands of times in training
New box shape	Good	Similar dynamics, model generalizes
Wet floor	Poor	Different friction — model never saw this
Human walks through workspace	Terrible	Completely novel dynamic object — imagination hallucinates

The core challenge: In novel situations, the world model's predictions become unreliable, but it doesn't know they're unreliable. The agent confidently plans based on wrong predictions. Estimating epistemic uncertainty (knowing what you don't know) is the key missing piece. Two common approaches: train an ensemble of models and treat their disagreement as a measure of uncertainty, or use Bayesian neural networks to estimate prediction confidence.
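The ensemble idea can be demonstrated on a toy problem. In the sketch below, a 1-D regression stands in for a dynamics model (an assumption made purely for illustration): five polynomial models are fit on bootstrap resamples of data from the range [0, 1], then queried both inside and outside that range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data covers only inputs in [0, 1] (the "familiar" situations).
x_train = rng.uniform(0.0, 1.0, size=200)
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.normal(size=200)

# Ensemble of 5 models, each fit on a different bootstrap resample.
ensemble = []
for _ in range(5):
    idx = rng.integers(0, len(x_train), size=len(x_train))
    ensemble.append(np.polyfit(x_train[idx], y_train[idx], deg=5))

def disagreement(x):
    """Standard deviation of the ensemble's predictions at input x."""
    preds = np.array([np.polyval(coeffs, x) for coeffs in ensemble])
    return float(preds.std(axis=0))

in_dist = disagreement(0.5)     # inside the training range
out_dist = disagreement(2.0)    # far outside the training range
print(in_dist, out_dist)
```

Inside the training range the models are pinned down by the data and agree closely; outside it nothing constrains them, so they spread apart. That spread is the signal a planner can use to distrust its own imagination.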

The fundamental tension: Long planning horizons give better decisions but accumulate more error. Short horizons are accurate but myopic. Dreamer's H=15 works. H=50 degrades. Making models accurate enough for H=100+ — enabling long-term strategic planning — is the central open problem in world modeling.

Check: Why do world model predictions degrade over long horizons?

Small prediction errors compound at each step
The model forgets the initial state
Longer horizons need more GPU memory

⚔ Adversarial: Your world model predicts perfectly for 5 steps then diverges catastrophically at step 6+. Training loss is low. What's failing?

You're training an RSSM world model on a robot manipulation task. During evaluation, you roll out the prior for imagination. Steps 1-5 are eerily accurate — predicted states match reality within 2% error. At step 6, the model suddenly produces nonsensical states that don't correspond to any physical reality. The training loss (reconstruction + KL + reward) converges to a low value.

The model is overfitting to the training data
The reward predictor is inaccurate
The training data only contains sequences of ~5 steps, so the prior has never been trained to stay coherent beyond that horizon

Chapter 8: The Big Picture

Method Comparison

Approach	Prediction Space	Planning Method	Data Efficiency	Best For
Dreamer (RSSM)	Latent (1224-dim)	Backprop through model	Very high	Continuous control, Atari, Minecraft
MuZero	Latent (abstract)	MCTS tree search	High	Board games, discrete Atari
JEPA	Embedding space	Not yet integrated	N/A (repr. learning)	Video understanding, pretraining
Sora / Cosmos	Pixel space (video)	Not yet used for planning	Low (needs internet-scale data)	Video generation, simulation
TD-MPC2	Latent	CEM + value function	High	Robotics, multi-task control

Understand World
Models

Chapter 0: Imagining the Future

Why This Matters: The Cost of Real Experience

Chapter 1: Model-Based Reinforcement Learning

The Data Efficiency Gap

The World Model Function Signature

Three Ways to Use a World Model

Chapter 2: Latent Dynamics

Concrete Shapes: Latent World Model

What Gets Kept, What Gets Discarded?

Training the Latent World Model

Chapter 3: Dreamer (v1 → v3)

The RSSM: Dreamer's Brain

The Dreamer Training Loop

Latent Imagination: What Actually Happens Step by Step

Chapter 4: JEPA — Joint Embedding Predictive Architecture

The Architecture: Three Networks

The VICReg Loss: Three Terms, Each Essential

V-JEPA: From Images to Video Understanding

Chapter 5: Video Prediction — Genie & UniSim

How Video Tokens Work

Genie: Action Discovery from Unlabeled Video

Chapter 6: Planning with World Models

CEM / MPPI: Concrete Shapes

Dreamer vs CEM: Different Planning Philosophies

Chapter 7: Open Problems

Compounding Errors: The Numbers

The Open-World Problem

Chapter 8: The Big Picture

Method Comparison

Connections
