Bruce, Dennis, Edwards, Parker-Holder, Shi et al. — ICML 2024 Best Paper

Genie: Generative Interactive Environments

The first foundation world model trained unsupervised on unlabeled internet video — learns controllable 2D world models that turn any single image into a playable interactive environment.

Prerequisites: VQ-VAE basics + Transformers + Video generation
10 Chapters · 5+ Simulations

Chapter 0: The Problem

Imagine you want to build a system that understands how 2D worlds work — how characters jump, platforms move, enemies patrol. You want a model that can take a single image and turn it into a playable game. An interactive environment you can step through, frame by frame, with a controller.

Prior world models (Dreamer, IRIS, GameGAN) can do this, but they all share one crippling requirement: they need action-labeled data. Every frame must come paired with the action that produced it — "jump," "move left," "fire." This means you need either:

Meanwhile, the internet is overflowing with gameplay video. YouTube alone has millions of hours of platformer footage. But none of it comes with action labels. A viewer watching a Mario speedrun sees the character jump — but the video file contains no record of which button was pressed.

The fundamental bottleneck: World models are data-hungry, but the richest source of world data — internet video — is completely unlabeled. No actions, no rewards, no state vectors. Just pixels over time. If we could learn world models from raw video alone, we'd unlock orders of magnitude more training data than any simulator can provide.
Why can't existing world models simply learn from internet gameplay videos?

Chapter 1: The Key Insight

Genie's breakthrough: you don't need action labels at all. Actions can be discovered from video, unsupervised.

Think about it — if you watch a platformer video and see a character jump between two consecutive frames, the information about which action was taken is already encoded in the difference between those frames. The character was on the ground in frame t, and airborne in frame t+1. Something caused that transition. We don't need someone to tell us "the player pressed jump" — we can infer that some action happened, and learn to represent it as a discrete code.

Genie formalizes this with a Latent Action Model (LAM): a network that looks at consecutive frames and outputs a discrete action code from a small vocabulary (just 8 codes). The model is never told what these codes mean. It discovers the action space entirely from the statistics of frame-to-frame transitions in video data.

The core architecture — three components, one goal:
  1. Video Tokenizer (VQ-VAE): compresses raw video frames into discrete spatial tokens
  2. Latent Action Model: infers discrete action codes between consecutive frames — no labels needed
  3. Dynamics Model: given the current frame tokens + a latent action, predicts the next frame tokens
At inference, the LAM encoder and decoder are discarded and only the learned 8-code action codebook is kept. A human chooses actions from that vocabulary, and the dynamics model generates what happens next. A single image becomes a playable world.

The magic: because latent actions are learned from video statistics, they end up corresponding to semantically meaningful directions of motion. One code consistently means "move right," another "jump," and so on — without any supervision telling the model what these actions should be.

How does Genie discover actions without any labels?

Chapter 2: Video Tokenizer

Before the model can reason about dynamics, it needs a compact representation of video frames. Raw pixels are too high-dimensional — a 160x90 frame at 3 channels is 43,200 numbers. The video tokenizer compresses each frame into a small grid of discrete tokens.

VQ-VAE for Video

The tokenizer is a VQ-VAE (Vector Quantized Variational Autoencoder). It takes $T$ frames of video $x_{1:T}$ and produces discrete token grids $z_{1:T}$:

$$x_{1:T} = (x_1, x_2, \dots, x_T) \in \mathbb{R}^{T \times H \times W \times C} \;\to\; z_{1:T} = (z_1, z_2, \dots, z_T) \in \mathbb{Z}^{T \times D}$$

Each frame gets mapped to D discrete tokens, where each token is an index into a codebook of 1,024 learned embeddings. The patch size is 4, so a 160x90 frame becomes a 40x23 grid of 920 tokens (90/4 = 22.5, rounded up to 23 rows). That is roughly 47x fewer elements than the 43,200 raw pixel values.
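The grid size and compression ratio quoted above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `token_grid` is hypothetical, and rounding partial patches up (as if the frame were padded) is an assumption made here to reproduce the 23-row figure.

```python
import math

def token_grid(width, height, patch):
    """Return (cols, rows) of the token grid, rounding partial patches up."""
    return math.ceil(width / patch), math.ceil(height / patch)

width, height, channels, patch = 160, 90, 3, 4

cols, rows = token_grid(width, height, patch)
tokens_per_frame = cols * rows
raw_values = width * height * channels
compression = raw_values / tokens_per_frame

print(cols, rows, tokens_per_frame)  # 40 23 920
print(raw_values)                    # 43200
print(round(compression, 1))         # 47.0
```

Note that the ratio counts elements only: each token is one codebook index, while each raw value is one pixel channel.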

The ST-Transformer Architecture

Unlike prior work that uses spatial-only tokenizers (compress each frame independently), Genie's tokenizer (ST-ViViT) uses a spatiotemporal (ST) transformer in both the encoder and decoder. Each ST block has three layers:

  1. Spatial layer: self-attention over the H×W tokens within a single frame
  2. Temporal layer: self-attention over the same spatial position across T frames (causal — each frame only sees past frames)
  3. Feed-forward layer: standard MLP
Why spatiotemporal? A spatial-only tokenizer treats each frame in isolation, so it can't capture motion or temporal context. The causal temporal attention means each token $z_t$ encodes information from all previously seen frames $x_{1:t}$, not just the current frame. This produces higher-quality reconstructions and better downstream video generation. The cost of the dominant spatial attention scales linearly with the number of frames (unlike C-ViViT, whose full spatiotemporal attention scales quadratically).
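To make the linear-versus-quadratic claim concrete, the sketch below counts query-key pairs for one attention pass. It is a rough cost model, not a measurement: the function names are hypothetical, and the "full" baseline assumes every token attends to every token in the clip.

```python
def st_attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs in one ST block: spatial plus causal temporal attention."""
    # Spatial: each frame attends over its own tokens.
    spatial = num_frames * tokens_per_frame ** 2
    # Temporal (causal): at each spatial position, frame t attends to frames 1..t.
    temporal = tokens_per_frame * num_frames * (num_frames + 1) // 2
    return spatial + temporal

def full_attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs if all tokens in the clip attend to each other."""
    return (num_frames * tokens_per_frame) ** 2

T, N = 16, 920  # 16 frames, 920 tokens per frame
st = st_attention_pairs(T, N)
full = full_attention_pairs(T, N)

print(st)    # 13667520
print(full)  # 216678400
```

Doubling the number of frames roughly doubles the ST count (the spatial term dominates), while the full-attention count grows fourfold.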

The tokenizer uses 200M parameters, with a codebook of 1,024 codes and embedding dimension 32. It's trained with the standard VQ-VAE objective: minimize reconstruction error while keeping the codebook well-utilized.
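The quantization step at the heart of a VQ-VAE is a nearest-neighbor lookup into the codebook. The sketch below illustrates that lookup on a toy codebook; the function name `quantize` and the 4-code, 2-dimensional codebook are made up for illustration and are much smaller than the 1,024 codes of dimension 32 described above.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

# Toy codebook: 4 codes of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

index, code = quantize([0.9, 0.2], codebook)
print(index, code)  # 1 [1.0, 0.0]
```

The discrete token stored for a patch is the index; the embedding is recovered by looking the index up again at decode time.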

Why does Genie use a spatiotemporal (ST-ViViT) tokenizer instead of a spatial-only one?

Chapter 3: Latent Action Model

This is Genie's most novel component — the part that makes unsupervised world modeling possible. The Latent Action Model (LAM) watches pairs of consecutive video frames and figures out what "action" caused the transition, without ever being told what actions exist.

Architecture

The LAM takes raw pixel frames (not tokens) as input. It processes the entire video $x_{1:T}$ through an ST-Transformer and outputs a sequence of $T-1$ latent action codes $\tilde{a}_{1:T-1}$, one for each frame transition:

$$\mathrm{LAM}(x_{1:T}) \to \tilde{a}_{1:T-1} \in \{0, 1, \dots, |A|-1\}^{T-1}$$

The action vocabulary size |A| is deliberately tiny: just 8 codes. This constraint is crucial: it forces the model to discover only the most meaningful dimensions of variation between frames, and it makes the resulting actions human-interpretable and playable.

Training via VQ-VAE

The LAM is trained with a VQ-VAE objective. The encoder (the LAM itself) maps frame transitions to discrete codes. A decoder takes these codes along with past frames and tries to reconstruct the next frame. The decoder exists only to give the LAM a training signal — if the latent action faithfully encodes what changed between frames, the decoder can reconstruct the next frame accurately.
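For reference, the generic VQ-VAE loss (in the form introduced by the original VQ-VAE work) is written out below, adapted to next-frame reconstruction. This is a sketch under assumptions: the text above only says "a VQ-VAE objective", so the exact terms and weights Genie uses are not confirmed here. The symbols $z_e$ (encoder output), $e$ (selected codebook vector), $\mathrm{sg}[\cdot]$ (stop-gradient), and $\beta$ (commitment weight) are introduced for illustration.

```latex
\mathcal{L}
  = \underbrace{\lVert x_{t+1} - \hat{x}_{t+1} \rVert_2^2}_{\text{reconstruction}}
  + \underbrace{\lVert \mathrm{sg}[z_e] - e \rVert_2^2}_{\text{codebook}}
  + \beta \, \underbrace{\lVert z_e - \mathrm{sg}[e] \rVert_2^2}_{\text{commitment}}
```

The reconstruction term is what forces the latent action to carry information about the transition: the decoder sees past frames plus the code, so the code must explain whatever the past frames alone cannot.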

Key design choice: pixel input, not token input. The LAM takes raw pixels, not the tokenizer's discrete tokens. This is counterintuitive (why not reuse the tokens?), but the paper shows it's critical. Tokenization loses some fine-grained motion information, and a LAM operating on raw pixels captures dynamics and movement that the tokenizer's compression may smooth over. This gives roughly 44% better controllability (ΔtPSNR: 1.91 vs 1.33 on Platformers, a ratio of about 1.44).

What gets discovered?

After training, the 8 latent action codes correspond to semantically meaningful behaviors (move left, move right, jump, idle) despite the model never being told these categories exist. Each code remains consistent across different input frames: if code 3 means "move right" in one level, it means "move right" for any character or level.

At inference, the LAM is discarded. The entire LAM encoder and decoder are thrown away. Only the VQ codebook survives. A human player picks an action code (0-7), that code is looked up in the codebook to get an embedding, and that embedding is fed to the dynamics model. The LAM was just a means to learn what the action space should look like.
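The lookup described above amounts to indexing a small table. The sketch below shows the idea with placeholder values; `action_codebook` and `action_embedding` are hypothetical names, and a trained LAM would supply the real embedding vectors.

```python
NUM_ACTIONS = 8   # size of the latent action vocabulary
EMBED_DIM = 32    # embedding dimension of each code

# Placeholder values standing in for the trained VQ codebook.
action_codebook = [[0.01 * (a + 1)] * EMBED_DIM for a in range(NUM_ACTIONS)]

def action_embedding(code):
    """Look up the embedding for a user-chosen latent action code (0-7)."""
    if not 0 <= code < NUM_ACTIONS:
        raise ValueError(f"action code must be in 0..{NUM_ACTIONS - 1}, got {code}")
    return action_codebook[code]

embedding = action_embedding(3)
print(len(embedding))  # 32
```

Nothing from the LAM's encoder or decoder is needed for this step, which is why both can be dropped after training.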
Why is the LAM's VQ codebook limited to only 8 codes?

Chapter 4: Dynamics Model

The dynamics model is the "physics engine" of Genie. Given the current frame tokens and a latent action, it predicts what the next frame tokens should be. This is the component that actually generates the interactive environment at inference time.

MaskGIT Transformer

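The dynamics model is a decoder-only MaskGIT transformer. At each time step $t$, it takes the tokens of the preceding frames $z_{1:t-1}$ together with the corresponding latent actions $\tilde{a}_{1:t-1}$, and predicts the next frame tokens $\hat{z}_t$. It is trained with a cross-entropy loss between $\hat{z}_t$ and the ground-truth tokens $z_t$.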

Action as Additive Embedding

A subtle but important design choice: the latent action is not concatenated to the frame tokens (as in prior world models). Instead, the action embedding is added to the frame token embeddings:

$$\mathrm{input}_t = \mathrm{embed}(z_t) + \mathrm{embed}(\tilde{a}_t)$$

This additive approach improves controllability compared to concatenation. The action modulates the existing representation rather than occupying separate dimensions that the model might learn to ignore.
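The difference between the two conditioning schemes can be shown on toy vectors. This is an illustrative sketch: the function names are hypothetical, and broadcasting one action embedding across every token of the frame is an assumption about how the addition is applied.

```python
def add_action(token_embeddings, action_embedding):
    """Add the same action embedding to every token embedding of a frame."""
    return [[t + a for t, a in zip(tok, action_embedding)] for tok in token_embeddings]

def concat_action(token_embeddings, action_embedding):
    """Alternative: append the action embedding to each token embedding."""
    return [tok + action_embedding for tok in token_embeddings]

frame = [[1.0, 2.0], [3.0, 4.0]]  # 2 tokens, embedding width 2
action = [0.5, -0.5]

added = add_action(frame, action)
concatenated = concat_action(frame, action)

print(added)                 # [[1.5, 1.5], [3.5, 3.5]]
print(len(concatenated[0]))  # 4
```

With addition the embedding width is unchanged and every dimension is shifted by the action; with concatenation the action lives in extra dimensions of its own.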

MaskGIT Sampling

During training, a random fraction (50-100%) of the target frame's tokens are masked, and the model learns to predict the masked tokens. At inference, all tokens of the next frame start masked, and the model fills them in over 25 iterative steps — each step unmasks the tokens the model is most confident about, then re-predicts the remaining masked tokens. This produces higher-quality frames than single-step prediction.
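The iterative unmasking loop can be sketched as follows. Several parts are assumptions made for illustration: `toy_predict` is a random stand-in for the dynamics model, the cosine unmasking schedule is borrowed from the original MaskGIT formulation, and tokens are committed greedily by confidence with no temperature sampling.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decided yet

def toy_predict(tokens, rng, vocab_size):
    """Stand-in for the model: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

def maskgit_decode(num_tokens, steps, vocab_size=1024, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # every position starts masked
    for step in range(1, steps + 1):
        # Re-predict all positions given what has been committed so far.
        guesses = toy_predict(tokens, rng, vocab_size)
        # Cosine schedule: number of positions left masked after this step.
        if step == steps:
            keep_masked = 0
        else:
            keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * step / steps))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident guesses; the rest stay masked.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[: max(len(masked) - keep_masked, 0)]:
            tokens[i] = guesses[i][0]
    return tokens

frame_tokens = maskgit_decode(num_tokens=920, steps=25)
print(len(frame_tokens), MASK in frame_tokens)  # 920 False
```

With this schedule only a few tokens are committed in the early steps and most are filled in near the end, once the model has more context to condition on.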

ST-Transformer again: The dynamics model also uses the ST-Transformer architecture. The causal temporal attention means the model processes all T-1 frames and actions in parallel during training, predicting all T-1 next frames simultaneously. At inference, it runs autoregressively — generate frame 2, then use it to generate frame 3, and so on.

The final dynamics model has 10.1B parameters — the vast majority of Genie's total 10.7B. Training uses 256 TPUv5p chips for 125K steps with a batch size of 512, consuming 942B tokens.
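The 942B figure is consistent with the other numbers in this article. The check below assumes every training step processes full sequences of 16 frames (the sequence length given in the training details) at 920 tokens per frame (the tokenizer grid from Chapter 2); the variable names are chosen for illustration.

```python
steps = 125_000
batch_size = 512
frames_per_sequence = 16
tokens_per_frame = 40 * 23  # 920 tokens from the 160x90 frame at patch size 4

tokens_seen = steps * batch_size * frames_per_sequence * tokens_per_frame

print(f"{tokens_seen:,}")           # 942,080,000,000
print(f"{tokens_seen / 1e9:.0f}B")  # 942B
```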

How does the dynamics model incorporate the latent action?

Chapter 5: Training on Internet Video

Genie is trained on a massive dataset of publicly available internet videos of 2D platformer games — no action labels, no reward signals, no environment APIs. Just raw gameplay footage scraped from the web.

The Platformers Dataset

The dataset is constructed by filtering public videos for keywords related to platformer games. After filtering, the pipeline yields roughly 30,000 hours of gameplay footage.

Scale matters: This scale would be impossible with labeled data. Recording 30,000 hours of action-labeled gameplay would require hundreds of human players working for years. Genie sidesteps this entirely — the data is already on the internet, just waiting to be downloaded and learned from.

Training Details

The three components are trained separately:

  1. Video Tokenizer (200M params): VQ-VAE objective, patch size 4, codebook of 1,024 codes with dimension 32
  2. Latent Action Model (300M params): VQ-VAE objective, patch size 16, codebook of 8 codes with dimension 32
  3. Dynamics Model (10.1B params): cross-entropy on masked token prediction, batch size 512, 125K steps on 256 TPUv5p

All components use sequence length 16 (16 frames at 10 FPS = 1.6 seconds of context). The dynamics model uses bfloat16 precision and QK normalization for training stability at scale.
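A quick calculation shows how short the context window is relative to the dataset. This sketch assumes the full 30,000 hours are sampled at 10 FPS and cut into non-overlapping 16-frame windows, which is a simplification for illustration and not a description of the actual data pipeline.

```python
fps = 10
sequence_length = 16
dataset_hours = 30_000

context_seconds = sequence_length / fps
total_frames = dataset_hours * 3600 * fps
windows = total_frames // sequence_length

print(context_seconds)      # 1.6
print(f"{total_frames:,}")  # 1,080,000,000
print(f"{windows:,}")       # 67,500,000
```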

Robotics Too

To demonstrate generality, the authors also train a 2.5B-parameter Genie on robotics data: about 130K robot demonstrations from RT-1, combined with simulation data and real-robot data from prior work. Again, no action labels are used, just the video stream. The model learns consistent actions (move arm left, up, down) from robotic manipulation footage alone.

How large is the Platformers training dataset, and why would this be infeasible with traditional world model approaches?

Chapter 6: Controllable Generation

At inference time, Genie turns a single image into a playable interactive environment. Here's exactly how it works, step by step.

The Inference Pipeline

  1. Prompt with an image: The user provides a starting image $x_1$, such as a screenshot, a hand-drawn sketch, a text-to-image generation, or even a real-world photo
  2. Tokenize: The video tokenizer's encoder converts $x_1$ into discrete tokens $z_1$
  3. Choose an action: The user picks a latent action $\tilde{a}_1 \in \{0, 1, \dots, 7\}$ (like pressing a button on a controller)
  4. Look up the action embedding: The chosen action indexes into the LAM's VQ codebook to get an embedding vector
  5. Predict next frame: The dynamics model takes $z_1$ and the action embedding and generates $\hat{z}_2$ via 25 MaskGIT sampling steps (temperature = 2)
  6. Decode to pixels: The tokenizer's decoder converts $\hat{z}_2$ back to an image $\hat{x}_2$
  7. Repeat: The user sees the new frame, picks another action, and the process continues autoregressively
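The seven steps above can be sketched as a loop. Everything in `ToyGenie` is a stand-in invented for illustration: the "encoder" and "decoder" are identity functions and the "dynamics" simply shifts token values by the action code, so only the control flow reflects the pipeline.

```python
class ToyGenie:
    """Hypothetical stand-ins for the components used at inference time."""

    def __init__(self, num_actions=8):
        # Placeholder action codebook: one 1-dimensional embedding per code.
        self.action_codebook = [[float(a)] for a in range(num_actions)]

    def encode(self, image):
        return list(image)  # tokenizer encoder (identity here)

    def dynamics(self, token_history, action_history):
        # Toy rule: shift the latest frame's tokens by the latest action value.
        shift = int(action_history[-1][0])
        return [(tok + shift) % 1024 for tok in token_history[-1]]

    def decode(self, tokens):
        return list(tokens)  # tokenizer decoder (identity here)

def play(model, first_image, actions):
    tokens = [model.encode(first_image)]                  # steps 1-2
    embeddings = []
    frames = [first_image]
    for code in actions:                                  # step 3
        embeddings.append(model.action_codebook[code])    # step 4
        next_tokens = model.dynamics(tokens, embeddings)  # step 5
        tokens.append(next_tokens)
        frames.append(model.decode(next_tokens))          # step 6
    return frames                                         # step 7 is the loop

frames = play(ToyGenie(), first_image=[0, 10, 20], actions=[1, 3])
print(frames)  # [[0, 10, 20], [1, 11, 21], [4, 14, 24]]
```

The loop keeps the full token and action history because generation is autoregressive: each new frame is conditioned on everything generated so far.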
Any image becomes a game. The model generalizes to wildly out-of-distribution prompts. Sketch a stick figure on a napkin, photograph it, feed it to Genie — the model will animate it as a platformer character moving through a generated world. This works because the model learned the concept of platformer dynamics (gravity, horizontal movement, platforms), not specific pixel patterns from training games.

Emergent Behaviors

Genie exhibits several emergent capabilities never explicitly trained for:

At inference time, which component of Genie is NOT used?

Chapter 7: Results

Genie is evaluated on two axes: how good do the generated videos look (fidelity), and how much do user actions actually control what happens (controllability)?

Fidelity: Fréchet Video Distance (FVD)

FVD measures how similar the distribution of generated videos is to real videos (lower is better). On the Platformers test set, the 11B Genie achieves strong FVD scores. On Robotics, a 2.5B model achieves FVD of 82.7.

Controllability: ΔtPSNR

This is a custom metric the authors designed. The idea: generate two videos from the same starting frame. One uses actions inferred from a ground-truth video (so it should reproduce that video). The other uses random actions. If actions actually matter, these two videos should diverge significantly:

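$$\Delta_t \mathrm{PSNR} = \mathrm{PSNR}(x_t, \hat{x}_t) - \mathrm{PSNR}(x_t, \hat{x}'_t)$$

Here $x_t$ is the ground-truth frame at time $t$, $\hat{x}_t$ is the frame generated with actions inferred from the ground-truth video, and $\hat{x}'_t$ is the frame generated with random actions. Higher ΔtPSNR means actions have more effect: the model is more controllable.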
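A small worked example makes the metric concrete. The sketch below uses the standard PSNR definition on flat lists of 8-bit pixel values; the frames are made-up numbers, and the helper names `psnr` and `delta_psnr` are hypothetical.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)

def delta_psnr(truth, from_inferred, from_random):
    """PSNR with inferred actions minus PSNR with random actions."""
    return psnr(truth, from_inferred) - psnr(truth, from_random)

truth = [100, 120, 140, 160]
from_inferred = [101, 119, 141, 159]  # off by 1 everywhere, MSE = 1
from_random = [110, 110, 150, 150]    # off by 10 everywhere, MSE = 100

delta = delta_psnr(truth, from_inferred, from_random)
print(round(delta, 2))  # 20.0
```

A 100x difference in mean squared error corresponds to 20 dB. If actions had no effect, both generated frames would be equally far from the ground truth and the difference would be near zero.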

Scaling Results

The dynamics model shows clean scaling behavior:

Tokenizer Ablation

Three tokenizer architectures compared at similar parameter count:

ST-ViViT wins on both fidelity AND controllability while using less memory than C-ViViT. The causal temporal attention captures motion patterns that spatial-only ViT misses, but avoids the overfitting and quadratic cost of full spatiotemporal attention.
What does the ΔtPSNR metric measure?

Chapter 8: Foundation World Model

Genie is not just a video generator — the authors position it as a foundation world model. The learned dynamics and action space can bootstrap RL agents, even in environments the model has never seen.

Training Agents in Genie Worlds

The experiment: can latent actions learned from internet video be used to imitate expert behavior in an environment the model has never seen, using only videos of that environment plus a handful of real action labels?

  1. Latent action labeling: Feed expert gameplay videos from a target environment (CoinRun) through the frozen LAM to get latent action sequences
  2. Behavioral cloning: Train a policy network that predicts which latent action the expert took, given an observation (image)
  3. Action mapping: Use a small labeled dataset (~200 samples) to learn a mapping from latent actions to real game actions
  4. Evaluation: Deploy the agent in CoinRun and measure how many levels it solves
Result: With just 200 labeled samples for the action mapping, the LAM-based agent matches the performance of an oracle behavioral cloning agent that had access to ground-truth expert actions from the start. The latent actions are consistent and meaningful enough that a trivial mapping suffices.
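One simple way to realize the action mapping in step 3 is a majority-vote lookup table, sketched below. This is an assumption for illustration: the text only says a mapping is learned from about 200 labeled samples, not how. The function name and the sample data are made up.

```python
from collections import Counter, defaultdict

def fit_action_mapping(labeled_pairs):
    """Map each latent code to the real action it co-occurs with most often."""
    counts = defaultdict(Counter)
    for latent, real in labeled_pairs:
        counts[latent][real] += 1
    return {latent: c.most_common(1)[0][0] for latent, c in counts.items()}

# Tiny made-up sample of (latent code, real action) pairs.
labeled = [
    (3, "right"), (3, "right"), (3, "jump"),
    (5, "jump"), (5, "jump"),
    (0, "left"),
]

mapping = fit_action_mapping(labeled)
print(mapping)  # {3: 'right', 5: 'jump', 0: 'left'}
```

A table like this only works if each latent code means the same thing across frames, which is the consistency property the result above depends on.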

Why This Matters

Traditional agent training requires a simulator — you can't train an RL agent to play a game without the game itself. Genie opens a different path: train the agent in a generated world. The foundation world model provides:

This is the vision of a generalist agent — an agent that can navigate any 2D world because it was trained in a model that has seen all of them.

How many labeled samples does the LAM-based agent need to match an oracle behavioral cloning agent in CoinRun?

Chapter 9: Connections

Genie sits at a fascinating intersection of several research threads. Understanding these connections reveals why this paper matters beyond platformer games.

World Models

Ha & Schmidhuber (2018), Dreamer (Hafner et al., 2020-2023), IRIS (Micheli et al., 2023): prior world models learn dynamics from interaction, requiring action labels. Genie's contribution is eliminating this requirement entirely, scaling to internet-sized data.

Video Generation

Sora (OpenAI, 2024): generates stunning videos from text, but they're not interactive — the user can't control what happens frame-by-frame. Genie trades visual quality for controllability. A hybrid approach (Sora-quality rendering + Genie-style action conditioning) is a natural next step.

Game Generation

GameNGen (Google, 2024): generates real-time DOOM gameplay conditioned on actual game actions. Unlike Genie, it requires action labels from a specific game. Genie is more general (any platformer, any image) but lower fidelity.

Embodied AI & Robotics

The robotics experiments show Genie's approach isn't limited to games. Learning action spaces from robot manipulation video — without teleoperation labels — could eventually provide "imagination" for robots to plan in. UniSim (Yang et al., 2023) explores a similar direction but requires text labels.

Self-Supervised Learning

The LAM is conceptually related to contrastive learning and self-supervised representation learning, which also discover structure from unlabeled data. But instead of learning static features, it discovers the action space that governs temporal dynamics, which sets its objective apart from feature-learning approaches.

Foundation Models

Genie demonstrates the foundation model paradigm for world models: train once on diverse data, then specialize with minimal labeled data. Just as GPT learns language from unlabeled text and then adapts with few-shot prompting, Genie learns world dynamics from unlabeled video and then adapts with ~200 labeled action samples.

The big picture: Genie proves that the internet itself can serve as a training environment for world models. No simulators, no APIs, no action labels — just video. As the approach scales to more diverse video data (3D games, real-world footage, multi-agent scenarios), it opens a path toward universal world models that understand how any environment works, learned purely from observation.
What is the fundamental difference between Genie and Sora as approaches to video modeling?