Genie — Veanors

Chapter 0: The Problem

Imagine you want to build a system that understands how 2D worlds work — how characters jump, platforms move, enemies patrol. You want a model that can take a single image and turn it into a playable game. An interactive environment you can step through, frame by frame, with a controller.

Prior world models (Dreamer, IRIS, GameGAN) can simulate environments, but they all share one crippling requirement: they need action-labeled data. Every frame must come paired with the action that produced it: "jump," "move left," "fire." This means you need one of:

A simulator with an API that logs actions (limited to specific games you already have)
Human players who record their button presses alongside gameplay (expensive, small-scale)
RL agents playing in environments that report actions (again, need the simulator first)

Meanwhile, the internet is overflowing with gameplay video. YouTube alone has millions of hours of platformer footage. But none of it comes with action labels. A viewer watching a Mario speedrun sees the character jump — but the video file contains no record of which button was pressed.

The fundamental bottleneck: World models are data-hungry, but the richest source of world data — internet video — is completely unlabeled. No actions, no rewards, no state vectors. Just pixels over time. If we could learn world models from raw video alone, we'd unlock orders of magnitude more training data than any simulator can provide.

Why can't existing world models simply learn from internet gameplay videos?

They require action labels paired with each frame; internet video has pixels only, with no record of which buttons were pressed
Internet videos are too low resolution
YouTube has a restrictive API

Chapter 1: The Key Insight

Genie's breakthrough: you don't need action labels at all. Actions can be discovered from video, unsupervised.

Think about it — if you watch a platformer video and see a character jump between two consecutive frames, the information about which action was taken is already encoded in the difference between those frames. The character was on the ground in frame t, and airborne in frame t+1. Something caused that transition. We don't need someone to tell us "the player pressed jump" — we can infer that some action happened, and learn to represent it as a discrete code.

Genie formalizes this with a Latent Action Model (LAM): a network that looks at consecutive frames and outputs a discrete action code from a small vocabulary (just 8 codes). The model is never told what these codes mean. It discovers the action space entirely from the statistics of frame-to-frame transitions in video data.

The core architecture — three components, one goal:

Video Tokenizer (VQ-VAE): compresses raw video frames into discrete spatial tokens
Latent Action Model: infers discrete action codes between consecutive frames — no labels needed
Dynamics Model: given the current frame tokens + a latent action, predicts the next frame tokens

At inference, the LAM is discarded. A human chooses actions from the learned 8-code vocabulary, and the dynamics model generates what happens next. A single image becomes a playable world.

The magic: because latent actions are learned from video statistics, they end up corresponding to semantically meaningful directions of motion. One code consistently means "move right," another "jump," and so on — without any supervision telling the model what these actions should be.

How does Genie discover actions without any labels?

The Latent Action Model infers discrete action codes from frame-to-frame transitions; the information about what happened is already encoded in pixel differences
It uses optical flow to estimate motion vectors
It trains a classifier on a small labeled dataset first, then transfers

Chapter 2: Video Tokenizer

Before the model can reason about dynamics, it needs a compact representation of video frames. Raw pixels are too high-dimensional — a 160x90 frame at 3 channels is 43,200 numbers. The video tokenizer compresses each frame into a small grid of discrete tokens.

VQ-VAE for Video

The tokenizer is a VQ-VAE (Vector Quantized Variational Autoencoder). It takes T frames of video x_1:T and produces discrete token grids z_1:T:

x_1:T = (x₁, x₂, ..., x_T) ∈ R^(T×H×W×C) → z_1:T = (z₁, z₂, ..., z_T) ∈ Z^(T×D)

Each frame gets mapped to D discrete tokens, where each token is an index into a codebook of 1,024 learned embeddings. The patch size is 4, so a 160x90 frame becomes a 40x23 = 920 token grid (90 / 4 = 22.5, so the height is presumably rounded up to 23). That is roughly 47x fewer values than the 43,200 raw pixel values.
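The numbers in this paragraph can be checked with a few lines of arithmetic. This is only a sketch: rounding the grid height up with `math.ceil` is an assumption made here to reproduce the 40x23 figure, not something the text states.

```python
import math

# Figures quoted in the text above.
width, height, channels = 160, 90, 3
patch = 4

raw_values = width * height * channels      # numbers per raw frame
grid_w = math.ceil(width / patch)           # patches across
grid_h = math.ceil(height / patch)          # patches down (90 / 4 = 22.5, rounded up: an assumption)
tokens_per_frame = grid_w * grid_h

print(raw_values, grid_w, grid_h, tokens_per_frame)
print(round(raw_values / tokens_per_frame, 1))  # values per token, roughly 47
```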

The ST-Transformer Architecture

Unlike prior work that uses spatial-only tokenizers (compress each frame independently), Genie uses a Spatiotemporal Transformer (ST-ViViT) in both the encoder and decoder. Each ST block has three layers:

Spatial layer: self-attention over the H×W tokens within a single frame
Temporal layer: self-attention over the same spatial position across T frames (causal: each frame attends only to itself and earlier frames)
Feed-forward layer: standard MLP

Why spatiotemporal? A spatial-only tokenizer treats each frame in isolation — it can't capture motion or temporal context. The causal temporal attention means each token z_t encodes information from all previously seen frames x_1:t, not just the current frame. This produces higher-quality reconstructions and better downstream video generation. The cost scales linearly with the number of frames (unlike C-ViViT which scales quadratically).
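To make the three-layer block concrete, here is a minimal NumPy sketch. It is heavily simplified and is not the paper's implementation: there are no learned projections, no multiple heads, no normalization, and a `tanh` stands in for the MLP. The function names are hypothetical. What it does show is the factorization: spatial attention within each frame, then causal attention across time at each spatial position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # q, k, v: (..., n, d). mask: (n, n) boolean, True where attention is allowed.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def st_block(x):
    # x: (T, S, d) = frames, spatial tokens per frame, channels.
    T, S, d = x.shape
    # Spatial layer: each frame attends over its own S tokens.
    x = x + attention(x, x, x)
    # Temporal layer: each spatial position attends over time, causally.
    xt = np.swapaxes(x, 0, 1)                      # (S, T, d)
    causal = np.tril(np.ones((T, T), dtype=bool))  # frame t sees frames <= t
    xt = xt + attention(xt, xt, xt, mask=causal)
    x = np.swapaxes(xt, 0, 1)                      # back to (T, S, d)
    # Feed-forward layer: a placeholder nonlinearity stands in for the MLP.
    return x + np.tanh(x)

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 6, 8))   # 4 frames, 6 tokens each, 8 channels
out = st_block(frames)
print(out.shape)
```

In this layout the spatial layer builds one S×S attention map per frame (T of them), while the temporal layer builds one T×T map per spatial position. Because of the causal mask, changing the last frame leaves the outputs for earlier frames untouched.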

The tokenizer uses 200M parameters, with a codebook of 1,024 codes and embedding dimension 32. It's trained with the standard VQ-VAE objective: minimize reconstruction error while keeping the codebook well-utilized.
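The quantization step at the heart of a VQ-VAE can be sketched as a nearest-neighbor lookup. The snippet below is illustrative only: the codebook and encoder outputs are random stand-ins, `quantize` is a hypothetical helper, and the training losses are omitted.

```python
import numpy as np

def quantize(latents, codebook):
    # latents: (n, d) encoder outputs; codebook: (K, d) embeddings.
    # Squared distances via ||a||^2 - 2 a.b + ||b||^2, shape (n, K).
    d2 = (
        (latents ** 2).sum(axis=1)[:, None]
        - 2.0 * latents @ codebook.T
        + (codebook ** 2).sum(axis=1)[None, :]
    )
    indices = d2.argmin(axis=1)          # discrete token ids in [0, K)
    return indices, codebook[indices]    # ids and their quantized embeddings

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 32))   # 1,024 codes of dimension 32, as in the text
latents = rng.normal(size=(920, 32))     # stand-in for one frame of encoder outputs
token_ids, quantized = quantize(latents, codebook)
print(token_ids.shape, quantized.shape)
```

Each of the 920 positions ends up as a single integer id, which is what the dynamics model later consumes.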

Why does Genie use a spatiotemporal (ST-ViViT) tokenizer instead of a spatial-only one?

The causal temporal attention lets each token encode information from past frames, capturing motion context and improving reconstruction quality It uses fewer parameters Spatial-only tokenizers can't handle color images

Chapter 3: Latent Action Model

This is Genie's most novel component — the part that makes unsupervised world modeling possible. The Latent Action Model (LAM) watches pairs of consecutive video frames and figures out what "action" caused the transition, without ever being told what actions exist.

Architecture

The LAM takes raw pixel frames (not tokens) as input. It processes the entire video x_1:T through an ST-Transformer and outputs a sequence of T-1 latent action codes ã_1:T-1, one for each frame transition:

LAM(x_1:T) → ã_1:T-1 ∈ {0, 1, ..., |A|-1}^(T-1)

The action vocabulary size |A| is deliberately tiny: just 8 codes. This constraint is crucial. It forces the model to discover only the most meaningful dimensions of variation between frames, and makes the resulting actions human-interpretable and playable.

Training via VQ-VAE

The LAM is trained with a VQ-VAE objective. The encoder (the LAM itself) maps frame transitions to discrete codes. A decoder takes these codes along with past frames and tries to reconstruct the next frame. The decoder exists only to give the LAM a training signal — if the latent action faithfully encodes what changed between frames, the decoder can reconstruct the next frame accurately.
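A toy analogy may help with the intuition that discrete actions can be recovered from transitions alone. The sketch below is not the LAM: it swaps in plain 1-D k-means on synthetic frame-to-frame changes in place of the learned VQ-VAE encoder and decoder, and every name and number in it is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden ground truth: 0 = left, 1 = stay, 2 = right. Never shown to the clustering.
true_actions = rng.integers(0, 3, size=500)
# Observed change in position between consecutive "frames", with a little noise.
deltas = (true_actions - 1) + rng.normal(scale=0.05, size=500)

# k-means with 3 centers plays the role of a 3-code action codebook.
centers = np.array([deltas.min(), deltas.mean(), deltas.max()])
for _ in range(10):
    codes = np.abs(deltas[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([deltas[codes == k].mean() for k in range(3)])

print(np.round(centers, 2))
print((codes == true_actions).mean())  # fraction of transitions assigned the matching code
```

The clustering never sees the labels, yet each discovered code lines up with one underlying action. The LAM relies on the same kind of signal, but learns it from pixels with a reconstruction objective.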

Key design choice (pixel input, not token input): The LAM takes raw pixels, not the tokenizer's discrete tokens. This is counterintuitive (why not reuse the tokens?), but the paper shows it's critical. Tokenization loses some fine-grained motion information. The LAM operating on raw pixels captures dynamics and movement that the tokenizer's compression may smooth over. This gives about 44% better controllability (Δ_tPSNR: 1.91 vs 1.33 on Platformers).

What gets discovered?

After training, the 8 latent action codes correspond to semantically meaningful behaviors (move left, move right, jump, idle) despite the model never being told these categories exist. Each code remains consistent across different input frames: if a code means "move right" in one level, it means "move right" regardless of the character or level.

At inference, nearly all of the LAM is discarded: the encoder and decoder are thrown away, and only the VQ codebook survives. A human player picks an action code (0-7), that code is looked up in the codebook to get an embedding, and that embedding is fed to the dynamics model. The LAM was just a means to learn what the action space should look like.

Why is the LAM's VQ codebook limited to only 8 codes?

A tiny vocabulary forces the model to discover only the most meaningful action dimensions, and makes the actions human-playable
Larger codebooks cause training instability
8 matches the number of buttons on a standard gamepad

Chapter 4: Dynamics Model

The dynamics model is the "physics engine" of Genie. Given the current frame tokens and a latent action, it predicts what the next frame tokens should be. This is the component that actually generates the interactive environment at inference time.

MaskGIT Transformer

The dynamics model is a decoder-only MaskGIT transformer. At each time step t, it takes:

Tokenized video frames z_1:t-1 (from the video tokenizer)
Latent actions ã_1:t-1 (from the LAM, with stop-gradient)

And predicts the next frame tokens ẑ_t using a cross-entropy loss against the ground truth tokens z_t.

Action as Additive Embedding

A subtle but important design choice: the latent action is not concatenated to the frame tokens (as in prior world models). Instead, the action embedding is added to the frame token embeddings:

input_t = embed(z_t) + embed(ã_t)

This additive approach improves controllability compared to concatenation. The action modulates the existing representation rather than occupying separate dimensions that the model might learn to ignore.
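The difference between the two conditioning styles is easiest to see in array shapes. This is a sketch with random stand-in tables; `token_table`, `action_table`, and the sizes are hypothetical, and concatenation along the feature axis is just one way to read "occupying separate dimensions".

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, tokens_per_frame = 32, 920                  # illustrative sizes
token_table = rng.normal(size=(1024, d_model))       # hypothetical frame-token embeddings
action_table = rng.normal(size=(8, d_model))         # hypothetical latent-action embeddings

z_t = rng.integers(0, 1024, size=tokens_per_frame)   # token ids for frame t
a_t = 3                                              # latent action for this step

# Additive conditioning: the same action vector is added to every token embedding.
additive = token_table[z_t] + action_table[a_t]      # broadcasts to (920, 32)

# Concatenation: the action occupies extra feature dimensions instead.
tiled_action = np.broadcast_to(action_table[a_t], (tokens_per_frame, d_model))
concatenated = np.concatenate([token_table[z_t], tiled_action], axis=-1)  # (920, 64)

print(additive.shape, concatenated.shape)
```

With addition, every input feature of every token shifts when the action changes, so there is no separate block of action-only features for downstream layers to down-weight.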

MaskGIT Sampling

During training, a random fraction (50-100%) of the target frame's tokens are masked, and the model learns to predict the masked tokens. At inference, all tokens of the next frame start masked, and the model fills them in over 25 iterative steps — each step unmasks the tokens the model is most confident about, then re-predicts the remaining masked tokens. This produces higher-quality frames than single-step prediction.
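The iterative unmasking loop can be sketched as follows. Several details are assumptions made for illustration: the cosine schedule for how many tokens stay masked comes from the original MaskGIT recipe rather than from this text, `predict_logits` is a hypothetical stand-in for the dynamics model, and the stub at the bottom returns random logits.

```python
import numpy as np

def maskgit_decode(predict_logits, num_tokens, steps=25, temperature=2.0, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, -1)                 # -1 marks a masked position
    for step in range(1, steps + 1):
        masked = np.flatnonzero(tokens == -1)
        if masked.size == 0:
            break
        logits = predict_logits(tokens) / temperature        # (num_tokens, vocab)
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)
        # Sample a candidate token for every still-masked position.
        sampled = np.array([rng.choice(probs.shape[1], p=probs[i]) for i in masked])
        confidence = probs[masked, sampled]
        # Cosine schedule (assumed): number of tokens left masked after this step.
        keep_masked = int(np.floor(num_tokens * np.cos(np.pi / 2 * step / steps)))
        n_unmask = max(1, masked.size - keep_masked)
        best = np.argsort(-confidence)[:n_unmask]    # most confident candidates first
        tokens[masked[best]] = sampled[best]
    return tokens

# Stub model returning random logits, only to exercise the loop.
vocab = 16
stub_rng = np.random.default_rng(1)
frame_tokens = maskgit_decode(lambda t: stub_rng.normal(size=(t.size, vocab)), num_tokens=50)
print((frame_tokens >= 0).all())
```

Early steps commit only a few high-confidence tokens, and the final step fills in whatever is still masked.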

ST-Transformer again: The dynamics model also uses the ST-Transformer architecture. The causal temporal attention means the model processes all T-1 frames and actions in parallel during training, predicting all T-1 next frames simultaneously. At inference, it runs autoregressively — generate frame 2, then use it to generate frame 3, and so on.

The final dynamics model has 10.1B parameters — the vast majority of Genie's total 10.7B. Training uses 256 TPUv5p chips for 125K steps with a batch size of 512, consuming 942B tokens.
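The 942B figure is consistent with the other numbers in this document. As a rough check, assuming each training sequence is 16 frames (the sequence length given in the training details) and each frame is 920 tokens:

```python
batch_size = 512
frames_per_sequence = 16      # sequence length from the training details
tokens_per_frame = 920        # 40 x 23 grid from the tokenizer chapter
steps = 125_000

tokens_per_batch = batch_size * frames_per_sequence * tokens_per_frame
total_tokens = tokens_per_batch * steps

print(f"{tokens_per_batch:,} tokens per batch")
print(f"{total_tokens / 1e9:.0f}B tokens total")
```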

How does the dynamics model incorporate the latent action?

The action embedding is added to the frame token embeddings; this additive approach improves controllability versus concatenation
The action is concatenated as extra tokens in the sequence
The action conditions a separate cross-attention layer

Chapter 5: Training on Internet Video

Genie is trained on a massive dataset of publicly available internet videos of 2D platformer games — no action labels, no reward signals, no environment APIs. Just raw gameplay footage scraped from the web.

The Platformers Dataset

The dataset is constructed by filtering public videos for keywords related to platformer games. The pipeline yields:

6.8M video clips, each 16 seconds long
~30,000 hours of gameplay footage (within an order of magnitude of popular video datasets like HowTo100M)
10 FPS, 160x90 resolution
Diverse games: Mario, Sonic, Mega Man, indie platformers, and many more
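The hours figure follows directly from the clip count and clip length listed above:

```python
clips = 6_800_000
seconds_per_clip = 16
fps = 10

total_hours = clips * seconds_per_clip / 3600
frames_per_clip = seconds_per_clip * fps

print(f"{total_hours:,.0f} hours")          # close to the ~30,000 hours quoted
print(f"{frames_per_clip} frames per clip")
```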

Scale matters: This scale would be impossible with labeled data. Recording 30,000 hours of action-labeled gameplay would require hundreds of human players working for years. Genie sidesteps this entirely — the data is already on the internet, just waiting to be downloaded and learned from.

Training Details

The three components are trained separately:

Video Tokenizer (200M params): VQ-VAE objective, patch size 4, codebook of 1,024 codes with dimension 32
Latent Action Model (300M params): VQ-VAE objective, patch size 16, codebook of 8 codes with dimension 32
Dynamics Model (10.1B params): cross-entropy on masked token prediction, batch size 512, 125K steps on 256 TPUv5p

All components use sequence length 16 (16 frames at 10 FPS = 1.6 seconds of context). The dynamics model uses bfloat16 precision and QK normalization for training stability at scale.

Robotics Too

To demonstrate generality, the authors also train a 2.5B-parameter Genie on robotics data: the ~130K RT-1 demonstrations, combined with simulation data and real-robot data from prior work. Again, no action labels are used, just the video stream. The model learns consistent actions (move arm left, up, down) from robotic manipulation footage alone.

How large is the Platformers training dataset, and why would this be infeasible with traditional world model approaches?

6.8M clips (~30K hours): labeling this much data with ground-truth actions would require hundreds of human players over years
100K clips: too many games to build simulators for
1M clips: the resolution is too low for supervised learning

Chapter 6: Controllable Generation

At inference time, Genie turns a single image into a playable interactive environment. Here's exactly how it works, step by step.

The Inference Pipeline

Prompt with an image: The user provides a starting image x₁ — a screenshot, a hand-drawn sketch, a text-to-image generation, even a real-world photo
Tokenize: The video tokenizer's encoder converts x₁ into discrete tokens z₁
Choose an action: The user picks a latent action a₁ ∈ {0, 1, ..., 7} (like pressing a button on a controller)
Look up the action embedding: The chosen action indexes into the LAM's VQ codebook to get an embedding vector
Predict next frame: The dynamics model takes z₁ + the action embedding and generates ẑ₂ via 25 MaskGIT sampling steps (temperature = 2)
Decode to pixels: The tokenizer's decoder converts ẑ₂ back to an image x̂₂
Repeat: The user sees the new frame, picks another action, and the process continues autoregressively
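The pipeline above can be sketched as a short loop. Everything here is a placeholder: `encode`, `decode`, and `dynamics` are hypothetical stubs that return random tokens or blank images, and `action_codebook` is a random stand-in for the LAM's learned codebook. Only the control flow is meant to mirror the steps.

```python
import numpy as np

rng = np.random.default_rng(0)
TOKENS, VOCAB, DIM = 920, 1024, 32
action_codebook = rng.normal(size=(8, DIM))      # stand-in for the LAM's VQ codebook

def encode(image):
    # Hypothetical tokenizer encoder: image -> discrete token ids.
    return rng.integers(0, VOCAB, size=TOKENS)

def decode(tokens):
    # Hypothetical tokenizer decoder: token ids -> image (blank here).
    return np.zeros((90, 160, 3))

def dynamics(token_history, action_history):
    # Hypothetical dynamics model: would run MaskGIT sampling; here random tokens.
    return rng.integers(0, VOCAB, size=TOKENS)

def play(first_image, actions):
    token_history = [encode(first_image)]              # tokenize the prompt image
    action_history, frames = [], [first_image]
    for a in actions:                                  # user picks a code in 0..7
        action_history.append(action_codebook[a])      # look up the action embedding
        next_tokens = dynamics(token_history, action_history)  # predict next frame tokens
        token_history.append(next_tokens)              # generated frame becomes context
        frames.append(decode(next_tokens))             # decode to pixels
    return frames

frames = play(np.zeros((90, 160, 3)), actions=[0, 3, 7])
print(len(frames))
```

Each generated frame is appended to the history before the next action is applied, which is what makes the rollout autoregressive.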

Any image becomes a game. The model generalizes to wildly out-of-distribution prompts. Sketch a stick figure on a napkin, photograph it, feed it to Genie — the model will animate it as a platformer character moving through a generated world. This works because the model learned the concept of platformer dynamics (gravity, horizontal movement, platforms), not specific pixel patterns from training games.

Emergent Behaviors

Genie exhibits several emergent capabilities never explicitly trained for:

Parallax: foreground objects move faster than backgrounds, simulating depth
Object persistence: platforms and enemies remain consistent across frames
Physics: characters arc during jumps, fall with gravity, land on surfaces
Deformable objects: in robotics mode, the model simulates bags deforming when pushed

At inference time, which component of Genie is NOT used?

The LAM encoder and decoder: only the VQ codebook survives, and the human player replaces the LAM by choosing action codes directly
The video tokenizer: frames are processed as raw pixels
The dynamics model: it's only needed during training

Chapter 7: Results

Genie is evaluated on two axes: how good do the generated videos look (fidelity), and how much do user actions actually control what happens (controllability)?

Fidelity: Fréchet Video Distance (FVD)

FVD measures how similar the distribution of generated videos is to real videos (lower is better). On the Platformers test set, the full Genie model (10.7B parameters across all three components, rounded to 11B in the paper) achieves strong FVD scores. On Robotics, a 2.5B model achieves FVD of 82.7.

Controllability: Δ_tPSNR

This is a custom metric the authors designed. The idea: generate two videos from the same starting frame. One uses actions inferred from a ground-truth video (so it should reproduce that video). The other uses random actions. If actions actually matter, these two videos should diverge significantly:

Δ_tPSNR = PSNR(x_t, x̂_t) − PSNR(x_t, x̂'_t)

Where x̂_t is generated with ground-truth-inferred actions and x̂'_t with random actions. Higher Δ_tPSNR means actions have more effect — the model is more controllable.
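The metric is straightforward to compute once the three frames are available. A minimal sketch, assuming images are float arrays scaled to [0, 1] (the function names and the synthetic frames are illustrative, not from the paper):

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    # Peak signal-to-noise ratio in dB; infinite if the images are identical.
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

def delta_psnr(x_t, x_hat_t, x_hat_prime_t):
    # x_hat_t: generated with actions inferred from the ground-truth video.
    # x_hat_prime_t: generated with randomly sampled actions.
    return psnr(x_t, x_hat_t) - psnr(x_t, x_hat_prime_t)

# Synthetic example: the inferred-action frame is close to the truth,
# the random-action frame is far from it.
rng = np.random.default_rng(0)
x_t = rng.uniform(size=(90, 160, 3))
x_hat_t = x_t + rng.normal(scale=0.01, size=x_t.shape)
x_hat_prime_t = x_t + rng.normal(scale=0.1, size=x_t.shape)

print(round(delta_psnr(x_t, x_hat_t, x_hat_prime_t), 1))
```

If actions had no effect, both generated frames would be equally close to the ground truth and the difference would sit near zero.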

Scaling Results

The dynamics model shows clean scaling behavior:

Model size: final training loss decreases consistently as the dynamics model grows from 41M to 2.7B parameters
Batch size: increasing from 128 to 448 also improves performance at fixed model size
The final 10.1B model extrapolates this scaling curve with batch size 512

Tokenizer Ablation

Three tokenizer architectures compared at similar parameter count:

ViT (spatial-only): FVD 114.5, Δ_tPSNR 1.39
C-ViViT (full spatiotemporal): FVD 272.7, Δ_tPSNR 1.37 (overfits, quadratic cost)
ST-ViViT (Genie's): FVD 81.4, Δ_tPSNR 1.66 (best on both axes, linear cost)

ST-ViViT wins on both fidelity AND controllability while using less memory than C-ViViT. The causal temporal attention captures motion patterns that spatial-only ViT misses, but avoids the overfitting and quadratic cost of full spatiotemporal attention.

What does the Δ_tPSNR metric measure?

How much generated videos differ when using ground-truth-inferred actions versus random actions; higher means actions have more control over generation
The peak signal-to-noise ratio of individual frames
The difference in FVD between two model sizes

Chapter 8: Foundation World Model

Genie is not just a video generator — the authors position it as a foundation world model. The learned dynamics and action space can bootstrap RL agents, even in environments the model has never seen.

Training Agents in Genie Worlds

The experiment: can latent actions learned from internet video be used to imitate expert behavior in an unseen environment, given only videos of that environment plus a handful of action labels?

Latent action labeling: Feed expert gameplay videos from a target environment (CoinRun) through the frozen LAM to get latent action sequences
Behavioral cloning: Train a policy network that predicts which latent action the expert took, given an observation (image)
Action mapping: Use a small labeled dataset (~200 samples) to learn a mapping from latent actions to real game actions
Evaluation: Deploy the agent in CoinRun and measure how many levels it solves
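The action-mapping step can be as simple as a lookup table. The sketch below uses a majority vote over the labeled samples; this is an assumption chosen for illustration, since the text does not say how the mapping is fit, and the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def fit_action_map(latent_actions, real_actions):
    # For each latent code, pick the real action it co-occurs with most often.
    votes = defaultdict(Counter)
    for latent, real in zip(latent_actions, real_actions):
        votes[latent][real] += 1
    return {latent: counts.most_common(1)[0][0] for latent, counts in votes.items()}

# Tiny made-up labeled set: latent codes from the frozen LAM, paired with true actions.
latent_labels = [3, 3, 3, 5, 5, 1, 1, 1, 3, 5]
real_labels = ["right", "right", "right", "jump", "jump", "left", "left", "left", "jump", "jump"]

action_map = fit_action_map(latent_labels, real_labels)
print(action_map[3], action_map[5], action_map[1])
```

At evaluation time the policy predicts a latent code and the table translates it into a real game action. A table this small only works if the latent codes are consistent across frames, which is the property the experiment probes.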

Result: With just 200 labeled samples for the action mapping, the LAM-based agent matches the performance of an oracle behavioral cloning agent that had access to ground-truth expert actions from the start. The latent actions are consistent and meaningful enough that a trivial mapping suffices.

Why This Matters

Traditional agent training requires a simulator — you can't train an RL agent to play a game without the game itself. Genie opens a different path: train the agent in a generated world. The foundation world model provides:

Infinite environments: prompt with any image to create a new training level
Free exploration: the agent can take any action and see what happens, no simulator needed
Transfer: the latent action space is consistent enough to map to real actions with minimal supervision

This is the vision of a generalist agent — an agent that can navigate any 2D world because it was trained in a model that has seen all of them.

How many labeled samples does the LAM-based agent need to match an oracle behavioral cloning agent in CoinRun?

Just 200 labeled samples for the latent-to-real action mapping; the latent actions are already consistent and meaningful
10,000 labeled episodes in CoinRun
Zero: it transfers without any labeled data

Chapter 9: Connections

Genie sits at a fascinating intersection of several research threads. Understanding these connections reveals why this paper matters beyond platformer games.

World Models

Ha & Schmidhuber (2018), Dreamer (Hafner et al., 2020-2023), IRIS (Micheli et al., 2023): prior world models learn dynamics from interaction, requiring action labels. Genie's contribution is eliminating this requirement entirely, scaling to internet-sized data.

Video Generation

Sora (OpenAI, 2024): generates stunning videos from text, but they're not interactive — the user can't control what happens frame-by-frame. Genie trades visual quality for controllability. A hybrid approach (Sora-quality rendering + Genie-style action conditioning) is a natural next step.

Game Generation

GameNGen (Google, 2024): generates real-time DOOM gameplay conditioned on actual game actions. Unlike Genie, it requires action labels from a specific game. Genie is more general (any platformer, any image) but lower fidelity.

Embodied AI & Robotics

The robotics experiments show Genie's approach isn't limited to games. Learning action spaces from robot manipulation video — without teleoperation labels — could eventually provide "imagination" for robots to plan in. UniSim (Yang et al., 2023) explores a similar direction but requires text labels.

Self-Supervised Learning

The LAM is conceptually related to contrastive learning and self-supervised representation learning, which discover structure from unlabeled data. But instead of learning static features, it discovers the action space that governs temporal dynamics. This places it among self-supervised objectives aimed at dynamics rather than static features.

Foundation Models

Genie demonstrates the foundation model paradigm for world models: train once on diverse data, then specialize with minimal labeled data. Just as GPT learns language from unlabeled text and then adapts with few-shot prompting, Genie learns world dynamics from unlabeled video and then adapts with ~200 labeled action samples.

The big picture: Genie proves that the internet itself can serve as a training environment for world models. No simulators, no APIs, no action labels — just video. As the approach scales to more diverse video data (3D games, real-world footage, multi-agent scenarios), it opens a path toward universal world models that understand how any environment works, learned purely from observation.

What is the fundamental difference between Genie and Sora as approaches to video modeling?

Sora generates beautiful but non-interactive videos from text prompts; Genie generates controllable frame-by-frame interactive environments from images, trading visual quality for user agency
Sora uses transformers while Genie uses diffusion
Sora is trained on more data

Genie: Generative Interactive Environments