Genie is the first foundation world model trained without supervision on unlabeled internet video. It learns a controllable 2D world model that can turn any single image into a playable, interactive environment.
Imagine you want to build a system that understands how 2D worlds work: how characters jump, platforms move, enemies patrol. You want a model that can take a single image and turn it into a playable game, an interactive environment you can step through, frame by frame, with a controller.
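Prior world models (Dreamer, IRIS, GameGAN) can learn playable simulations of the environments they were trained on, but they all share one crippling requirement: they need action-labeled data. Every frame must come paired with the action that produced it ("jump," "move left," "fire"). In practice, that means either instrumenting the environment to log every action or annotating recordings after the fact.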
Meanwhile, the internet is overflowing with gameplay video. YouTube alone has millions of hours of platformer footage. But none of it comes with action labels. A viewer watching a Mario speedrun sees the character jump — but the video file contains no record of which button was pressed.
Genie's breakthrough: you don't need action labels at all. Actions can be discovered from video, unsupervised.
Think about it — if you watch a platformer video and see a character jump between two consecutive frames, the information about which action was taken is already encoded in the difference between those frames. The character was on the ground in frame t, and airborne in frame t+1. Something caused that transition. We don't need someone to tell us "the player pressed jump" — we can infer that some action happened, and learn to represent it as a discrete code.
Genie formalizes this with a Latent Action Model (LAM): a network that looks at consecutive frames and outputs a discrete action code from a small vocabulary (just 8 codes). The model is never told what these codes mean. It discovers the action space entirely from the statistics of frame-to-frame transitions in video data.
The magic: because latent actions are learned from video statistics, they end up corresponding to semantically meaningful directions of motion. One code consistently means "move right," another "jump," and so on — without any supervision telling the model what these actions should be.
Before the model can reason about dynamics, it needs a compact representation of video frames. Raw pixels are too high-dimensional — a 160x90 frame at 3 channels is 43,200 numbers. The video tokenizer compresses each frame into a small grid of discrete tokens.
The tokenizer is a VQ-VAE (Vector Quantized Variational Autoencoder). It takes T frames of video, x1:T, and produces a matching sequence of discrete token grids, z1:T.
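Each frame is mapped to D discrete tokens, where each token is an index into a codebook of 1,024 learned embeddings. The patch size is 4, so a 160x90 frame becomes a 40x23 grid of D = 920 tokens (90 is not divisible by 4, so the 22.5 rows round up to 23, presumably via padding). That is about 47x fewer values per frame than the 43,200 raw pixel values.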
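The arithmetic behind these numbers can be checked in a few lines. This is a sketch under one assumption that the text does not state: partial patches at the frame border are rounded up rather than dropped. The helper name `tokens_per_frame` is made up for this example.

```python
import math

def tokens_per_frame(width, height, patch):
    """Grid shape and token count when each patch x patch block becomes one token.
    Assumes partial patches at the border are padded up (ceil), which is what
    makes 90 / 4 = 22.5 come out as 23 rows."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols, rows, cols * rows

width, height, channels, patch = 160, 90, 3, 4
raw_values = width * height * channels
cols, rows, num_tokens = tokens_per_frame(width, height, patch)
ratio = raw_values / num_tokens

print(raw_values, (cols, rows), num_tokens, round(ratio, 1))
# prints: 43200 (40, 23) 920 47.0
```

Note that the ratio counts values, not bits: each token is an index into a 1,024-entry codebook, while each raw value is a pixel intensity.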
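Unlike prior work that uses spatial-only tokenizers (which compress each frame independently), Genie uses a spatiotemporal (ST) transformer in both the encoder and decoder, a design the authors call ST-ViViT. Each ST block has three layers: a spatial attention layer that attends over the tokens within a single frame, a temporal attention layer that attends over the same position across frames, and a feed-forward layer.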
The tokenizer uses 200M parameters, with a codebook of 1,024 codes and embedding dimension 32. It's trained with the standard VQ-VAE objective: minimize reconstruction error while keeping the codebook well-utilized.
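For reference, the standard VQ-VAE objective is commonly written as below, here with a squared-error reconstruction term. The symbols are not taken from this article: x is the input, x̂ the reconstruction, z_e(x) the encoder output, e the nearest codebook embedding, sg the stop-gradient operator, and β the commitment weight. Whether Genie's tokenizer uses exactly this form or weighting is not stated here, so treat it as the textbook version.

```latex
\mathcal{L}_{\text{VQ-VAE}}
  = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction}}
  + \underbrace{\lVert \operatorname{sg}[z_e(x)] - e \rVert_2^2}_{\text{codebook}}
  + \beta\, \underbrace{\lVert z_e(x) - \operatorname{sg}[e] \rVert_2^2}_{\text{commitment}}
```

The first term trains the encoder and decoder to reconstruct the input, the second pulls codebook entries toward the encoder outputs assigned to them, and the third keeps encoder outputs close to the codes they select.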
This is Genie's most novel component — the part that makes unsupervised world modeling possible. The Latent Action Model (LAM) watches pairs of consecutive video frames and figures out what "action" caused the transition, without ever being told what actions exist.
The LAM takes raw pixel frames (not tokens) as input. It processes the entire video x1:T through an ST-Transformer and outputs a sequence of T-1 latent action codes ã1:T-1, one for each frame transition.
The action vocabulary |A| is deliberately tiny: just 8 codes. This constraint is crucial — it forces the model to discover only the most meaningful dimensions of variation between frames, and makes the resulting actions human-interpretable and playable.
The LAM is trained with a VQ-VAE objective. The encoder (the LAM itself) maps frame transitions to discrete codes. A decoder takes these codes along with past frames and tries to reconstruct the next frame. The decoder exists only to give the LAM a training signal — if the latent action faithfully encodes what changed between frames, the decoder can reconstruct the next frame accurately.
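The discrete bottleneck can be illustrated with a toy example. This is a sketch, not the LAM: the "encoder" here is just the displacement of a sprite between frames, and the 8-entry codebook is hand-written, whereas the real model learns both from pixels. All names (`CODEBOOK`, `quantize`, `infer_latent_actions`) are hypothetical.

```python
import math

NUM_ACTIONS = 8  # |A| = 8, as in the text

# Hypothetical 2-D codebook: 8 unit vectors spaced 45 degrees apart.
# A trained LAM learns its codebook; these values are made up for illustration.
CODEBOOK = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4))
            for k in range(NUM_ACTIONS)]

def quantize(vec):
    """VQ step: return the index of the codebook entry closest to vec."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(NUM_ACTIONS), key=lambda k: sq_dist(CODEBOOK[k]))

def infer_latent_actions(positions):
    """Stand-in for the LAM encoder: T sprite positions -> T-1 action codes.
    The 'embedding' of each transition is just the displacement between frames;
    the real encoder is an ST-transformer over raw pixels."""
    codes = []
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        codes.append(quantize((x1 - x0, y1 - y0)))
    return codes

# Four frames: step right, jump up, step right again.
frames = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
actions = infer_latent_actions(frames)
print(actions)  # prints: [0, 2, 0]
```

Four frames yield three codes, and the two rightward steps receive the same code, which is the consistency property the bottleneck is meant to encourage.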
After training, the 8 latent action codes tend to correspond to semantically meaningful behaviors (move left, move right, jump, idle), even though the model was never told these categories exist. Each code stays consistent across different input frames: if a given code means "move right" in one level, it means "move right" for other characters and levels too.
The dynamics model is the "physics engine" of Genie. Given the current frame tokens and a latent action, it predicts what the next frame tokens should be. This is the component that actually generates the interactive environment at inference time.
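The dynamics model is a decoder-only MaskGIT transformer. At each time step t, it takes the tokens of the preceding frames, z1:t-1, together with the corresponding latent actions, ã1:t-1, and predicts the next frame's tokens ẑt. It is trained with a cross-entropy loss between ẑt and the ground-truth tokens zt.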
A subtle but important design choice: the latent action is not concatenated to the frame tokens (as in prior world models). Instead, the action embedding is added to the frame token embeddings.
This additive approach improves controllability compared to concatenation. The action modulates the existing representation rather than occupying separate dimensions that the model might learn to ignore.
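The difference between the two conditioning styles is easiest to see in terms of shapes. The sketch below uses plain lists and made-up values; the function names are illustrative, and the real model operates on learned embeddings inside a transformer.

```python
def add_action(token_embeddings, action_embedding):
    """Additive conditioning: the action shifts every token embedding;
    the embedding width stays the same."""
    return [[t + a for t, a in zip(tok, action_embedding)]
            for tok in token_embeddings]

def concat_action(token_embeddings, action_embedding):
    """Concatenation (the alternative the text contrasts with): the action
    occupies extra dimensions appended to every token embedding."""
    return [list(tok) + list(action_embedding) for tok in token_embeddings]

# Toy values: 3 tokens with embedding width 4, and one action embedding.
tokens = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8],
          [0.9, 1.0, 1.1, 1.2]]
action = [1.0, 0.0, -1.0, 0.0]

added = add_action(tokens, action)
concatenated = concat_action(tokens, action)

print(len(added[0]), len(concatenated[0]))  # prints: 4 8
```

With addition, every token embedding is changed by the action, so there are no separate action dimensions for later layers to down-weight.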
During training, a random fraction (50-100%) of the target frame's tokens are masked, and the model learns to predict the masked tokens. At inference, all tokens of the next frame start masked, and the model fills them in over 25 iterative steps — each step unmasks the tokens the model is most confident about, then re-predicts the remaining masked tokens. This produces higher-quality frames than single-step prediction.
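The iterative unmasking loop can be sketched as follows. Two things here are assumptions rather than details from the text: the cosine schedule that decides how many tokens stay masked after each step, and the random `toy_predict` stub that stands in for the dynamics model.

```python
import math
import random

MASK = -1  # sentinel for a masked token position

def toy_predict(tokens, rng, vocab_size):
    """Stand-in for the dynamics model: for every masked position, return a
    (token, confidence) guess. A real model would condition on past frames
    and the latent action; this one is random."""
    return {i: (rng.randrange(vocab_size), rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}

def maskgit_decode(num_tokens, steps=25, vocab_size=1024, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens          # inference starts fully masked
    for step in range(1, steps + 1):
        guesses = toy_predict(tokens, rng, vocab_size)
        # Cosine schedule (an assumption): how many tokens stay masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * step / steps))
        num_to_reveal = len(guesses) - keep_masked
        # Commit the most confident guesses; the rest stay masked and are
        # re-predicted on the next step.
        by_confidence = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in by_confidence[:max(num_to_reveal, 0)]:
            tokens[i] = guesses[i][0]
    return tokens

frame = maskgit_decode(num_tokens=920)
print(sum(tok == MASK for tok in frame))  # prints: 0
```

Early steps commit only a handful of tokens and later steps commit many, so the hardest positions are predicted with the most context.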
The final dynamics model has 10.1B parameters — the vast majority of Genie's total 10.7B. Training uses 256 TPUv5p chips for 125K steps with a batch size of 512, consuming 942B tokens.
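These figures are mutually consistent, which a quick calculation confirms. The sketch combines the step count and batch size above with the sequence length (16 frames) and tokens per frame (920) given elsewhere in this article; treating the token total as the product of those four numbers is an assumption, though it matches the quoted total.

```python
steps = 125_000           # training steps
batch_size = 512          # sequences per step
frames_per_sequence = 16  # sequence length in frames
tokens_per_frame = 920    # 40 x 23 token grid per frame

total_tokens = steps * batch_size * frames_per_sequence * tokens_per_frame
print(f"{total_tokens:,}")        # prints: 942,080,000,000
print(round(total_tokens / 1e9))  # prints: 942

dynamics_share = 10.1 / 10.7      # dynamics parameters / total parameters
print(round(dynamics_share * 100))  # prints: 94
```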
Genie is trained on a massive dataset of publicly available internet videos of 2D platformer games — no action labels, no reward signals, no environment APIs. Just raw gameplay footage scraped from the web.
The dataset is constructed by filtering public videos for keywords related to platformer games. The pipeline yields:
The three components are trained separately:
All components use sequence length 16 (16 frames at 10 FPS = 1.6 seconds of context). The dynamics model uses bfloat16 precision and QK normalization for training stability at scale.
To demonstrate generality, the authors also train a 2.5B-parameter Genie on robotics datasets (RT-1 demonstrations + simulation data + prior robot data — ~130K demonstrations). Again, no action labels are used — just the video stream. The model learns consistent actions (move arm left, up, down) from robotic manipulation footage alone.
At inference time, Genie turns a single image into a playable interactive environment. Here's exactly how it works, step by step.
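The overall control flow can be sketched with stub components. Everything below is a placeholder: `tokenize`, `predict_next_tokens`, and `detokenize` return random or dummy values and only mark where the tokenizer encoder, the dynamics model, and the tokenizer decoder would be called. The sketch also ignores the 16-frame context limit.

```python
import random

NUM_ACTIONS = 8          # size of the latent action vocabulary
TOKENS_PER_FRAME = 920   # 40 x 23 grid
VOCAB_SIZE = 1024        # tokenizer codebook size

def tokenize(image, rng):
    """Stub for the tokenizer encoder: image -> grid of token ids."""
    return [rng.randrange(VOCAB_SIZE) for _ in range(TOKENS_PER_FRAME)]

def predict_next_tokens(token_history, action_history, rng):
    """Stub for the dynamics model: past tokens + latent actions -> next tokens."""
    assert len(token_history) == len(action_history)
    return [rng.randrange(VOCAB_SIZE) for _ in range(TOKENS_PER_FRAME)]

def detokenize(tokens):
    """Stub for the tokenizer decoder: token ids -> displayable frame."""
    return {"num_tokens": len(tokens)}

def play(prompt_image, chosen_actions, seed=0):
    rng = random.Random(seed)
    token_history = [tokenize(prompt_image, rng)]   # 1. tokenize the prompt image
    action_history = []
    frames = []
    for action in chosen_actions:                   # 2. the player picks a latent action
        if not 0 <= action < NUM_ACTIONS:
            raise ValueError(f"latent action must be in [0, {NUM_ACTIONS})")
        action_history.append(action)
        # 3. the dynamics model predicts the next frame's tokens
        next_tokens = predict_next_tokens(token_history, action_history, rng)
        token_history.append(next_tokens)
        frames.append(detokenize(next_tokens))      # 4. decode tokens for display
    return frames

frames = play(prompt_image="any single image", chosen_actions=[0, 0, 3, 5])
print(len(frames))  # prints: 4
```

Note that the LAM itself does not appear in the loop: at play time the player supplies the action code directly, so only the learned action vocabulary is reused.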
Genie exhibits several emergent capabilities never explicitly trained for:
Genie is evaluated on two axes: how good do the generated videos look (fidelity), and how much do user actions actually control what happens (controllability)?
Fidelity is measured with Fréchet Video Distance (FVD), which scores how similar the distribution of generated videos is to that of real videos (lower is better). On the Platformers test set, the 11B Genie (the 10.7B-parameter model, rounded up) achieves strong FVD scores. On Robotics, a 2.5B model achieves an FVD of 82.7.
Controllability is measured with ΔtPSNR, a custom metric the authors designed. The idea: generate two videos from the same starting frame. One uses actions inferred from a ground-truth video (so it should reproduce that video). The other uses random actions. If actions actually matter, the two videos should diverge significantly, and the first should stay much closer to the ground truth:
ΔtPSNR = PSNR(xt, x̂t) − PSNR(xt, x̂'t)
Here xt is the ground-truth frame at time t, x̂t is the frame generated with ground-truth-inferred actions, and x̂'t is the frame generated with random actions. A higher ΔtPSNR means actions have more effect, that is, the model is more controllable.
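A minimal sketch of how this difference could be computed for one time step, assuming frames are flat lists of 8-bit pixel values. The function names and toy frames are made up for illustration; the evaluation code used by the authors is not shown in this article.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames,
    given as flat lists of pixel values."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def delta_psnr(ground_truth, from_inferred_actions, from_random_actions):
    """PSNR against the inferred-action frame minus PSNR against the
    random-action frame; larger means actions matter more."""
    return (psnr(ground_truth, from_inferred_actions)
            - psnr(ground_truth, from_random_actions))

# Toy 4-pixel frames (made-up values).
x_t        = [100, 120, 140, 160]
x_hat      = [101, 119, 141, 159]   # close to the ground truth (MSE = 1)
x_hat_rand = [110, 110, 150, 150]   # further away (MSE = 100)

print(round(delta_psnr(x_t, x_hat, x_hat_rand), 1))  # prints: 20.0
```

In the toy case the random-action frame has 100x the mean squared error, which corresponds to a 20 dB gap. A model that ignored its actions would produce similar frames in both cases and score near zero.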
The dynamics model shows clean scaling behavior:
Three tokenizer architectures compared at similar parameter count:
Genie is not just a video generator — the authors position it as a foundation world model. The learned dynamics and action space can bootstrap RL agents, even in environments the model has never seen.
The experiment: can an RL agent learn useful behaviors by practicing inside Genie-generated environments, then transfer to a real game?
Traditional agent training requires a simulator — you can't train an RL agent to play a game without the game itself. Genie opens a different path: train the agent in a generated world. The foundation world model provides:
This is the vision of a generalist agent: one that can navigate new 2D worlds because it was trained inside a model that has seen a vast variety of them.
Genie sits at a fascinating intersection of several research threads. Understanding these connections reveals why this paper matters beyond platformer games.
Ha & Schmidhuber (2018), Dreamer (Hafner et al., 2020-2023), IRIS (Micheli et al., 2023): prior world models learn dynamics from interaction, requiring action labels. Genie's contribution is eliminating this requirement entirely, scaling to internet-sized data.
Sora (OpenAI, 2024): generates stunning videos from text, but they're not interactive — the user can't control what happens frame-by-frame. Genie trades visual quality for controllability. A hybrid approach (Sora-quality rendering + Genie-style action conditioning) is a natural next step.
GameNGen (Google, 2024): generates real-time DOOM gameplay conditioned on actual game actions. Unlike Genie, it requires action labels from a specific game. Genie is more general (any platformer, any image) but lower fidelity.
The robotics experiments show Genie's approach isn't limited to games. Learning action spaces from robot manipulation video — without teleoperation labels — could eventually provide "imagination" for robots to plan in. UniSim (Yang et al., 2023) explores a similar direction but requires text labels.
The LAM is conceptually related to contrastive learning and self-supervised representation learning, which also discover structure from unlabeled data. But instead of learning static features, it discovers the action space that governs temporal dynamics. That makes it a distinct kind of self-supervised objective, one aimed at dynamics rather than appearance.
Genie demonstrates the foundation model paradigm for world models: train once on diverse data, then specialize with minimal labeled data. Just as GPT learns language from unlabeled text and then adapts with few-shot prompting, Genie learns world dynamics from unlabeled video and then adapts with ~200 labeled action samples.
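One simple way a small labeled set could connect latent codes to real controller inputs is a majority vote per code, sketched below. This is an illustration, not the authors' procedure: the voting rule, the function name `build_action_map`, and the sample data are all assumptions.

```python
from collections import Counter, defaultdict

def build_action_map(labeled_pairs):
    """Map each latent code to the real action it most often co-occurs with.
    labeled_pairs: iterable of (latent_code, real_action) from the small
    labeled set. Majority voting is an illustrative choice."""
    votes = defaultdict(Counter)
    for latent_code, real_action in labeled_pairs:
        votes[latent_code][real_action] += 1
    return {code: counts.most_common(1)[0][0] for code, counts in votes.items()}

# Made-up labeled samples: (latent code inferred by the LAM, true game action).
samples = [(0, "right"), (0, "right"), (0, "jump"),
           (2, "jump"), (2, "jump"),
           (5, "left")]
action_map = build_action_map(samples)

# A policy that outputs latent codes can now be executed in the real game.
latent_plan = [0, 2, 5]
print([action_map[code] for code in latent_plan])  # prints: ['right', 'jump', 'left']
```

Because the latent vocabulary has only 8 entries, a few hundred labeled samples are enough to see each code many times.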