How Sora-style models conjure coherent video from noise — by compressing video into a spacetime latent, chopping it into patches that span space and time, and letting a diffusion transformer denoise the whole clip at once.
You have a great image generator. Obvious idea for video: generate the frames one at a time. Run it 30 times, play them in sequence. The result is a flickering nightmare — the dog’s fur changes texture every frame, a coffee cup teleports across the table, colors strobe. Each frame is individually plausible, but they don’t agree with each other. The hard part of video isn’t drawing one good frame; it’s drawing frames that are consistent through time.
Video adds a brutal new axis: temporal coherence. Objects must persist (object permanence), motion must be smooth and physical, lighting must stay stable. A per-frame generator can’t deliver this because it never sees the other frames. The Sora-style answer (OpenAI, 2024, and a whole family around it): treat a video as one object in space and time, and generate the entire clip together so the frames are coherent by construction.
An object moving across frames. Per-frame (orange) places it inconsistently — it jitters and jumps. Spacetime (teal) keeps it smooth and persistent. Slide through the frames and watch the difference.
The engine is diffusion. Quick recap: take a clean image, gradually add noise until it’s pure static, and train a network to reverse one step of that — to predict and remove a little noise. At generation time, start from pure noise and denoise repeatedly; a coherent image emerges from the static. (For the full story, see the Diffusion lesson.)
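The noising half of that recap fits in a few lines. The sketch below assumes a DDPM-style variance-preserving blend, which this lesson does not specify; `add_noise`, `recover_clean`, and the `alpha_bar` value are illustrative names and numbers, not part of any particular library.

```python
import numpy as np

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend a clean sample with Gaussian noise.

    alpha_bar in (0, 1]: 1.0 means no noise, values near 0 mean almost pure static.
    (Assumed DDPM-style blend; the lesson does not fix a schedule.)
    """
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

def recover_clean(xt, eps_pred, alpha_bar):
    """Invert the blend, given a prediction of the noise that was added."""
    return (xt - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))         # stand-in for a clean image
xt, eps = add_noise(x0, alpha_bar=0.3, rng=rng)   # noisy version of x0
x0_hat = recover_clean(xt, eps, alpha_bar=0.3)    # a perfect noise prediction recovers x0
```

In training, the network sees `xt` and is scored on how close its output is to `eps`; here the true `eps` stands in for a perfect prediction.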
Two refinements make it practical, and both carry over to video. First, latent diffusion: don’t denoise in raw pixel space (expensive) — first compress the image with an autoencoder into a small latent, denoise there, then decode back to pixels. Far cheaper. Second, the denoiser can be a transformer instead of a U-Net — the Diffusion Transformer (DiT): patchify the latent, run a transformer to predict the noise. Video generation is these two ideas, extended from a 2-D image to a 3-D spacetime volume.
Drag the denoising progress: pure noise (left) is gradually cleaned into a coherent picture (right). Video does this same march — but on a whole clip at once.
For video, the first step is a video autoencoder that compresses both space and time. An image autoencoder shrinks height and width. A video autoencoder also shrinks the number of frames — it merges nearby frames into a smaller set of latent “time-slices.” A clip of, say, 64 frames at high resolution becomes a compact 3-D grid of latents: fewer time-slices, much smaller spatial size, but more channels.
Why compress time too? Because adjacent video frames are enormously redundant: most of the pixels barely change frame to frame. Throwing away that temporal redundancy makes the latent dramatically smaller, so the expensive diffusion transformer operates on a manageable volume instead of millions of raw pixels per second. The decoder later expands the denoised latent back into full-resolution, full-framerate video.
A raw video volume (left, many frames, high-res) is encoded into a small latent volume (right, fewer time-slices, smaller spatial grid). Drag the compression factor and watch the loaf shrink in space and time together.
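The size of that shrinkage is simple arithmetic. The sketch below uses assumed compression factors (4× in time, 8× in each spatial dimension, 16 latent channels); these numbers are illustrative only and are not published figures for any specific model.

```python
from math import ceil, prod

def latent_shape(frames, height, width, t_down=4, s_down=8, channels=16):
    """Latent volume shape as (time_slices, height, width, channels).

    t_down, s_down and channels are assumed values for illustration only.
    """
    return (ceil(frames / t_down), ceil(height / s_down), ceil(width / s_down), channels)

raw_shape = (64, 512, 512, 3)                 # frames, height, width, RGB
lat_shape = latent_shape(64, 512, 512)        # fewer time-slices, smaller grid, more channels
shrink = prod(raw_shape) / prod(lat_shape)    # how many times fewer numbers to denoise

print(lat_shape, shrink)
```

Under these assumptions a 64-frame 512×512 clip becomes a 16×64×64×16 latent, 48 times fewer values for the diffusion transformer to handle.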
A transformer needs a sequence of tokens. So we chop the latent video volume into spacetime patches — little cubes that span a small region of space and a few latent time-slices (sometimes called “tubelets”). Each cube becomes one token. The whole clip is now a sequence of spacetime patches, exactly the way a Vision Transformer turns an image into a sequence of 2-D patches — just with a time dimension added.
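Cutting a latent volume into cubes is a reshape and a transpose. The following is a minimal sketch, assuming a channels-last `(T, H, W, C)` latent and a 2×2×2 patch size; the function name `patchify` and the sizes are hypothetical, chosen only to keep the example small.

```python
import numpy as np

def patchify(latent, pt=2, ph=2, pw=2):
    """Split a (T, H, W, C) latent into spacetime patches, one token per cube.

    Returns an array of shape (num_tokens, pt * ph * pw * C).
    Patch sizes are illustrative; T, H and W must be divisible by them.
    """
    T, H, W, C = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)      # patch-grid axes first, then within-patch axes
    return x.reshape(-1, pt * ph * pw * C)    # flatten each cube into one token vector

latent = np.arange(4 * 4 * 4 * 3, dtype=np.float32).reshape(4, 4, 4, 3)
tokens = patchify(latent)                     # 2*2*2 = 8 tokens, each holding 2*2*2*3 = 24 values
```

A real model would follow this with a learned linear projection and positional information for each token; both are omitted here.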
This patch view buys a superpower: variable resolution, duration, and aspect ratio. A longer video is simply more patches; a higher-resolution one is more patches per frame; a portrait clip has a different patch layout than a landscape one. The same model handles all of them, because it just sees a sequence of patches of whatever length the clip produces. Sora trained on video and images of many shapes and lengths this way — an image is just a one-frame video, a single layer of patches.
The latent volume is divided into spacetime patches (cubes spanning space and a few frames). Each cube is one token. Drag the clip length — more time means more patches, same model. That’s how one model makes clips of any duration.
Now the denoiser. Take the noisy sequence of spacetime patches and run a transformer that predicts the noise to remove — a Diffusion Transformer, or DiT, scaled to video. Its self-attention runs over all the patches: across space within a frame, and crucially across time between frames. A patch of the dog’s ear in frame 5 can attend to the dog’s ear in frames 1 and 30.
That global spacetime attention is exactly what kills the flicker. Because every patch sees every other patch — including its past and future selves — the model generates a clip where the dog stays the same dog, the cup stays on the table, and motion flows smoothly. Coherence isn’t bolted on afterward; it’s a direct consequence of letting attention span time. The timestep (how noisy we are) and the text prompt are injected into every block to condition the denoising.
Click a patch: lines show it attending to patches in the same frame (space) and in other frames (time). That cross-time attention is what makes objects persist. Per-frame models lack the horizontal (time) lines entirely.
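To make the contrast with a per-frame model concrete, the sketch below computes attention weights twice over the same tokens: once unrestricted, and once with a mask that blocks every cross-frame pair. It is a toy under stated assumptions: queries and keys are the raw token vectors (a real DiT uses learned projections and multiple heads), and the sizes are arbitrary.

```python
import numpy as np

def attention_weights(q, k, mask=None):
    """Softmax attention weights; mask[i, j] == False blocks token i from seeing token j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

frames, patches_per_frame, dim = 3, 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((frames * patches_per_frame, dim))  # one row per spacetime token

frame_id = np.repeat(np.arange(frames), patches_per_frame)  # frame index of each token
same_frame = frame_id[:, None] == frame_id[None, :]

full = attention_weights(x, x)                    # spacetime: every token sees every token
per_frame = attention_weights(x, x, same_frame)   # per-frame: cross-frame weights forced to 0

cross_full = full[~same_frame].sum()              # attention mass flowing between frames
cross_per_frame = per_frame[~same_frame].sum()    # exactly zero under the mask
```

The masked variant has no path for information to travel between frames, which is the situation of a generator that draws each frame independently.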
Because the model denoises the whole clip with attention spanning time, it learns to enforce the rules that make video look real: object permanence (a thing that leaves frame and returns is the same thing), consistent motion (velocities are smooth, not teleporting), and stable identity (the character’s face doesn’t morph). None of these are hand-coded; they’re learned from watching enormous amounts of real video, where these rules hold almost everywhere.
At sufficient scale, something striking emerges: the model begins to behave like a crude world model. It maintains rough 3-D consistency as the camera moves, occludes and re-reveals objects correctly, and even approximates simple physics — not because anyone programmed a physics engine, but because predicting realistic video requires an implicit grasp of how the world behaves. This is why video generation is seen as a path toward world models for planning and robotics, not just a content tool.
An object passes behind an occluder and re-emerges. The spacetime model (teal) keeps it the same object with consistent motion; a per-frame model (orange) loses it or spawns a different one. Step through and watch.
To generate “a corgi surfing a wave at sunset,” the prompt must steer the denoising. The text is encoded (by a language/text model) into embeddings, and those are injected into every DiT block — typically via cross-attention, so each spacetime patch can ask “what does the prompt say should be here?” The denoiser is now conditioned: it removes noise toward a clip matching the text.
A crucial trick is classifier-free guidance. During training the model sometimes sees the prompt and sometimes doesn’t. At generation, you run it both ways and extrapolate away from the unconditioned prediction toward the conditioned one — amplifying the prompt’s influence. A guidance scale dials this: low guidance = more diverse but loosely on-prompt; high guidance = tightly on-prompt but less varied (and, too high, over-saturated and unnatural). Sora-style systems also re-caption training videos with a vision-language model to get rich, detailed captions — better captions teach tighter text control.
Low guidance: varied but drifts off-prompt. High guidance: nails the prompt but collapses variety (and over-cooks past a point). Drag the scale to find the sweet spot.
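The extrapolation at the heart of classifier-free guidance is a one-line formula. The sketch below uses toy two-element "predictions" and a hypothetical function name, `guided_noise`. Under the convention assumed here a scale of 1 gives the plain conditioned prediction; some write-ups use a weight offset by one.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditioned prediction toward
    (and, for scale > 1, past) the conditioned one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # toy prediction without the prompt
eps_cond = np.array([1.0, 1.0])     # toy prediction with the prompt

no_prompt = guided_noise(eps_uncond, eps_cond, 0.0)   # equals eps_uncond
plain = guided_noise(eps_uncond, eps_cond, 1.0)       # equals eps_cond
pushed = guided_noise(eps_uncond, eps_cond, 3.0)      # extrapolates to [3.0, 1.0]
```

Only the component where the two predictions disagree is amplified, which is why a high scale strengthens the prompt and, pushed far enough, distorts the result.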
The full pipeline: start with a pure-noise spacetime latent, run the diffusion transformer for several denoising steps (each one attending across space and time), and decode the cleaned latent into a little animation. Control the number of steps, the guidance, and the clip length (patch count). Watch coherent motion emerge from static.
Press Generate: a noisy spacetime grid is denoised step by step into a coherent moving scene (a ball arcs across frames). More steps = cleaner; higher guidance = more on-prompt; longer clip = more patches/cost. The readout shows steps and patch count.
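The shape of that generation loop can be sketched without a trained network. Everything here is an assumption made for illustration: `oracle_eps` is a stand-in that cheats by knowing the clean latent it should head toward, the linear `alpha_bars` schedule is arbitrary, and the update is a deterministic DDIM-style step, a sampler this lesson does not name. Decoding the latent back to pixels is omitted.

```python
import numpy as np

def oracle_eps(xt, alpha_bar, target):
    """Stand-in for the trained DiT: the exact noise if the clean latent were `target`."""
    return (xt - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)

def sample(cond_target, uncond_target, steps=10, guidance=1.0, seed=0):
    """Denoise a pure-noise spacetime latent in `steps` guided updates."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(cond_target.shape)        # start from pure noise
    alpha_bars = np.linspace(0.01, 1.0, steps + 1)    # noisy -> clean (assumed schedule)
    for ab, ab_next in zip(alpha_bars[:-1], alpha_bars[1:]):
        eps_c = oracle_eps(x, ab, cond_target)        # prediction "with the prompt"
        eps_u = oracle_eps(x, ab, uncond_target)      # prediction "without the prompt"
        eps = eps_u + guidance * (eps_c - eps_u)      # classifier-free guidance
        x0_hat = (x - np.sqrt(1.0 - ab) * eps) / np.sqrt(ab)   # current clean estimate
        x = np.sqrt(ab_next) * x0_hat + np.sqrt(1.0 - ab_next) * eps  # step to lower noise
    return x

clip = np.random.default_rng(1).uniform(-1.0, 1.0, size=(4, 4, 4, 2))  # toy (T, H, W, C) latent
blank = np.zeros_like(clip)                                            # toy unconditioned target
out = sample(clip, blank, steps=8, guidance=1.0)      # lands on the conditioned target
strong = sample(clip, blank, steps=8, guidance=2.0)   # overshoots to twice the target
```

Because the oracle is exact, this toy converges regardless of step count; with a learned denoiser each step is only approximately right, which is why more steps tend to give cleaner results.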
Every piece is here: the noisy latent (compressed spacetime), the patch sequence, the denoising march, and the decode to pixels — producing motion that’s smooth because the denoiser saw the whole clip at once. That’s the difference between video generation and a flipbook of independent images.
The family: Sora (latent spacetime DiT, variable shape/length) is the flagship of the scale-it-up approach. Stable Video Diffusion and others animate from a start image. Earlier work (Imagen Video, Make-A-Video) used cascaded diffusion U-Nets with added temporal layers. The trend is clear: toward unified spacetime transformers and toward video models as world simulators, not just clip generators.
Generation cost vs. clip length. It climbs fast because the patch count grows with both resolution and duration, and self-attention compares every patch with every other, so attention cost grows roughly with the square of the patch count. Latent compression (teal) keeps it far below raw-pixel diffusion (orange).
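A back-of-the-envelope count shows why cost climbs so quickly with clip length and why latent compression helps. The compression factors (4× in time, 8× in space) and the 1×2×2 patch size below are assumptions for illustration, and `attention_pairs` counts only token-to-token comparisons, ignoring every other cost.

```python
def token_count(frames, height, width, t_down=1, s_down=1, patch=(1, 2, 2)):
    """Number of spacetime tokens after optional compression and patching."""
    pt, ph, pw = patch
    return (frames // (t_down * pt)) * (height // (s_down * ph)) * (width // (s_down * pw))

def attention_pairs(tokens):
    """Full self-attention compares every token with every token."""
    return tokens * tokens

raw = token_count(64, 256, 256)                           # patches cut straight from pixels
lat = token_count(64, 256, 256, t_down=4, s_down=8)       # patches cut from the latent
lat_2x = token_count(128, 256, 256, t_down=4, s_down=8)   # same setup, clip twice as long

print(raw, lat, attention_pairs(raw) // attention_pairs(lat))
print(attention_pairs(lat_2x) / attention_pairs(lat))
```

With these numbers the raw clip yields 1,048,576 tokens against 4,096 in the latent, a 65,536-fold gap in attention pairs, and doubling the clip length quadruples the pair count.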
| Piece | Role |
|---|---|
| Video autoencoder | compress space & time → small latent |
| Spacetime patches | tokenize the latent; enable variable shape/length |
| Diffusion Transformer | denoise with attention across space AND time → coherence |
| Cross-attention + CFG | text conditioning and prompt strength |
| Scale | emergent world-model behavior (permanence, rough physics) |
→ Diffusion — the denoising engine underneath
→ Diffusion Transformers (DiT) — the transformer denoiser, in detail
→ Flow Matching — the modern, faster training objective
→ VAE / VQ-VAE — the autoencoder that builds the latent
→ World Models — where video generation is heading