Video Generation

How Sora-style models conjure coherent video from noise — by compressing video into a spacetime latent, chopping it into patches that span space and time, and letting a diffusion transformer denoise the whole clip at once.

Prerequisites: diffusion turns noise into an image by repeated denoising, and a transformer processes a sequence of patches. That’s it.

Chapter 0: The Flicker Problem

You have a great image generator. Obvious idea for video: generate the frames one at a time. Run it 30 times, play them in sequence. The result is a flickering nightmare — the dog’s fur changes texture every frame, a coffee cup teleports across the table, colors strobe. Each frame is individually plausible, but they don’t agree with each other. The hard part of video isn’t drawing one good frame; it’s drawing frames that are consistent through time.

Video adds a brutal new axis: temporal coherence. Objects must persist (object permanence), motion must be smooth and physical, lighting must stay stable. A per-frame generator can’t deliver this because it never sees the other frames. The Sora-style answer (OpenAI, 2024, and a whole family around it): treat a video as one object in space and time, and generate the entire clip together so the frames are coherent by construction.

The trap: “video is just images in a row, so a good image model is enough.” That gets you flicker. Coherence requires the model to attend across time while it generates — every patch of every frame influenced by every other. The breakthrough is treating the whole spacetime volume as the thing to denoise, not a stack of independent pictures.

Per-frame vs. spacetime generation

An object moving across frames. Per-frame (orange) places it inconsistently — it jitters and jumps. Spacetime (teal) keeps it smooth and persistent. Slide through the frames and watch the difference.

Why does generating video frame-by-frame with an image model produce flicker?

(a) Image models are too low resolution
(b) Each frame is generated independently, so they don’t agree over time: no temporal coherence
(c) Video files compress too much

Chapter 1: Diffusion, Briefly

The engine is diffusion. Quick recap: take a clean image, gradually add noise until it’s pure static, and train a network to reverse one step of that — to predict and remove a little noise. At generation time, start from pure noise and denoise repeatedly; a coherent image emerges from the static. (For the full story, see the Diffusion lesson.)
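
The recap above can be sketched in a few lines. This is a minimal illustration, not any particular model’s recipe: the cosine-shaped schedule `alpha_bar` and the helper names are assumptions made for the example, and the "network" is replaced by the true noise so you can see that a perfect noise prediction recovers the clean signal exactly.

```python
import math
import random

def alpha_bar(t, num_steps):
    """Assumed cosine-shaped schedule: signal fraction 1 at t=0 (clean), ~0 at t=num_steps."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2

def add_noise(x0, t, num_steps, rng):
    """Forward process: blend the clean values with Gaussian noise at noise level t."""
    a = alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def recover_x0(xt, eps_pred, t, num_steps):
    """Invert the blend given a noise prediction (the part a trained network would supply)."""
    a = alpha_bar(t, num_steps)
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(xt, eps_pred)]

rng = random.Random(0)
x0 = [0.2, -0.5, 0.9, 0.0]                      # a tiny stand-in for an image
xt, eps = add_noise(x0, t=30, num_steps=50, rng=rng)
x0_hat = recover_x0(xt, eps, t=30, num_steps=50)  # using the true noise as the "prediction"
```

Training amounts to making a network’s `eps_pred` match `eps`; generation runs the inversion repeatedly from high noise levels down to zero.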

Two refinements make it practical, and both carry over to video. First, latent diffusion: don’t denoise in raw pixel space (expensive) — first compress the image with an autoencoder into a small latent, denoise there, then decode back to pixels. Far cheaper. Second, the denoiser can be a transformer instead of a U-Net — the Diffusion Transformer (DiT): patchify the latent, run a transformer to predict the noise. Video generation is these two ideas, extended from a 2-D image to a 3-D spacetime volume.

Denoising: noise → image

Drag the denoising progress: pure noise (left) is gradually cleaned into a coherent picture (right). Video does this same march — but on a whole clip at once.

What is “latent diffusion”?

(a) Denoising directly in raw pixel space
(b) Compressing to a small latent with an autoencoder, denoising there, then decoding to pixels, which is much cheaper
(c) Hiding the model weights

Chapter 2: The Spacetime Latent

For video, the first step is a video autoencoder that compresses both space and time. An image autoencoder shrinks height and width. A video autoencoder also shrinks the number of frames — it merges nearby frames into a smaller set of latent “time-slices.” A clip of, say, 64 frames at high resolution becomes a compact 3-D grid of latents: fewer time-slices, much smaller spatial size, but more channels.

Why compress time too? Because adjacent video frames are enormously redundant: most of the pixels barely change frame to frame. Throwing away that temporal redundancy makes the latent dramatically smaller, so the expensive diffusion transformer operates on a manageable volume instead of millions of raw pixels per second. The decoder later expands the denoised latent back into full-resolution, full-framerate video.

Concept → realization: picture the video as a 3-D loaf — width, height, and time. The autoencoder shrinks the loaf in all three dimensions at once into a small latent loaf. All the generation happens on the small loaf; only at the very end is it baked back out to full pixels. Space and time are treated symmetrically — that symmetry is the whole design philosophy.
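
To put rough numbers on the shrinking loaf, here is a small shape calculation. The downsampling factors (4× in time, 8× in each spatial dimension) and the 16 latent channels are illustrative assumptions, not published figures for Sora or any specific autoencoder.

```python
def latent_shape(frames, height, width, t_down=4, s_down=8, channels=16):
    """Shape (time-slices, height, width, channels) of the latent for a raw clip.

    t_down, s_down and channels are assumed values chosen for illustration.
    """
    if frames % t_down or height % s_down or width % s_down:
        raise ValueError("clip dimensions must be divisible by the downsampling factors")
    return (frames // t_down, height // s_down, width // s_down, channels)

def num_values(shape):
    """Total number of scalars in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

raw = (64, 512, 512, 3)                 # 64 RGB frames at 512x512
latent = latent_shape(64, 512, 512)
ratio = num_values(raw) / num_values(latent)
print(latent, ratio)                    # (16, 64, 64, 16) 48.0
```

Under these assumptions the latent holds 48× fewer numbers than the raw clip, even though it has more channels per position.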

Compressing the spacetime volume

A raw video volume (left, many frames, high-res) is encoded into a small latent volume (right, fewer time-slices, smaller spatial grid). Drag the compression factor and watch the loaf shrink in space and time together.

Why does a video autoencoder compress the time dimension, not just space?

(a) To make the video play faster
(b) Adjacent frames are highly redundant; compressing time shrinks the latent so the diffusion transformer can afford to process it
(c) Time can’t be represented otherwise

Chapter 3: Spacetime Patches

A transformer needs a sequence of tokens. So we chop the latent video volume into spacetime patches — little cubes that span a small region of space and a few latent time-slices (sometimes called “tubelets”). Each cube becomes one token. The whole clip is now a sequence of spacetime patches, exactly the way a Vision Transformer turns an image into a sequence of 2-D patches — just with a time dimension added.

This patch view buys a superpower: variable resolution, duration, and aspect ratio. A longer video is simply more patches; a higher-resolution one is more patches per frame; a portrait clip has a different patch layout than a landscape one. The same model handles all of them, because it just sees a sequence of patches of whatever length the clip produces. Sora trained on video and images of many shapes and lengths this way — an image is just a one-frame video, a single layer of patches.
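
The chopping step can be sketched directly. This toy version uses a single-channel latent stored as nested lists, and the 2×2×2 cube size is an assumption picked for the example; real systems work on multi-channel tensors and then project each cube through a learned linear layer.

```python
def patchify(latent, pt, ph, pw):
    """Split a latent[t][y][x] volume into spacetime cubes, one flat token per cube."""
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                tokens.append([latent[t][y][x]
                               for t in range(t0, t0 + pt)
                               for y in range(y0, y0 + ph)
                               for x in range(x0, x0 + pw)])
    return tokens

def make_latent(T, H, W):
    """Toy latent whose values encode their own position, so tokens are easy to inspect."""
    return [[[t * H * W + y * W + x for x in range(W)] for y in range(H)] for t in range(T)]

short_clip = patchify(make_latent(4, 8, 8), pt=2, ph=2, pw=2)   # 2*4*4 = 32 tokens
long_clip = patchify(make_latent(8, 8, 8), pt=2, ph=2, pw=2)    # 4*4*4 = 64 tokens
portrait = patchify(make_latent(4, 12, 6), pt=2, ph=2, pw=2)    # 2*6*3 = 36 tokens
```

All three clips yield tokens of the same size (8 values each); only the number of tokens changes, which is why one transformer can consume any of them.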

Chopping spacetime into patch-cubes

The latent volume is divided into spacetime patches (cubes spanning space and a few frames). Each cube is one token. Drag the clip length — more time means more patches, same model. That’s how one model makes clips of any duration.

What does turning video into spacetime patches enable?

(a) Only fixed 256×256, 16-frame clips
(b) Variable resolution, duration, and aspect ratio: a clip is just a sequence of however many patches it produces
(c) Removing the need for a transformer

Chapter 4: The Diffusion Transformer

Now the denoiser. Take the noisy sequence of spacetime patches and run a transformer that predicts the noise to remove — a Diffusion Transformer, or DiT, scaled to video. Its self-attention runs over all the patches: across space within a frame, and crucially across time between frames. A patch of the dog’s ear in frame 5 can attend to the dog’s ear in frames 1 and 30.

That global spacetime attention is exactly what kills the flicker. Because every patch sees every other patch — including its past and future selves — the model generates a clip where the dog stays the same dog, the cup stays on the table, and motion flows smoothly. Coherence isn’t bolted on afterward; it’s a direct consequence of letting attention span time. The timestep (how noisy we are) and the text prompt are injected into every block to condition the denoising.
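
A tiny attention calculation makes the contrast with per-frame generation concrete. This is a sketch under simplifying assumptions: single-head attention, no learned projections, and four hand-picked 2-D tokens. The `per_frame_only` flag is a hypothetical switch that masks out other frames, mimicking a model that never looks across time.

```python
import math

def attention_weights(queries, keys, frame_ids, per_frame_only=False):
    """Softmax attention weights; optionally block attention to tokens in other frames."""
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if per_frame_only and frame_ids[i] != frame_ids[j]:
                scores.append(float("-inf"))          # masked: contributes weight 0
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Two patches in frame 0 and two in frame 1 (toy features).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 0.9]]
frame_ids = [0, 0, 1, 1]

full = attention_weights(tokens, tokens, frame_ids)
masked = attention_weights(tokens, tokens, frame_ids, per_frame_only=True)

# How much of patch 0's attention lands on the other frame?
cross_time_full = sum(w for w, f in zip(full[0], frame_ids) if f != 0)
cross_time_masked = sum(w for w, f in zip(masked[0], frame_ids) if f != 0)
```

With the mask on, `cross_time_masked` is exactly zero: the patch has no channel through which to agree with its counterpart in the next frame. With full spacetime attention, some weight always flows across frames.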

noisy spacetime patches (sequence of cubes + noise)
↓ DiT: self-attention over space AND time
predicted noise (per patch, coherent across the whole clip)
↓ subtract, repeat many steps
clean latent video (decode → pixels)

Spacetime attention

Click a patch: lines show it attending to patches in the same frame (space) and in other frames (time). That cross-time attention is what makes objects persist. Per-frame models lack the horizontal (time) lines entirely.

What gives a video diffusion transformer its temporal coherence?

(a) Generating frames in reverse order
(b) Self-attention that spans time: each patch attends to patches in other frames, so objects stay consistent
(c) A higher frame rate

Chapter 5: Coherence & Emergent Physics

Because the model denoises the whole clip with attention spanning time, it learns to enforce the rules that make video look real: object permanence (a thing that leaves frame and returns is the same thing), consistent motion (velocities are smooth, not teleporting), and stable identity (the character’s face doesn’t morph). None of these are hand-coded; they’re learned from watching enormous amounts of real video, where these rules almost always hold.

At sufficient scale, something striking emerges: the model begins to behave like a crude world model. It maintains rough 3-D consistency as the camera moves, occludes and re-reveals objects correctly, and even approximates simple physics — not because anyone programmed a physics engine, but because predicting realistic video requires an implicit grasp of how the world behaves. This is why video generation is seen as a path toward world models for planning and robotics, not just a content tool.

Why coherence comes “for free”: the training objective is “denoise real video.” Real video is temporally coherent and roughly physical. So to lower its loss, the model must internalize permanence, smooth motion, and 3-D structure — the coherence is a side effect of accurately modeling the data, the same way Whisper’s robustness fell out of diverse data.

Object permanence across frames

An object passes behind an occluder and re-emerges. The spacetime model (teal) keeps it the same object with consistent motion; a per-frame model (orange) loses it or spawns a different one. Step through and watch.

Why do world-model behaviors (object permanence, rough physics) emerge in large video generators?

(a) A physics engine is built into the architecture
(b) Accurately modeling real video requires implicitly capturing permanence, motion, and 3-D structure, so they emerge from the denoising objective at scale
(c) They are scripted in post-processing

Chapter 6: Text Conditioning & Guidance

To generate “a corgi surfing a wave at sunset,” the prompt must steer the denoising. The text is encoded (by a language/text model) into embeddings, and those are injected into every DiT block — typically via cross-attention, so each spacetime patch can ask “what does the prompt say should be here?” The denoiser is now conditioned: it removes noise toward a clip matching the text.
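
Here is a minimal sketch of that "ask the prompt" step. It assumes single-head cross-attention with no learned projections, and the two text tokens and their vectors are made up for the example; a real DiT block would first project patches and text embeddings into query, key, and value spaces.

```python
import math

def cross_attention(patch_queries, text_keys, text_values):
    """Each patch query receives a softmax-weighted average of the text value vectors."""
    d = len(patch_queries[0])
    outputs = []
    for q in patch_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text_keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, text_values))
                        for i in range(len(text_values[0]))])
    return outputs

# Hypothetical embeddings for two prompt tokens, e.g. "corgi" and "wave".
text_keys = [[4.0, 0.0], [0.0, 4.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# One patch whose query matches "corgi", one whose query matches "wave".
patch_queries = [[4.0, 0.0], [0.0, 4.0]]
out = cross_attention(patch_queries, text_keys, text_values)
```

Each patch pulls in mostly the value vector of the prompt token it matches, which is how different regions of the clip can be steered by different words.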

A crucial trick is classifier-free guidance. During training the model sometimes sees the prompt and sometimes doesn’t. At generation, you run it both ways and extrapolate away from the unconditioned prediction toward the conditioned one — amplifying the prompt’s influence. A guidance scale dials this: low guidance = more diverse but loosely on-prompt; high guidance = tightly on-prompt but less varied (and, too high, over-saturated and unnatural). Sora-style systems also re-caption training videos with a vision-language model to get rich, detailed captions — better captions teach tighter text control.
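
The extrapolation itself is one line of arithmetic. The sketch below uses the common formulation in which a scale of 0 gives the unconditioned prediction and a scale of 1 gives the plain conditioned one; the function name and the toy noise vectors are assumptions for illustration.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Move from the unconditioned noise prediction toward (and past) the conditioned one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # toy prediction without the prompt
eps_cond = [1.0, 1.0]     # toy prediction with the prompt

no_prompt = classifier_free_guidance(eps_uncond, eps_cond, 0.0)   # [0.0, 1.0]
plain = classifier_free_guidance(eps_uncond, eps_cond, 1.0)       # [1.0, 1.0]
amplified = classifier_free_guidance(eps_uncond, eps_cond, 3.0)   # [3.0, 1.0]
```

Only the component where the two predictions disagree gets amplified; where the prompt makes no difference, guidance changes nothing. Pushing the scale very high exaggerates that difference, which is where the over-saturated look comes from.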

Guidance scale: diversity vs. prompt adherence

Low guidance: varied but drifts off-prompt. High guidance: nails the prompt but collapses variety (and over-cooks past a point). Drag the scale to find the sweet spot.

What does increasing the classifier-free guidance scale do?

(a) Makes generation faster
(b) Strengthens adherence to the text prompt at the cost of diversity (and naturalness if pushed too far)
(c) Adds more frames

Chapter 7: Generating a Clip, Live (showcase)

The full pipeline: start with a pure-noise spacetime latent, run the diffusion transformer for several denoising steps (each one attending across space and time), and decode the cleaned latent into a little animation. Control the number of steps, the guidance, and the clip length (patch count). Watch coherent motion emerge from static.

Spacetime diffusion, end to end

Press Generate: a noisy spacetime grid is denoised step by step into a coherent moving scene (a ball arcs across frames). More steps = cleaner; higher guidance = more on-prompt; longer clip = more patches/cost. The readout shows steps and patch count.

Every piece is here: the noisy latent (compressed spacetime), the patch sequence, the denoising march, and the decode to pixels — producing motion that’s smooth because the denoiser saw the whole clip at once. That’s the difference between video generation and a flipbook of independent images.
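
The shape of the sampling loop can be sketched without a trained model. Everything here is a stand-in and should be read as such: `oracle_denoiser` is a hypothetical replacement for the DiT that is simply told the target clip, the linear `alpha_bar` schedule is an assumption, and the update is a deterministic DDIM-style step. The point is the loop structure, not the quality of the denoiser.

```python
import math
import random

def alpha_bar(t, num_steps):
    """Assumed linear schedule: signal fraction 1.0 at t=0, 0.001 at t=num_steps."""
    return 1.0 - 0.999 * t / num_steps

def ball_clip(frames, width):
    """Toy clean latent: one row per frame, with a 1.0 where the ball is."""
    clip = []
    for f in range(frames):
        pos = round(f * (width - 1) / (frames - 1)) if frames > 1 else 0
        clip.extend(1.0 if x == pos else 0.0 for x in range(width))
    return clip

def oracle_denoiser(xt, t, num_steps, target):
    """Stand-in for the trained DiT: returns the exact noise because it knows the target."""
    a = alpha_bar(t, num_steps)
    return [(x - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for x, x0 in zip(xt, target)]

def generate(target, num_steps, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]          # start from pure noise
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bar(t, num_steps), alpha_bar(t - 1, num_steps)
        eps = oracle_denoiser(x, t, num_steps, target)  # whole clip denoised together
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        # Deterministic (DDIM-style) move to the slightly less noisy level t-1.
        x = [math.sqrt(a_prev) * h + math.sqrt(1.0 - a_prev) * e for h, e in zip(x0_hat, eps)]
    return x

target = ball_clip(frames=10, width=10)
sample = generate(target, num_steps=20)
```

In a real system the oracle is replaced by a transformer call on the patch sequence (with text conditioning and guidance), and the final latent goes through the video decoder to become pixels.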

Chapter 8: Scale, Limits & the Family

Compute is brutal: a spacetime volume is far bigger than an image, and the cost of full self-attention grows quadratically with the number of patches, so video generation is among the most expensive workloads in ML. Latent compression and efficient attention are what make it feasible at all.
Physics is approximate: the model fakes physics by imitation, so it still slips — objects can morph, hands gain fingers, glass “breaks” wrong, causality wobbles. It learned correlations of how video looks, not the actual laws.
Long videos are hard: coherence over many seconds/minutes strains attention and memory; long-form often needs extension/chunking schemes with their own seams.
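
A back-of-the-envelope count shows why the compute point bites. The compression factors (4× time, 8× space) and the 1×2×2 patch size are assumptions for illustration, and "attention pairs" here just means tokens squared, ignoring heads, layers, and any efficient-attention tricks.

```python
def token_count(frames, height, width, t_down, s_down, patch=(1, 2, 2)):
    """Number of spacetime patches after (assumed) compression and patchification."""
    pt, ph, pw = patch
    return (frames // t_down // pt) * (height // s_down // ph) * (width // s_down // pw)

def attention_pairs(tokens):
    """Full self-attention compares every token with every token."""
    return tokens * tokens

latent_64 = token_count(64, 512, 512, t_down=4, s_down=8)     # 16384 tokens
latent_128 = token_count(128, 512, 512, t_down=4, s_down=8)   # 32768 tokens
pixels_64 = token_count(64, 512, 512, t_down=1, s_down=1)     # 4194304 tokens

doubling_cost = attention_pairs(latent_128) / attention_pairs(latent_64)   # 4.0
pixel_penalty = attention_pairs(pixels_64) / attention_pairs(latent_64)    # 65536.0
```

Doubling the clip length quadruples the attention work, and skipping the latent compression would multiply it by tens of thousands under these assumptions.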

The family: Sora (latent spacetime DiT, variable shape/length) is the scale flagship. Stable Video Diffusion and others animate from a start image. Earlier work (Imagen Video, Make-A-Video) used cascaded diffusion U-Nets with added temporal layers. The trend is clear — toward unified spacetime transformers and toward video models as world simulators, not just clip generators.

Cost grows with the spacetime volume

Generation cost vs. clip length — it climbs fast because patches grow with both resolution and duration, and attention is over all of them. Latent compression (teal) keeps it far below raw-pixel diffusion (orange).

Why is latent compression essential for video diffusion?

(a) It improves the color accuracy
(b) The spacetime volume is enormous; compressing it makes attention over all patches affordable
(c) It removes the need for text prompts

Chapter 9: Cheat Sheet & Connections

video (frames × H × W pixels)
↓ video autoencoder (compress space + time)
spacetime latent (small 3-D loaf: fewer time-slices, smaller grid)
↓ patchify into spacetime cubes
patch sequence (variable count → any resolution/duration/aspect)
↓ DiT denoiser (attention over space + time) × steps
clean latent → decode (coherent video; text via cross-attn + guidance)

Piece	Role
Video autoencoder	compress space & time → small latent
Spacetime patches	tokenize the latent; enable variable shape/length
Diffusion Transformer	denoise with attention across space AND time → coherence
Cross-attention + CFG	text conditioning and prompt strength
Scale	emergent world-model behavior (permanence, rough physics)

Keep exploring

→ Diffusion — the denoising engine underneath
→ Diffusion Transformers (DiT) — the transformer denoiser, in detail
→ Flow Matching — the modern, faster training objective
→ VAE / VQ-VAE — the autoencoder that builds the latent
→ World Models — where video generation is heading

“What I cannot create, I do not understand.” You just rebuilt Sora-style video generation: compress video into a spacetime latent, chop it into patches that span space and time, denoise the whole clip with a transformer whose attention crosses frames, and steer it with text. Coherence isn’t patched on — it’s what happens when you generate spacetime all at once.