A diffusion transformer that generates 10-second, 768×1360 videos with coherent narratives, using a 3D VAE with 8×8×4 (height × width × time) compression, expert adaptive LayerNorm for text-video fusion, and progressive training for quality at scale.
Imagine you ask an AI to generate a video: "A bolt of lightning splits a rock, and a person jumps out from inside." The AI needs to produce dozens of frames that are individually photorealistic, temporally coherent (the rock doesn't teleport between frames), and semantically faithful to every detail in your prompt. That's three hard problems stacked on top of each other.
The compute challenge alone is staggering. A 10-second video at 16 fps and 768×1360 resolution is 160 frames × 768 × 1360 pixels × 3 channels ≈ 500 million values. Running a diffusion model in raw pixel space at that scale is intractable.
Previous approaches tried to tame the problem by splitting spatial and temporal processing. They'd apply 2D spatial attention within each frame, then 1D temporal attention across frames. This keeps compute manageable, but it introduces a fatal weakness: temporal attention only links the same spatial location across frames, so a moving object can't directly attend to itself in adjacent frames. If a person's head shifts between frame i and frame i+1, the head patch in frame i+1 can't see the head patch in frame i, and the visual information has to leak through background patches. The result? Flickering, inconsistent motion, and characters that morph between frames.
CogVideoX tackles all three problems with a unified design philosophy: treat video as a first-class 3D signal, not a stack of 2D images.
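The architecture has three interlocking components: a 3D causal VAE that compresses video into latents, a T5 text encoder for the prompt, and an expert transformer that fuses the two with full 3D attention.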
A raw 10-second video at 16 fps and 768×1360 resolution has 160 frames. If you encode each frame independently with a standard 2D VAE (like SDXL's, which does 8× spatial compression per axis), you get 160 × 96 × 170 × 4 ≈ 10 million latent values. That's still enormous for a transformer to chew on.
CogVideoX's 3D Causal VAE compresses along all three dimensions: 8× on each spatial axis (same as SDXL) and 4× on the temporal axis. Those 160 frames become 40 temporal steps, and with 16 latent channels the total is 40 × 96 × 170 × 16 ≈ 10 million latent values, the same count as the 2D case. What changes is where the values live: the latent grid shrinks from 160 × 96 × 170 ≈ 2.6 million positions to 40 × 96 × 170 ≈ 650 thousand, a 4× reduction in the positions the transformer has to attend over, with each position carrying 16 channels instead of 4. Quality improves as well: because temporal neighbors share information during 3D convolution, the latents encode smooth motion rather than independent snapshots, which reduces the frame-to-frame flickering that plagues 2D VAE approaches.
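The arithmetic above is easy to check. The following is a minimal sketch that uses only the numbers quoted in the text (160 frames, 768×1360, 8× spatial and 4× temporal compression, 4 versus 16 latent channels); the helper name `grid_size` is made up for illustration.

```python
def grid_size(frames, height, width, t_factor, s_factor):
    """Number of spatiotemporal positions after downsampling each axis."""
    return (frames // t_factor) * (height // s_factor) * (width // s_factor)

FRAMES, HEIGHT, WIDTH = 160, 768, 1360

# Raw RGB video: one value per pixel per channel.
raw_values = FRAMES * HEIGHT * WIDTH * 3

# Per-frame 2D VAE: 8x spatial, no temporal compression, 4 latent channels.
pos_2d = grid_size(FRAMES, HEIGHT, WIDTH, t_factor=1, s_factor=8)
values_2d = pos_2d * 4

# 3D VAE (variant B): 8x spatial, 4x temporal, 16 latent channels.
pos_3d = grid_size(FRAMES, HEIGHT, WIDTH, t_factor=4, s_factor=8)
values_3d = pos_3d * 16

print(f"raw pixel values : {raw_values:,}")
print(f"2D VAE           : {pos_2d:,} positions, {values_2d:,} values")
print(f"3D VAE           : {pos_3d:,} positions, {values_3d:,} values")
print(f"position ratio   : {pos_2d // pos_3d}x fewer with the 3D VAE")
```

The value counts match because the 4× temporal saving is spent on 4× more channels; the benefit for the transformer is the shorter grid of positions.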
The encoder and decoder are symmetric: stages of ResNet blocks interleaved with downsampling (encoder) or upsampling (decoder) layers. Some stages perform 3D downsampling (spatial + temporal), while others do only 2D (spatial). The temporal compression uses causal convolution: all padding is placed at the beginning of the time axis, so the encoding at time t can depend only on frames at time ≤ t. This lets the VAE handle streaming, and because the first frame depends on nothing after it, the same VAE can also encode still images as single-frame "videos".
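To make the causal padding concrete, here is a minimal sketch of a 1D convolution along the time axis with all padding at the start. It is a toy with one scalar per frame, and it assumes zero padding; the real VAE operates on 3D feature maps and may pad differently. The function name `causal_conv1d` is illustrative only.

```python
def causal_conv1d(signal, kernel):
    """1D convolution along time with all padding placed at the start.

    output[t] is a weighted sum of signal[t - k + 1 .. t], so it never
    reads a future time step.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)  # pad only before the first frame
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]

kernel = [0.25, 0.25, 0.5]
base = causal_conv1d([1, 2, 3, 4, 5], kernel)
changed = causal_conv1d([1, 2, 3, 4, 99], kernel)  # only the last frame differs

# Outputs before the changed frame are untouched: nothing looks ahead.
print(base[:4] == changed[:4])
# The first output depends only on the first frame (plus padding).
print(base[0])
```

With symmetric padding, changing the last frame would also alter the output one step earlier, which is exactly what the causal layout rules out.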
The authors tested multiple compression settings:
| Variant | Compression | Latent Ch. | Flickering ↓ | PSNR ↑ |
|---|---|---|---|---|
| Baseline (SDXL 2D) | 8×8×1 | 4 | 93.2 | 28.4 |
| A | 8×8×4 | 8 | 87.6 | 27.2 |
| B (chosen) | 8×8×4 | 16 | 86.3 | 28.7 |
| C | 8×8×4 | 32 | 87.7 | 30.5 |
| D | 8×8×8 | 32 | 87.8 | 29.0 |
| E | 16×16×8 | 128 | 87.3 | 27.9 |
Moving from 2D to 3D (Baseline → B) cuts flickering from 93.2 to 86.3 while improving PSNR. But pushing compression too far (variant E, 16×16×8) makes convergence extremely difficult — the model can't learn to reconstruct from such a compressed bottleneck.
Once the 3D VAE compresses the video into latent tokens z_vision, and T5 encodes the text prompt into z_text, we need a transformer that can fuse both modalities deeply. CogVideoX's solution is elegantly simple: concatenate the two token sequences and run them through a single transformer stack with full 3D attention, but give each modality its own normalization parameters.
Text embeddings from T5 and video latents from the 3D VAE live in radically different feature spaces, and they may differ in numerical scale: text embeddings might have values clustered around [-2, 2], while video latents might range over [-0.5, 0.5]. If a single set of adaptive LayerNorm parameters is applied to the concatenated sequence, one scale and shift has to serve both distributions, and it ends up tuned to neither.
The solution: split the LayerNorm into two "experts." Following DiT's design, the diffusion timestep t drives an adaptive modulation (scale γ and shift β) via a small MLP. But instead of one set of (γ, β), there are two: a text expert set applied to the text tokens and a vision expert set applied to the video tokens.
After normalization, the tokens are recombined and fed through shared attention and FFN layers. This means the heavy computation (attention, feedforward) is shared, while the lightweight normalization is modality-specific. The parameter overhead is minimal — just two extra sets of (γ, β) per layer.
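The idea can be sketched in a few lines of plain Python. This is a toy illustration, not the model's implementation: the modulation values are passed in as fixed lists instead of being produced by an MLP from the timestep, the `x * (1 + γ) + β` form follows the common DiT convention and is an assumption here, and all function names are hypothetical.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one token to zero mean and unit variance over its channels."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def modulate(tokens, scale, shift):
    """Normalize each token, then apply a per-channel scale and shift."""
    out = []
    for tok in tokens:
        normed = layer_norm(tok)
        out.append([n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)])
    return out

def expert_ada_layer_norm(text_tokens, video_tokens, text_mod, video_mod):
    """Apply modality-specific (scale, shift), then re-concatenate [text ; video]."""
    text_out = modulate(text_tokens, *text_mod)
    video_out = modulate(video_tokens, *video_mod)
    return text_out + video_out

text_tokens = [[2.0, -1.5, 0.5, -1.0]]                        # wider value range
video_tokens = [[0.4, -0.3, 0.1, -0.2], [0.2, 0.1, -0.4, 0.3]]
text_mod = ([0.0] * 4, [1.0] * 4)    # (scale, shift), hypothetical values
video_mod = ([1.0] * 4, [0.0] * 4)   # (scale, shift), hypothetical values

fused = expert_ada_layer_norm(text_tokens, video_tokens, text_mod, video_mod)
print(len(fused), "tokens go on to the shared attention and FFN layers")
```

Only the normalization step branches by modality; everything downstream sees a single sequence.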
With the concatenated [text ; video] sequence, CogVideoX applies full 3D attention: every token attends to every other token, regardless of modality, spatial position, or temporal position. This means a video patch at frame 50, pixel (100, 200) can directly attend to the text token "lightning" and to the video patch at frame 1, pixel (100, 200).
This is computationally expensive — the sequence length is (text tokens) + (T/4 × H/8 × W/8) — but FlashAttention makes it tractable. The payoff is enormous: objects maintain direct line-of-sight to themselves across time, eliminating the information-leaking problem of separated 2D+1D attention.
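A rough count shows how much more work full attention does. The sketch below compares query-key pairs per layer for full 3D attention against a separated spatial-then-temporal design, using the sequence-length formula quoted above and 226 text tokens. It assumes one token per latent position with no further patchification, so absolute numbers for the real model may differ; the function name is made up.

```python
def attention_pairs(t, h, w, text_len=0):
    """Query-key pairs per layer for a t x h x w latent grid."""
    n_video = t * h * w
    # Full 3D attention: every token (text and video) attends to every token.
    full_3d = (n_video + text_len) ** 2
    # Separated design: spatial attention inside each frame, then temporal
    # attention across frames at each fixed spatial location (video only).
    spatial = t * (h * w) ** 2
    temporal = (h * w) * t ** 2
    return full_3d, spatial + temporal

t, h, w = 160 // 4, 768 // 8, 1360 // 8   # T/4, H/8, W/8
full, separated = attention_pairs(t, h, w, text_len=226)

print(f"latent grid     : {t} x {h} x {w}")
print(f"full 3D pairs   : {full:.3e}")
print(f"separated pairs : {separated:.3e}")
print(f"ratio           : {full / separated:.1f}x")
```

The ratio is roughly the number of latent frames, which is the price paid for letting every patch see every other patch directly.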
Each video latent lives at a 3D coordinate (x, y, t). CogVideoX extends standard RoPE to three dimensions by independently applying 1D-RoPE to each axis, allocating 3/8 of the hidden channels to x, 3/8 to y, and 2/8 to t. The results are concatenated along the channel dimension. This captures both spatial locality and temporal ordering with a single relative positional encoding.
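Here is a minimal sketch of that channel split, written in plain Python. It assumes the channel count is a multiple of 16 so that each slice has an even number of channels, rotates adjacent channel pairs, and uses the conventional base of 10000; these details are illustrative assumptions, and the function names are hypothetical.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def rope_3d(vec, x, y, t):
    """Apply 1D RoPE per axis: 3/8 of channels for x, 3/8 for y, 2/8 for t."""
    d = len(vec)
    dx = dy = 3 * d // 8
    return (rope_1d(vec[:dx], x)
            + rope_1d(vec[dx:dx + dy], y)
            + rope_1d(vec[dx + dy:], t))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.1 * (i + 1) for i in range(16)]
k = [0.05 * (16 - i) for i in range(16)]

# Same relative offset (3, 4, 2) at two different absolute positions.
score_a = dot(rope_3d(q, 1, 2, 3), rope_3d(k, 4, 6, 5))
score_b = dot(rope_3d(q, 11, 12, 13), rope_3d(k, 14, 16, 15))
print(abs(score_a - score_b) < 1e-9)
```

The attention score depends only on the offset between two tokens along each axis, which is what makes the encoding relative.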
Getting a video model to faithfully follow a text prompt is harder than it sounds. The prompt "a cat wearing a tiny hat jumps over a sleeping dog" requires the model to bind attributes (tiny hat → cat, sleeping → dog), maintain object identity across frames, and orchestrate a temporal narrative (jumping is not standing). CogVideoX's alignment strategy has two parts: the text encoder and the fusion mechanism.
CogVideoX uses the encoder of T5-XXL (the encoder alone has roughly 4.7B parameters; the full encoder-decoder model has about 11B) to encode text prompts. Unlike CLIP text encoders used in Stable Diffusion (which are trained on short image captions), T5 was pretrained on diverse language tasks and handles complex, multi-sentence prompts much better. The text is encoded into a sequence of 226 tokens (padded/truncated to a fixed length) with 4096-dimensional embeddings.
Why T5 over CLIP? CLIP maxes out at 77 tokens and was trained with contrastive loss on (image, caption) pairs — it captures visual concepts well but struggles with compositional language ("a red ball on top of a blue cube"). T5 handles long, structured prompts because it was trained as a general-purpose language model.
The text embeddings are not injected via cross-attention (as in Stable Diffusion) or via a conditioning mechanism (as in DALL-E). Instead, they're concatenated directly with the video latent tokens along the sequence dimension. This means text tokens participate in every self-attention layer alongside video tokens — there's no information bottleneck.
In cross-attention designs, text information can only influence video through the K/V projections of dedicated cross-attention layers. In CogVideoX's full-attention design, text tokens and video tokens are peers — each text token can attend to every video token and vice versa, at every layer.
The quality of text-video alignment depends critically on the quality of training captions. Most web videos have short, vague descriptions ("funny cat video"). CogVideoX built a dense captioning pipeline to replace them.
This pipeline produces detailed, temporally-aware captions that describe what happens, not just what's visible in a single frame.
Training a 5B-parameter video model at 768×1360 resolution from scratch would require astronomical compute. CogVideoX uses a progressive training curriculum that starts cheap and scales up gradually.
The model trains through multiple stages, each at increasing resolution:
| Stage | Resolution | What it learns |
|---|---|---|
| 1 | 256px (short side) | Semantic understanding, basic motion, object categories |
| 2 | 512px | Mid-frequency details, textures, finer motion |
| 3 | 768px | High-frequency details, sharp edges, realistic textures |
| 4 | Fine-tune | High-quality curated data, aesthetic refinement |
At each resolution, the aspect ratio is preserved (the short side is resized to the target, the long side scales proportionally). This means the model sees diverse aspect ratios throughout training — it doesn't just learn square crops.
Training videos come in all lengths: 2 seconds, 5 seconds, 10 seconds. Fixed-duration training wastes data — you'd have to truncate long videos and discard short ones. CogVideoX's Frame Pack strategy (inspired by Patch'n Pack from NaViT) packs videos of different durations and resolutions into the same batch. Each video is patchified and positioned using 3D-RoPE, so the model naturally handles mixed shapes within a single forward pass.
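One way to picture packing is as a bin-filling problem over token counts. The sketch below is a generic first-fit heuristic under a fixed token budget, not the authors' packing algorithm, which the text does not specify; the clip names, token counts, and budget are all invented for illustration.

```python
def pack_first_fit(clips, budget):
    """Greedily place clips into packs whose token totals stay within budget.

    clips  : list of (name, token_count) pairs
    budget : maximum tokens per packed sequence
    """
    packs = []  # each pack: {"clips": [names], "used": token total}
    # Placing the largest clips first tends to leave fewer gaps.
    for name, n_tokens in sorted(clips, key=lambda c: -c[1]):
        if n_tokens > budget:
            raise ValueError(f"{name} exceeds the token budget")
        for pack in packs:
            if pack["used"] + n_tokens <= budget:
                pack["clips"].append(name)
                pack["used"] += n_tokens
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append({"clips": [name], "used": n_tokens})
    return packs

clips = [("c", 3), ("a", 6), ("d", 3), ("b", 4)]  # hypothetical token counts
packs = pack_first_fit(clips, budget=10)
for pack in packs:
    print(pack["clips"], pack["used"])
```

A real pipeline would also record where each clip starts and ends inside a pack so that attention and positional encodings stay within clip boundaries.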
The 3D VAE is also trained progressively: first on 17-frame videos, then fine-tuned on 161-frame videos using context parallelism. The transformer follows a similar schedule — shorter clips first, then longer ones. This prevents the model from being overwhelmed by the quadratic attention cost of long sequences before it has learned basic visual representations.
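Diffusion models train by sampling a random timestep t for each data point, adding noise at that level, and learning to denoise. The standard training objective is:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\,\right], \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$

where t is uniformly sampled from [1, T], x_0 is a clean training sample, ε is Gaussian noise, and ε_θ is the denoising network. In practice, each GPU rank independently samples a random t. With, say, 64 GPUs and T = 1000 timesteps, each batch only covers 64 random timesteps, and due to randomness, some ranges get oversampled while others get missed entirely.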
The loss magnitude varies substantially across timesteps, because denoising at low noise levels and at high noise levels are very different tasks. When the sampled timesteps cluster unevenly, the aggregate loss fluctuates between batches, not because the model is unstable, but because the timestep distribution is noisy. These fluctuations slow convergence and make learning rate tuning harder.
CogVideoX divides the range [1, T] into n equal intervals (where n is the number of data-parallel ranks). Each rank samples uniformly within its assigned interval. Rank 0 samples from [1, T/n], rank 1 from [T/n + 1, 2T/n], and so on.
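The interval assignment described above can be sketched as follows. This is a minimal illustration using Python's standard `random` module; the function names are made up, and integer division is used so that the intervals still tile [1, T] when T is not a multiple of the rank count.

```python
import random

def rank_interval(rank, num_ranks, num_timesteps):
    """Inclusive [lo, hi] timestep range assigned to one data-parallel rank."""
    lo = rank * num_timesteps // num_ranks + 1
    hi = (rank + 1) * num_timesteps // num_ranks
    return lo, hi

def sample_timestep(rank, num_ranks, num_timesteps, rng):
    """Draw one timestep uniformly from the rank's own interval."""
    lo, hi = rank_interval(rank, num_ranks, num_timesteps)
    return rng.randint(lo, hi)  # randint includes both endpoints

T, n = 1000, 4
intervals = [rank_interval(r, n, T) for r in range(n)]
print(intervals)

rng = random.Random(0)
batch = [sample_timestep(r, n, T, rng) for r in range(n)]
print(batch)  # one timestep from each quarter of the range, every step
```

Every batch now contains exactly one sample from each interval, so the mix of noise levels stays the same from step to step.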
A video generation model is only as good as its training data. CogVideoX trains on approximately 35 million filtered video clips (averaging ~6 seconds each) plus 2 billion images from LAION-5B and COYO-700M. But raw web video is messy — full of screen recordings, lecture videos, heavily edited content, and clips with no meaningful motion. The data pipeline is CogVideoX's unsung hero.
The team defined six categories of "negative" video that hurt training.
They manually labeled 20,000 videos across these categories, then trained 6 specialized classifiers (based on Video-LLaMA) to filter the entire dataset. Additionally, optical flow scores and aesthetic scores are computed for all videos — thresholds are dynamically adjusted during training to ensure clips have sufficient motion and visual quality.
The captioning pipeline described in Chapter 4 deserves emphasis here: going from a noisy web video with a title like "LOL CAT COMPILATION #47" to a dense description like "A grey tabby cat wearing a small red hat leaps over a sleeping golden retriever on a beige carpet, landing softly and turning to look at the camera" is what makes CogVideoX's text-following so strong.
CogVideoX was released in two sizes: 2B and 5B parameters. Both are text-to-video models with additional image-to-video variants. Let's see how they stack up.
CogVideoX-5B achieves state-of-the-art in 5 out of 7 VBench metrics, and competitive results in the remaining two. The metrics that matter most for video quality are Dynamic Degree (how much motion is in the video) and Dynamic Quality (motion quality without sacrificing visual fidelity). Many competing models "cheat" by generating near-static videos that score high on visual quality but low on dynamism.
| Model | Dynamic Degree | Multiple Objects | GPT4o-MT |
|---|---|---|---|
| T2V-Turbo | 54.65 | 24.42 | — |
| AnimateDiff | 36.88 | 22.42 | 2.62 |
| VideoCrafter-2.0 | 40.66 | 25.13 | 2.68 |
| Gen-2 | 55.47 | 19.34 | 2.62 |
| Pika | 46.69 | 21.89 | 2.48 |
| CogVideoX-2B | 57.68 | 24.37 | 3.09 |
| CogVideoX-5B | 70.95 | 24.44 | 3.36 |
CogVideoX-5B's Dynamic Degree of 70.95 crushes Gen-2 (55.47) and Pika (46.69) — it generates videos with significantly more motion, which is the whole point of video generation.
The team compared CogVideoX-5B against Kling (one of the best closed-source models at the time) across four dimensions:
| Aspect | Kling | CogVideoX-5B |
|---|---|---|
| Sensory Quality | 0.638 | 0.722 |
| Instruction Following | 0.367 | 0.495 |
| Physics Simulation | 0.561 | 0.667 |
| Cover Quality | 0.668 | 0.712 |
| Total | 2.17 | 2.74 |
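CogVideoX-5B wins on every dimension, with particularly strong gains in instruction following (+35%) and physics simulation (+19%). The instruction-following gain is consistent with the deep text-video fusion that Expert AdaLN and full attention are designed to provide, although this comparison does not isolate their contribution.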
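The relative gains can be recomputed from the four per-aspect rows of the table. This small sketch uses only those scores; the dictionary layout is just for illustration.

```python
# (Kling, CogVideoX-5B) scores copied from the table above.
scores = {
    "Sensory Quality": (0.638, 0.722),
    "Instruction Following": (0.367, 0.495),
    "Physics Simulation": (0.561, 0.667),
    "Cover Quality": (0.668, 0.712),
}

# Relative gain of CogVideoX-5B over Kling, in percent.
gains = {
    aspect: (cog - kling) / kling * 100
    for aspect, (kling, cog) in scores.items()
}

for aspect, gain in gains.items():
    print(f"{aspect:<22} +{gain:.0f}%")
```

Instruction following shows the largest relative gain and cover quality the smallest.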
CogVideoX sits at the intersection of several major research threads in video generation. Here's how it connects to the broader landscape.
| Model | Relation to CogVideoX | Key Difference |
|---|---|---|
| Sora (OpenAI, 2024) | Same family: DiT backbone for video | Closed-source, reportedly similar architecture but larger scale |
| Stable Video Diffusion (Stability, 2023) | Uses 2D VAE + temporal fine-tuning | 2D VAE causes flicker; CogVideoX's 3D VAE reduces it |
| DiT (Peebles & Xie, 2023) | CogVideoX's backbone architecture | DiT is for images; CogVideoX extends it to 3D with Expert AdaLN |
| AnimateDiff (Guo et al., 2023) | Uses separated 2D+1D attention | CogVideoX uses full 3D attention for better motion coherence |
| Open-Sora (Zheng et al., 2024) | Open-source DiT for video | CogVideoX has better 3D VAE and Expert AdaLN; higher quality |
| SD3 / MMDiT (Esser et al., 2024) | Dual-stream text-image transformer | MMDiT doubles parameters with two streams; Expert AdaLN shares weights |