Yang, Teng, Zheng, Ding, Huang et al. — 2024

CogVideoX: Expert Transformer for Video

A diffusion transformer that generates 10-second, 768×1360 videos with coherent narratives, using a 3D VAE for 8×8×4 (height × width × time) compression, expert adaptive LayerNorm for text-video fusion, and progressive training for quality at scale.

Prerequisites: Diffusion models + Transformers + VAEs

Chapter 0: The Problem

Imagine you ask an AI to generate a video: "A bolt of lightning splits a rock, and a person jumps out from inside." The AI needs to produce dozens of frames that are individually photorealistic, temporally coherent (the rock doesn't teleport between frames), and semantically faithful to every detail in your prompt. That's three hard problems stacked on top of each other.

The compute challenge alone is staggering. A 10-second video at 16 fps and 768×1360 resolution is 160 frames × 768 × 1360 pixels × 3 channels ≈ 500 million values. Running a diffusion model in raw pixel space at that scale is intractable.
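The arithmetic is easy to check. A minimal sketch using only the figures quoted above (the variable names are illustrative):

```python
# Raw pixel-space size of the clip described above.
SECONDS = 10
FPS = 16
HEIGHT, WIDTH, CHANNELS = 768, 1360, 3

frames = SECONDS * FPS
raw_values = frames * HEIGHT * WIDTH * CHANNELS

print(f"{frames} frames, {raw_values:,} raw values")
```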

Previous approaches tried to tame the problem by splitting spatial and temporal processing. They'd apply 2D spatial attention within each frame, then 1D temporal attention across frames. This keeps compute manageable, but it introduces a fatal weakness: objects can't directly attend to their own position in adjacent frames. A person's head in frame i+1 can't see the head in frame i — visual information has to leak through background patches. The result? Flickering, inconsistent motion, and characters that morph between frames.
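The connectivity argument can be made concrete with a toy example. The sketch below assumes tokens are indexed by (frame, position) and models each attention pattern as a simple "may this query see this key?" predicate; the function names are hypothetical and no real attention is computed.

```python
def spatial_allowed(query, key):
    """2D spatial attention: query and key must be in the same frame."""
    return query[0] == key[0]

def temporal_allowed(query, key):
    """1D temporal attention: query and key must share a spatial position."""
    return query[1] == key[1]

def full_3d_allowed(query, key):
    """Full 3D attention: every token may attend to every other token."""
    return True

# A moving object: at position 1 in frame 0, at position 2 in frame 1.
head_frame0 = (0, 1)
head_frame1 = (1, 2)

direct_separated = (
    spatial_allowed(head_frame1, head_frame0)
    or temporal_allowed(head_frame1, head_frame0)
)
direct_full = full_3d_allowed(head_frame1, head_frame0)

print("separated 2D+1D direct link:", direct_separated)
print("full 3D direct link:", direct_full)
```

In the separated scheme the two tokens share neither a frame nor a position, so information can only travel between them through intermediate tokens over several layers.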

The three barriers: (1) Video data is 100× larger than image data — you need aggressive compression without destroying temporal continuity. (2) Text and video live in completely different feature spaces — naive concatenation doesn't align them. (3) Training on high-resolution, long-duration video from scratch is prohibitively expensive — you need curriculum strategies.
Why does separated spatial + temporal attention struggle with video generation?

Chapter 1: The Key Insight

CogVideoX solves all three barriers with a unified design philosophy: treat video as a first-class 3D signal, not a stack of 2D images.

The architecture has three interlocking components:

  1. 3D Causal VAE: compresses video jointly across space (8× on each axis) and time (4×), so the latent grid has 256× fewer spatio-temporal positions than the pixel grid. Unlike 2D VAEs that encode each frame independently (causing flicker), this 3D VAE shares information across frames during encoding, producing temporally smooth latents.
  2. Expert Transformer with Expert Adaptive LayerNorm: text and video tokens are concatenated into a single sequence, then processed through full 3D attention. But each modality gets its own LayerNorm parameters (the "expert" part), so the network can handle the radically different feature distributions without interference.
  3. Progressive Training: instead of training at full resolution from the start, CogVideoX trains at 256px first, then 512px, then 768px. At each stage it also increases video duration. This curriculum saves compute and builds robust representations bottom-up.
Why "expert"? In CogVideoX, "expert" doesn't mean a mixture-of-experts router. It means each modality (text and video) gets dedicated LayerNorm parameters that learn modality-specific scale and shift factors. The attention weights and FFN weights are shared — only the normalization is split. This is far more parameter-efficient than running two full transformers (like MMDiT does in Stable Diffusion 3).

Pipeline overview:

  1. Text prompt: T5 encoder → text embeddings z_text
  2. Video frames: 3D Causal VAE → compressed latents z_vision
  3. Concatenate: [z_text ; z_vision] along the sequence dimension
  4. Expert Transformer: full 3D attention + expert adaptive LayerNorm, × N blocks
  5. Decode: unpatchify → 3D VAE decoder → video frames

What makes the CogVideoX transformer an "expert" transformer?

Chapter 2: The 3D VAE

A raw 10-second video at 16 fps and 768×1360 resolution has 160 frames. If you encode each frame independently with a standard 2D VAE (like SDXL's, which does 8× spatial compression per axis), you get 160 × 96 × 170 × 4 ≈ 10 million latent values. That's still enormous for a transformer to chew on.

CogVideoX's 3D Causal VAE compresses along all three dimensions: 8× on each spatial axis (same as SDXL) and 4× on the temporal axis. So those 160 frames become 40 temporal steps, and the latent tensor is 40 × 96 × 170 × 16 ≈ 10 million values. That is the same number of values as the 2D case, because the 4× temporal reduction is offset by 4× more channels (16 instead of 4). The gains are elsewhere. The number of spatio-temporal positions the transformer has to attend over falls by 4×, and because temporal neighbors share information during 3D convolution, the latents encode smooth motion rather than independent snapshots. This sharply reduces the frame-to-frame flickering that plagues 2D VAE approaches.
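The shape bookkeeping for the two encoders can be sketched as follows. This uses the simplified numbers from the text (160 frames mapping to 40 temporal steps); `latent_shape` and `count` are illustrative helpers, not part of any released code.

```python
def latent_shape(frames, height, width, t_down, s_down, channels):
    """Latent grid (T, H, W, C) after temporal and spatial downsampling."""
    return (frames // t_down, height // s_down, width // s_down, channels)

def count(shape):
    """Product of all dimensions in a shape tuple."""
    total = 1
    for dim in shape:
        total *= dim
    return total

# Per-frame 2D VAE: 8x spatial, no temporal compression, 4 channels.
vae_2d = latent_shape(160, 768, 1360, t_down=1, s_down=8, channels=4)
# 3D VAE: 8x spatial, 4x temporal, 16 channels.
vae_3d = latent_shape(160, 768, 1360, t_down=4, s_down=8, channels=16)

values_2d, values_3d = count(vae_2d), count(vae_3d)
positions_2d, positions_3d = count(vae_2d[:3]), count(vae_3d[:3])

print("2D VAE:", vae_2d, values_2d, "values,", positions_2d, "positions")
print("3D VAE:", vae_3d, values_3d, "values,", positions_3d, "positions")
```

The total value count matches, but the 3D latent has a quarter as many (t, h, w) positions, which is what determines the transformer's sequence length.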

Architecture

The encoder and decoder are symmetric: stages of ResNet blocks interleaved with downsampling (encoder) or upsampling (decoder) layers. Some stages perform 3D downsampling (spatial + temporal), while others do only 2D (spatial). The temporal compression uses causal convolution: all padding is placed at the beginning of the time axis, so the encoding at time t can only depend on frames at time ≤ t. This lets the VAE process video in a streaming fashion and lets a single image be encoded as a one-frame video.

Causal convolution: Think of it like a one-way mirror in time. Frame 5 can see frames 1-5 but not frame 6. This means the VAE can start encoding before the entire video is available, and a single image is just a special case (one-frame video). All temporal padding goes to the left (past), never the right (future).
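A minimal one-dimensional sketch of the idea, assuming zero padding and a hand-picked kernel (both are illustrative choices, not details taken from the paper). Each "frame" is a single number so the time axis is the only axis.

```python
def causal_conv1d(signal, kernel):
    """Convolve along time with all padding on the left (the past)."""
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal)
    # Output at time t reads padded[t : t + len(kernel)], which corresponds
    # to signal[t - pad : t + 1]; it never looks at frames after t.
    return [
        sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
        for t in range(len(signal))
    ]

kernel = [0.25, 0.25, 0.5]
video_a = [1.0, 2.0, 3.0, 4.0, 5.0]
video_b = [1.0, 2.0, 3.0, 4.0, 99.0]  # only the last frame differs

out_a = causal_conv1d(video_a, kernel)
out_b = causal_conv1d(video_b, kernel)
single_image = causal_conv1d([7.0], kernel)

print(out_a)
print(out_b)
print(single_image)
```

Changing the final frame leaves every earlier output untouched, and a one-frame input is handled by the same code path.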

Ablation: How much compression?

The authors tested multiple compression settings:

Variant | Compression | Latent Ch. | Flickering ↓ | PSNR ↑
Baseline (SDXL 2D) | 8×8×1 | 4 | 93.2 | 28.4
A | 8×8×4 | 8 | 87.6 | 27.2
B (chosen) | 8×8×4 | 16 | 86.3 | 28.7
C | 8×8×4 | 32 | 87.7 | 30.5
D | 8×8×8 | 32 | 87.8 | 29.0
E | 16×16×8 | 128 | 87.3 | 27.9

Moving from 2D to 3D (Baseline → B) cuts flickering from 93.2 to 86.3 while improving PSNR. But pushing compression too far (variant E, 16×16×8) makes convergence extremely difficult — the model can't learn to reconstruct from such a compressed bottleneck.

The sweet spot: 8×8×4 with 16 latent channels. This gives a 4× reduction in temporal sequence length (critical for attention cost), minimal flicker, and strong reconstruction quality. The training uses L1 loss + LPIPS perceptual loss + KL loss, with a 3D discriminator GAN loss added after a few thousand steps.
Why does a 3D VAE reduce flickering compared to a 2D VAE?

Chapter 3: The Expert Transformer

Once the 3D VAE compresses the video into latent tokens z_vision, and T5 encodes the text prompt into z_text, we need a transformer that can fuse both modalities deeply. CogVideoX's solution is elegantly simple: concatenate the two token sequences and run them through a single transformer stack with full 3D attention, but give each modality its own normalization parameters.

Why not just concatenate and go?

Text embeddings from T5 and video latents from the 3D VAE live in radically different feature spaces. Text embeddings might have values clustered around [-2, 2], while video latents might range [-0.5, 0.5]. If you apply a single LayerNorm to the concatenated sequence, the normalization statistics are dominated by whichever modality has higher variance, distorting the other.

Expert Adaptive LayerNorm

The solution: split the LayerNorm into two "experts." Following DiT's design, the diffusion timestep t drives an adaptive modulation (scale γ and shift β) via a small MLP. But instead of one set of (γ, β), there are two: one set modulates the text tokens and the other modulates the video tokens.

After normalization, the tokens are recombined and fed through shared attention and FFN layers. This means the heavy computation (attention, feedforward) is shared, while the lightweight normalization is modality-specific. The parameter overhead is minimal — just two extra sets of (γ, β) per layer.
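A rough sketch of the normalization step in plain Python. It assumes the DiT-style modulation x · (1 + γ) + β, and it passes fixed (γ, β) values where the real model would derive them from the timestep through an MLP; all function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def modulate(x, scale, shift):
    """DiT-style adaptive modulation (assumed form): x * (1 + scale) + shift."""
    return [v * (1.0 + s) + b for v, s, b in zip(x, scale, shift)]

def expert_adaln(text_tokens, video_tokens, text_mod, video_mod):
    """Modulate each modality with its own (scale, shift), then re-concatenate."""
    text_out = [modulate(layer_norm(tok), *text_mod) for tok in text_tokens]
    video_out = [modulate(layer_norm(tok), *video_mod) for tok in video_tokens]
    return text_out + video_out  # one sequence for the shared attention/FFN

dim = 4
text_tokens = [[-2.0, 0.0, 2.0, 4.0]]
video_tokens = [[-0.5, 0.0, 0.25, 0.5]]

# Stand-ins for the timestep-conditioned MLP outputs.
text_mod = ([0.0] * dim, [0.0] * dim)
video_mod = ([1.0] * dim, [0.0] * dim)

sequence = expert_adaln(text_tokens, video_tokens, text_mod, video_mod)
print(sequence)
```

Only the two `(scale, shift)` pairs differ between modalities; everything downstream of the returned sequence would be shared.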

Expert AdaLN vs. MMDiT: Stable Diffusion 3's MMDiT runs two completely separate transformer streams for text and image, doubling the parameter count. CogVideoX's Expert AdaLN achieves comparable or better alignment with a single shared transformer — only the normalization layers are split. In ablations, Expert AdaLN beat MMDiT (with the same parameter budget) on both FVD and CLIP4Clip scores.

3D Full Attention

With the concatenated [text ; video] sequence, CogVideoX applies full 3D attention: every token attends to every other token, regardless of modality, spatial position, or temporal position. This means a video patch at frame 50, pixel (100, 200) can directly attend to the text token "lightning" and to the video patch at frame 1, pixel (100, 200).

This is computationally expensive — the sequence length is (text tokens) + (T/4 × H/8 × W/8) — but FlashAttention makes it tractable. The payoff is enormous: objects maintain direct line-of-sight to themselves across time, eliminating the information-leaking problem of separated 2D+1D attention.
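To get a feel for the numbers, the formula above can be evaluated directly. This sketch plugs in the 226 text tokens and the 160-frame, 768×1360 clip used elsewhere in this article; the `patch` argument is an added assumption that models any further spatial patchification (the default of 1 reproduces the formula as written).

```python
def sequence_length(text_tokens, frames, height, width, patch=1):
    """Text tokens plus one token per (T/4, H/8, W/8) latent position."""
    latent_t = frames // 4
    latent_h = height // 8
    latent_w = width // 8
    video_tokens = latent_t * (latent_h // patch) * (latent_w // patch)
    return text_tokens + video_tokens

n = sequence_length(226, 160, 768, 1360)
attention_pairs = n * n  # full attention scores every query-key pair

print(f"sequence length: {n:,}")
print(f"query-key pairs per head per layer: {attention_pairs:,}")
```

The quadratic pair count is the reason memory-efficient kernels such as FlashAttention matter here.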

3D Rotary Position Embedding (3D-RoPE)

Each video latent lives at a 3D coordinate (x, y, t). CogVideoX extends standard RoPE to three dimensions by independently applying 1D-RoPE to each axis, allocating 3/8 of the hidden channels to x, 3/8 to y, and 2/8 to t. The results are concatenated along the channel dimension. This captures both spatial locality and temporal ordering with a single relative positional encoding.
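The channel split can be sketched in a few lines. This is an illustrative implementation under common RoPE conventions (adjacent channel pairs rotated, base 10000), which are assumptions rather than details confirmed by the text; the head dimension must be a multiple of 16 so each slice has an even width.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate adjacent channel pairs by angles proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def rope_3d(vec, x, y, t):
    """Apply 1D RoPE per axis: 3/8 of channels for x, 3/8 for y, 2/8 for t."""
    d = len(vec)
    dx = dy = 3 * d // 8
    assert d % 8 == 0 and dx % 2 == 0 and (d - dx - dy) % 2 == 0
    return (
        rope_1d(vec[:dx], x)
        + rope_1d(vec[dx:dx + dy], y)
        + rope_1d(vec[dx + dy:], t)
    )

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.1 * i for i in range(16)]
k = [1.0 - 0.05 * i for i in range(16)]

# Same relative offset (1, 1, 2) at two different absolute locations.
score_near = dot(rope_3d(q, 1, 2, 3), rope_3d(k, 0, 1, 1))
score_far = dot(rope_3d(q, 5, 6, 7), rope_3d(k, 4, 5, 5))
print(score_near, score_far)
```

The two scores agree up to floating-point error, which is the "relative" property: the attention logit depends on the offset between two tokens, not on where they sit in the clip.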

How does CogVideoX's Expert AdaLN differ from MMDiT's approach?

Chapter 4: Text-Video Alignment

Getting a video model to faithfully follow a text prompt is harder than it sounds. The prompt "a cat wearing a tiny hat jumps over a sleeping dog" requires the model to bind attributes (tiny hat → cat, sleeping → dog), maintain object identity across frames, and orchestrate a temporal narrative (jumping is not standing). CogVideoX's alignment strategy has two parts: the text encoder and the fusion mechanism.

T5 Text Encoder

CogVideoX uses the encoder of T5-XXL (about 4.7B of the encoder-decoder model's roughly 11B parameters) to encode text prompts. Unlike CLIP text encoders used in Stable Diffusion (which are trained on short image captions), T5 was pretrained on diverse language tasks and handles complex, multi-sentence prompts much better. The text is encoded into a sequence of 226 tokens (padded/truncated to a fixed length) with 4096-dimensional embeddings.

Why T5 over CLIP? CLIP maxes out at 77 tokens and was trained with contrastive loss on (image, caption) pairs — it captures visual concepts well but struggles with compositional language ("a red ball on top of a blue cube"). T5 handles long, structured prompts because it was trained as a general-purpose language model.

Deep Fusion via Concatenation

The text embeddings are not injected via cross-attention (as in Stable Diffusion) or via a conditioning mechanism (as in DALL-E). Instead, they're concatenated directly with the video latent tokens along the sequence dimension. This means text tokens participate in every self-attention layer alongside video tokens — there's no information bottleneck.

In cross-attention designs, text information can only influence video through the K/V projections of dedicated cross-attention layers. In CogVideoX's full-attention design, text tokens and video tokens are peers — each text token can attend to every video token and vice versa, at every layer.

Why concatenation beats cross-attention: Cross-attention creates an asymmetry: video tokens attend to text (via cross-attn), but text doesn't attend to video. CogVideoX's approach is symmetric — text tokens can also "see" the video, which helps the model learn bidirectional alignment. The Expert AdaLN ensures this shared attention doesn't corrupt either modality's representations.

Video Captioning Pipeline

The quality of text-video alignment depends critically on the quality of training captions. Most web videos have short, vague descriptions ("funny cat video"). CogVideoX built a dense captioning pipeline:

  1. Generate short captions using the Panda70M model
  2. Extract key frames and generate dense image captions with CogVLM
  3. Use GPT-4 to summarize all frame-level captions into a coherent video description
  4. Fine-tune a LLaMA-2 on GPT-4 summaries for scalable captioning
  5. Finally, fine-tune an end-to-end model (CogVLM2-Caption) for fast inference

This pipeline produces detailed, temporally-aware captions that describe what happens, not just what's visible in a single frame.

Why does CogVideoX concatenate text and video tokens instead of using cross-attention?

Chapter 5: Progressive Training

Training a 5B-parameter video model at 768×1360 resolution from scratch would require astronomical compute. CogVideoX uses a progressive training curriculum that starts cheap and scales up gradually.

Resolution Progression

The model trains through multiple stages, each at increasing resolution:

Stage | Resolution | What it learns
1 | 256px (short side) | Semantic understanding, basic motion, object categories
2 | 512px | Mid-frequency details, textures, finer motion
3 | 768px | High-frequency details, sharp edges, realistic textures
4 | Fine-tune | High-quality curated data, aesthetic refinement

At each resolution, the aspect ratio is preserved (the short side is resized to the target, the long side scales proportionally). This means the model sees diverse aspect ratios throughout training — it doesn't just learn square crops.

Multi-Resolution Frame Pack

Training videos come in all lengths: 2 seconds, 5 seconds, 10 seconds. Fixed-duration training wastes data — you'd have to truncate long videos and discard short ones. CogVideoX's Frame Pack strategy (inspired by Patch'n Pack from NaViT) packs videos of different durations and resolutions into the same batch. Each video is patchified and positioned using 3D-RoPE, so the model naturally handles mixed shapes within a single forward pass.
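One way to picture packing is as a bin-packing problem over token counts. The sketch below is a generic greedy first-fit packer with per-video segment ids; it is an illustration of the general idea, not the authors' implementation, and the token counts and budget are made up.

```python
def pack_videos(token_counts, budget):
    """Greedy first-fit: place each video in the first pack with room left."""
    packs = []  # each pack is a list of (video_index, token_count)
    for idx, n in enumerate(token_counts):
        if n > budget:
            raise ValueError(f"video {idx} has {n} tokens, budget is {budget}")
        for pack in packs:
            if sum(c for _, c in pack) + n <= budget:
                pack.append((idx, n))
                break
        else:
            packs.append([(idx, n)])
    return packs

def segment_ids(pack):
    """Label each token with the slot of the video it came from."""
    ids = []
    for slot, (_, n) in enumerate(pack):
        ids += [slot] * n
    return ids

# Hypothetical token counts for four clips of different length/resolution.
packs = pack_videos([6, 3, 5, 2], budget=8)
print(packs)
print(segment_ids(packs[0]))
```

Segment ids like these are what a packed batch would typically use to keep each video's positions and attention separate from its neighbors in the same sequence.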

Why progressive training works: Low-resolution training is far cheaper than high-resolution training. Halving the resolution on each axis gives 4× fewer tokens per frame, and because full attention scales quadratically with sequence length, the attention cost drops by roughly 16×; the smaller samples also allow larger batches. The model learns semantic concepts (what a dog looks like, how running works) at 256px, then learns to render those concepts in high fidelity at 768px. Starting at high resolution from scratch wastes compute on learning semantics with expensive tokens.
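The scaling behind this argument is simple to tabulate. A small sketch, assuming token count grows with the square of the per-axis resolution and attention cost with the square of the token count (other costs, such as the FFN, scale only linearly in tokens and are ignored here):

```python
def relative_cost(low_px, high_px):
    """Return (token ratio, attention-cost ratio) between two resolutions."""
    scale = high_px / low_px
    token_ratio = scale ** 2            # tokens grow with area
    attention_ratio = token_ratio ** 2  # attention is quadratic in tokens
    return token_ratio, attention_ratio

for low, high in [(256, 512), (512, 768), (256, 768)]:
    tokens, attention = relative_cost(low, high)
    print(f"{low}px -> {high}px: {tokens:.2f}x tokens, {attention:.2f}x attention cost")
```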

Duration Progression

The 3D VAE is also trained progressively: first on 17-frame videos, then fine-tuned on 161-frame videos using context parallelism. The transformer follows a similar schedule — shorter clips first, then longer ones. This prevents the model from being overwhelmed by the quadratic attention cost of long sequences before it has learned basic visual representations.

Why is progressive training more efficient than training at full resolution from the start?

Chapter 6: Explicit Uniform Sampling

Diffusion models train by sampling a random timestep t for each data point, adding noise at that level, and learning to denoise. The standard training objective is:

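$$L(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t \right) \right\|^2 \right]$$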
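A single term of this expectation can be computed by hand. The sketch below assumes a linear beta schedule to obtain ᾱ_t and uses a placeholder `dummy_model` that always predicts zero noise; both are illustrative assumptions, not details of CogVideoX's training setup.

```python
import math
import random

def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t, linear schedule (assumed)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod

def q_sample(x0, eps, abar):
    """Noised input: sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    return [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]

def eps_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def dummy_model(x_t, t):
    """Hypothetical stand-in for the denoiser: predicts zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
t = rng.randint(1, 1000)

x_t = q_sample(x0, eps, alpha_bar(t))
loss = eps_loss(eps, dummy_model(x_t, t))
print(t, loss)
```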

where t is uniformly sampled from [1, T]. In practice, each GPU rank independently samples a random t. With, say, 64 GPUs and T = 1000 timesteps, each batch only covers 64 random timesteps — and due to randomness, some ranges get oversampled while others get missed entirely.

The problem with random sampling

The loss magnitude varies dramatically across timesteps: near t = 0 (almost no noise), the loss is small. Near t = T (almost pure noise), the loss is large. When the sampled timesteps cluster unevenly, the aggregate loss fluctuates wildly between batches — not because the model is unstable, but because the timestep distribution is noisy. These fluctuations slow convergence and make learning rate tuning harder.

The fix: Explicit Uniform Sampling

CogVideoX divides the range [1, T] into n equal intervals (where n is the number of data-parallel ranks). Each rank samples uniformly within its assigned interval. Rank 0 samples from [1, T/n], rank 1 from [T/n + 1, 2T/n], and so on.
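The interval assignment can be sketched as below. Integer division is used so the n intervals tile [1, T] exactly even when T is not divisible by n; that rounding choice, the function name, and the assumption that n ≤ T are mine, not the paper's.

```python
import random

def explicit_uniform_timestep(rank, world_size, T=1000, rng=random):
    """Sample t uniformly from this rank's own slice of [1, T]."""
    lo = rank * T // world_size + 1
    hi = (rank + 1) * T // world_size
    return rng.randint(lo, hi)  # inclusive on both ends

T, world_size = 1000, 64
rng = random.Random(0)
timesteps = [explicit_uniform_timestep(r, world_size, T, rng) for r in range(world_size)]
print(timesteps)
```

Because every rank draws from a different slice, the batch always contains one timestep from each region of the noise schedule, whereas independent sampling can leave whole regions empty in a given step.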

Simple but effective: This ensures every batch covers the full timestep range evenly. The loss curve becomes dramatically smoother, and in ablations, the loss at every timestep decreases faster — meaning Explicit Uniform Sampling doesn't just reduce variance, it actually accelerates convergence.
What does Explicit Uniform Sampling do differently from standard random timestep sampling?

Chapter 7: Data Pipeline

A video generation model is only as good as its training data. CogVideoX trains on approximately 35 million filtered video clips (averaging ~6 seconds each) plus 2 billion images from LAION-5B and COYO-700M. But raw web video is messy — full of screen recordings, lecture videos, heavily edited content, and clips with no meaningful motion. The data pipeline is CogVideoX's unsung hero.

Video Filtering

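The team defined six categories of "negative" video that hurt training, covering content such as heavily edited footage, lecture-type videos, screen recordings, and clips with no meaningful motion.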

They manually labeled 20,000 videos across these categories, then trained 6 specialized classifiers (based on Video-LLaMA) to filter the entire dataset. Additionally, optical flow scores and aesthetic scores are computed for all videos — thresholds are dynamically adjusted during training to ensure clips have sufficient motion and visual quality.
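The resulting filter logic amounts to a conjunction of checks. The sketch below is purely illustrative: the label names, score scales, and thresholds are hypothetical, and the classifier outputs are passed in as ready-made booleans rather than computed.

```python
def keep_clip(negative_flags, flow_score, aesthetic_score, min_flow, min_aesthetic):
    """Keep a clip only if no negative classifier fires and both scores pass."""
    if any(negative_flags.values()):
        return False
    return flow_score >= min_flow and aesthetic_score >= min_aesthetic

# Hypothetical classifier outputs and scores for two clips.
clip_ok = {"lecture_type": False, "screen_recording": False}
clip_lecture = {"lecture_type": True, "screen_recording": False}

# Thresholds are passed in so they can be changed between training stages.
print(keep_clip(clip_ok, flow_score=3.2, aesthetic_score=5.1, min_flow=1.0, min_aesthetic=4.5))
print(keep_clip(clip_lecture, flow_score=3.2, aesthetic_score=5.1, min_flow=1.0, min_aesthetic=4.5))
print(keep_clip(clip_ok, flow_score=0.2, aesthetic_score=5.1, min_flow=1.0, min_aesthetic=4.5))
```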

Video Captioning Pipeline

The captioning pipeline described in Chapter 4 deserves emphasis here: going from a noisy web video with a title like "LOL CAT COMPILATION #47" to a dense description like "A grey tabby cat wearing a small red hat leaps over a sleeping golden retriever on a beige carpet, landing softly and turning to look at the camera" is what makes CogVideoX's text-following so strong.

Data quality > data quantity: The filtering pipeline removes roughly 80% of raw web video. The remaining 35M clips with dense captions give better results than training on the full, noisy corpus. This echoes a lesson from image generation: DALL-E 3's breakthrough was largely attributed to its recaptioning pipeline, not architectural changes.
Why does CogVideoX filter out "lecture type" videos from training?

Chapter 8: Results

CogVideoX was released in two sizes: 2B and 5B parameters. Both are text-to-video models with additional image-to-video variants. Let's see how they stack up.

Automated Benchmarks

CogVideoX-5B achieves state-of-the-art in 5 out of 7 VBench metrics, and competitive results in the remaining two. The metrics that matter most for video quality are Dynamic Degree (how much motion is in the video) and Dynamic Quality (motion quality without sacrificing visual fidelity). Many competing models "cheat" by generating near-static videos that score high on visual quality but low on dynamism.

Model | Dynamic Degree | Multiple Objects | GPT4o-MT
T2V-Turbo | 54.65 | 24.42 | -
AnimateDiff | 36.88 | 22.42 | 2.62
VideoCrafter-2.0 | 40.66 | 25.13 | 2.68
Gen-2 | 55.47 | 19.34 | 2.62
Pika | 46.69 | 21.89 | 2.48
CogVideoX-2B | 57.68 | 24.37 | 3.09
CogVideoX-5B | 70.95 | 24.44 | 3.36

CogVideoX-5B's Dynamic Degree of 70.95 crushes Gen-2 (55.47) and Pika (46.69) — it generates videos with significantly more motion, which is the whole point of video generation.

Human Evaluation

The team compared CogVideoX-5B against Kling (one of the best closed-source models at the time) across four dimensions:

Aspect | Kling | CogVideoX-5B
Sensory Quality | 0.638 | 0.722
Instruction Following | 0.367 | 0.495
Physics Simulation | 0.561 | 0.667
Cover Quality | 0.668 | 0.712
Total | 2.17 | 2.74

CogVideoX-5B wins on every dimension, with particularly strong gains in instruction following (+35%) and physics simulation (+19%). This validates the Expert AdaLN's deep text-video fusion.
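The percentage gains follow directly from the per-aspect scores in the table. A short check (the dictionary layout and helper name are illustrative):

```python
# (Kling, CogVideoX-5B) scores from the human evaluation table.
scores = {
    "Sensory Quality": (0.638, 0.722),
    "Instruction Following": (0.367, 0.495),
    "Physics Simulation": (0.561, 0.667),
    "Cover Quality": (0.668, 0.712),
}

def relative_gain(baseline, new):
    """Percentage improvement of new over baseline."""
    return (new - baseline) / baseline * 100.0

gains = {aspect: relative_gain(*pair) for aspect, pair in scores.items()}
for aspect, gain in gains.items():
    print(f"{aspect}: +{gain:.1f}%")
```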

Scaling works: CogVideoX-2B is already competitive with most open and closed models. CogVideoX-5B is definitively better across all metrics. The architecture is designed to scale further — the authors note that performance consistently improves with model size, data volume, and training compute.
What metric best captures CogVideoX-5B's advantage over competitors?

Chapter 9: Connections

CogVideoX sits at the intersection of several major research threads in video generation. Here's how it connects to the broader landscape.

Model | Relation to CogVideoX | Key Difference
Sora (OpenAI, 2024) | Same family: DiT backbone for video | Closed-source, reportedly similar architecture but larger scale
Stable Video Diffusion (Stability, 2023) | Uses 2D VAE + temporal fine-tuning | 2D VAE causes flicker; CogVideoX's 3D VAE eliminates it
DiT (Peebles & Xie, 2023) | CogVideoX's backbone architecture | DiT is for images; CogVideoX extends it to 3D with Expert AdaLN
AnimateDiff (Guo et al., 2023) | Uses separated 2D+1D attention | CogVideoX uses full 3D attention for better motion coherence
Open-Sora (Zheng et al., 2024) | Open-source DiT for video | CogVideoX has better 3D VAE and Expert AdaLN; higher quality
SD3 / MMDiT (Esser et al., 2024) | Dual-stream text-image transformer | MMDiT doubles parameters with two streams; Expert AdaLN shares weights

The open-source impact: CogVideoX was the first commercial-grade open-source video generation model at its scale. The release of the 2B and 5B checkpoints, along with the 3D VAE and captioning model, catalyzed a wave of follow-up work and community experimentation. It demonstrated that open models can compete with closed-source systems like Gen-2 and Pika on human preference.

What CogVideoX got right

Open questions

What is CogVideoX's key architectural advantage over Stable Video Diffusion?