CogVideoX — Veanors

Chapter 0: The Problem

Imagine you ask an AI to generate a video: "A bolt of lightning splits a rock, and a person jumps out from inside." The AI needs to produce dozens of frames that are individually photorealistic, temporally coherent (the rock doesn't teleport between frames), and semantically faithful to every detail in your prompt. That's three hard problems stacked on top of each other.

The compute challenge alone is staggering. A 10-second video at 16 fps and 768×1360 resolution is 160 frames × 768 × 1360 pixels × 3 channels ≈ 500 million values. Running a diffusion model in raw pixel space at that scale is intractable.

Previous approaches tried to tame the problem by splitting spatial and temporal processing. They'd apply 2D spatial attention within each frame, then 1D temporal attention across frames. This keeps compute manageable, but it introduces a fatal weakness: objects can't directly attend to their own position in adjacent frames. A person's head in frame i+1 can't see the head in frame i — visual information has to leak through background patches. The result? Flickering, inconsistent motion, and characters that morph between frames.

The three barriers: (1) Video data is 100× larger than image data — you need aggressive compression without destroying temporal continuity. (2) Text and video live in completely different feature spaces — naive concatenation doesn't align them. (3) Training on high-resolution, long-duration video from scratch is prohibitively expensive — you need curriculum strategies.

Why does separated spatial + temporal attention struggle with video generation?

Objects in one frame can't directly attend to the same object in adjacent frames; information must pass through background patches, causing flickering and inconsistency
It uses too much memory
It can only generate 2D images, not videos

Chapter 1: The Key Insight

CogVideoX solves all three barriers with a unified design philosophy: treat video as a first-class 3D signal, not a stack of 2D images.

The architecture has three interlocking components:

3D Causal VAE: compresses video jointly across space (8× on each axis) and time (4×), a 256× reduction in the number of spatiotemporal positions. Unlike 2D VAEs that encode each frame independently (causing flicker), this 3D VAE shares information across frames during encoding, producing temporally smooth latents.
Expert Transformer with Expert Adaptive LayerNorm: text and video tokens are concatenated into a single sequence, then processed through full 3D attention. But each modality gets its own LayerNorm parameters (the "expert" part), so the network can handle the radically different feature distributions without interference.
Progressive Training: instead of training at full resolution from the start, CogVideoX trains at 256px first, then 512px, then 768px. At each stage it also increases video duration. This curriculum saves compute and builds robust representations bottom-up.

Why "expert"? In CogVideoX, "expert" doesn't mean a mixture-of-experts router. It means each modality (text and video) gets dedicated LayerNorm parameters that learn modality-specific scale and shift factors. The attention weights and FFN weights are shared; only the normalization is split. This is far more parameter-efficient than giving each modality its own set of transformer weights, as MMDiT does in Stable Diffusion 3.

Text prompt: T5 encoder → text embeddings z_text
Video frames: 3D Causal VAE → compressed latents z_vision
Concatenate: [z_text ; z_vision] along the sequence dimension
Expert Transformer: full 3D attention + expert adaptive LayerNorm, × N blocks
Decode: unpatchify → 3D VAE decoder → video frames

What makes the CogVideoX transformer an "expert" transformer?

It uses a mixture-of-experts router to select different FFN blocks
Each modality (text and video) gets its own dedicated LayerNorm parameters, while sharing the attention and FFN weights
It has separate transformer stacks for text and video

Chapter 2: The 3D VAE

A raw 10-second video at 16 fps and 768×1360 resolution has 160 frames. If you encode each frame independently with a standard 2D VAE (like SDXL's, which does 8× spatial compression per axis), you get 160 × 96 × 170 × 4 ≈ 10 million latent values. That's still enormous for a transformer to chew on.

CogVideoX's 3D Causal VAE compresses along all three dimensions: 8× on each spatial axis (same as SDXL) and 4× on the temporal axis. Those 160 frames become 40 temporal steps, and the latent holds 40 × 96 × 170 × 16 ≈ 10 million values. That is the same number of values as the 2D case, because the channel count rises from 4 to 16, but they sit at 4× fewer spatiotemporal positions (about 0.65 million instead of 2.6 million), and the position count is what sets the transformer's sequence length. The second difference is quality: because temporal neighbors share information during 3D convolution, the latents encode smooth motion rather than independent snapshots. This reduces the frame-to-frame flickering that plagues 2D VAE approaches.
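The arithmetic in the two paragraphs above can be checked with a few lines of Python. This is only a sketch: `latent_grid` is a helper name introduced here for illustration, and it assumes the frame count and resolution divide evenly by the downsampling factors, as they do in the example.

```python
def latent_grid(frames, height, width, t_down, s_down, channels):
    """Return (positions, values) of a latent tensor after downsampling."""
    t, h, w = frames // t_down, height // s_down, width // s_down
    positions = t * h * w            # spatiotemporal locations in the latent
    return positions, positions * channels

FRAMES, HEIGHT, WIDTH = 160, 768, 1360

# Per-frame 2D VAE: 8x spatial, no temporal compression, 4 channels.
pos_2d, val_2d = latent_grid(FRAMES, HEIGHT, WIDTH, t_down=1, s_down=8, channels=4)
# 3D VAE: 8x spatial, 4x temporal, 16 channels.
pos_3d, val_3d = latent_grid(FRAMES, HEIGHT, WIDTH, t_down=4, s_down=8, channels=16)

print(pos_2d, val_2d)    # 2611200 10444800
print(pos_3d, val_3d)    # 652800 10444800
print(pos_2d // pos_3d)  # 4
```

The total number of values is identical, but the 3D latent has a quarter as many positions, which is the quantity that drives attention cost downstream.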

Architecture

The encoder and decoder are symmetric: stages of ResNet blocks interleaved with downsampling (encoder) or upsampling (decoder) layers. Some stages perform 3D downsampling (spatial + temporal), while others do only 2D (spatial). The temporal compression uses causal convolution: all padding is placed at the beginning of the time axis, so the encoding at time t can depend only on frames at time ≤ t. Future frames never influence the encoding of earlier ones, and a single image can be encoded as a one-frame video, which lets images and videos share the same VAE.

Causal convolution: Think of it like a one-way mirror in time. Frame 5 can see frames 1-5 but not frame 6. This means the VAE can start encoding before the entire video is available, and a single image is just a special case (one-frame video). All temporal padding goes to the left (past), never the right (future).
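A minimal NumPy sketch of the left-padding idea, reduced to one dimension (time only). The zero padding, the kernel values, and the function name `causal_conv1d` are assumptions made for illustration; the actual VAE uses learned 3D convolutions and may pad differently.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1D convolution over time with all padding on the left (the past)."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # nothing is padded on the right
    # Output t is a weighted sum of inputs t-k+1 .. t only.
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

rng = np.random.default_rng(0)
x = rng.normal(size=8)                 # one scalar "frame" per time step
kernel = np.array([0.2, 0.3, 0.5])

y = causal_conv1d(x, kernel)

# Perturb frame 5 and re-encode: outputs before frame 5 must not change.
x_future = x.copy()
x_future[5] += 10.0
y_future = causal_conv1d(x_future, kernel)

print(np.allclose(y[:5], y_future[:5]))  # True
print(np.allclose(y[5:], y_future[5:]))  # False
```

Because the padding supplies the missing past, the same function also accepts a length-1 input, which mirrors the "single image as a one-frame video" case.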

Ablation: How much compression?

The authors tested multiple compression settings:

Variant	Compression (H×W×T)	Latent channels	Flickering ↓	PSNR ↑
Baseline (SDXL 2D)	8×8×1	4	93.2	28.4
A	8×8×4	8	87.6	27.2
B (chosen)	8×8×4	16	86.3	28.7
C	8×8×4	32	87.7	30.5
D	8×8×8	32	87.8	29.0
E	16×16×8	128	87.3	27.9

Moving from 2D to 3D (Baseline → B) cuts flickering from 93.2 to 86.3 while improving PSNR. But pushing compression too far (variant E, 16×16×8) makes convergence extremely difficult — the model can't learn to reconstruct from such a compressed bottleneck.

The sweet spot: 8×8×4 with 16 latent channels. This gives a 4× reduction in temporal sequence length (critical for attention cost), minimal flicker, and strong reconstruction quality. The training uses L1 loss + LPIPS perceptual loss + KL loss, with a 3D discriminator GAN loss added after a few thousand steps.


Why does a 3D VAE reduce flickering compared to a 2D VAE?

The 3D convolutions share information across adjacent frames during encoding, producing temporally smooth latents instead of independent per-frame snapshots
It compresses more aggressively, so there are fewer pixels to flicker
It uses a higher frame rate

Chapter 3: The Expert Transformer

Once the 3D VAE compresses the video into latent tokens z_vision, and T5 encodes the text prompt into z_text, we need a transformer that can fuse both modalities deeply. CogVideoX's solution is elegantly simple: concatenate the two token sequences and run them through a single transformer stack with full 3D attention — but give each modality its own normalization parameters.

Why not just concatenate and go?

Text embeddings from T5 and video latents from the 3D VAE live in radically different feature spaces and can differ in numerical scale. Text embeddings might have values clustered around [-2, 2], while video latents might range over [-0.5, 0.5]. LayerNorm itself computes its statistics per token, so one modality does not contaminate the other's mean and variance. The problem is the modulation that follows: with a single shared scale and shift, the same affine transform has to suit both distributions, which forces a compromise between them.

Expert Adaptive LayerNorm

The solution: split the LayerNorm into two "experts." Following DiT's design, the diffusion timestep t drives an adaptive modulation (scale γ and shift β) via a small MLP. But instead of one set of (γ, β), there are two:

Vision Expert AdaLN — applies modulation to video hidden states only
Text Expert AdaLN — applies modulation to text hidden states only

After normalization, the tokens are recombined and fed through shared attention and FFN layers. This means the heavy computation (attention, feedforward) is shared, while the lightweight normalization is modality-specific. The parameter overhead is minimal — just two extra sets of (γ, β) per layer.
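The split described above can be sketched in NumPy. Everything here is illustrative rather than the released implementation: the function names, the single linear projection standing in for the "small MLP", the `(1 + scale)` modulation form borrowed from DiT, and the toy dimensions are all assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Parameter-free LayerNorm over the last (feature) axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def expert_adaln(hidden, n_text, t_emb, w_text, w_vision):
    """Normalize all tokens, then modulate text and vision tokens separately.

    hidden:   (seq_len, dim), text tokens first, then vision tokens
    t_emb:    (emb_dim,) timestep embedding
    w_text, w_vision: (emb_dim, 2 * dim) hypothetical per-modality projections
    """
    normed = layer_norm(hidden)
    out = np.empty_like(normed)
    experts = ((slice(0, n_text), w_text), (slice(n_text, None), w_vision))
    for tokens, weight in experts:
        scale, shift = np.split(t_emb @ weight, 2)   # each has shape (dim,)
        out[tokens] = normed[tokens] * (1.0 + scale) + shift
    return out

rng = np.random.default_rng(0)
dim, emb_dim, n_text, n_vision = 16, 8, 4, 10
hidden = np.concatenate([rng.normal(0, 2.0, (n_text, dim)),      # "text" tokens
                         rng.normal(0, 0.5, (n_vision, dim))])   # "vision" tokens
t_emb = rng.normal(size=emb_dim)
w_text = rng.normal(scale=0.1, size=(emb_dim, 2 * dim))
w_vision = rng.normal(scale=0.1, size=(emb_dim, 2 * dim))

out = expert_adaln(hidden, n_text, t_emb, w_text, w_vision)
print(out.shape)  # (14, 16)
```

The output keeps the original token order, so it can be passed straight into attention and FFN layers that are shared by both modalities.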

Expert AdaLN vs. MMDiT: Stable Diffusion 3's MMDiT keeps a separate set of weights for the text and image streams and joins the two only inside the attention operation, which roughly doubles the per-block parameter count. CogVideoX's Expert AdaLN achieves comparable or better alignment with a single shared transformer in which only the normalization layers are split. In ablations, Expert AdaLN beat MMDiT (with the same parameter budget) on both FVD and CLIP4Clip scores.

3D Full Attention

With the concatenated [text ; video] sequence, CogVideoX applies full 3D attention: every token attends to every other token, regardless of modality, spatial position, or temporal position. This means a video patch at frame 50, pixel (100, 200) can directly attend to the text token "lightning" and to the video patch at frame 1, pixel (100, 200).

This is computationally expensive: with spatial patch size p (p = 2 in the released checkpoints), the sequence length is (text tokens) + (T/4 × H/(8p) × W/(8p)), and attention cost grows quadratically with it. FlashAttention makes it tractable. The payoff is enormous: objects maintain direct line-of-sight to themselves across time, which removes the information-leaking problem of separated 2D+1D attention.
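To get a feel for the numbers, the sketch below counts tokens for the running 160-frame, 768×1360 example and compares the number of query-key pairs under full 3D attention with a separated spatial-then-temporal scheme. It assumes a 2×2 spatial patchify step after the VAE (set `patch=1` to count raw latent positions instead) and 226 text tokens; `video_tokens` is a helper name introduced here.

```python
def video_tokens(frames, height, width, t_down=4, s_down=8, patch=2):
    """Token grid (t, h, w) after VAE compression and spatial patchify."""
    t = frames // t_down
    h = height // (s_down * patch)
    w = width // (s_down * patch)
    return t, h, w

TEXT_TOKENS = 226
t, h, w = video_tokens(frames=160, height=768, width=1360)
n_video = t * h * w
seq_len = TEXT_TOKENS + n_video
print((t, h, w), seq_len)  # (40, 48, 85) 163426

# Query-key pairs among video tokens only.
full_pairs = n_video ** 2                                  # every token sees every token
separated_pairs = t * (h * w) ** 2 + (h * w) * t ** 2      # per-frame 2D + per-location 1D
print(f"{full_pairs / separated_pairs:.1f}x")              # 39.6x
```

Under these assumptions full attention touches roughly 40 times as many pairs as the separated design, which is the price paid for letting every patch attend directly to its own past.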

3D Rotary Position Embedding (3D-RoPE)

Each video latent lives at a 3D coordinate (x, y, t). CogVideoX extends standard RoPE to three dimensions by independently applying 1D-RoPE to each axis, allocating 3/8 of the hidden channels to x, 3/8 to y, and 2/8 to t. The results are concatenated along the channel dimension. This captures both spatial locality and temporal ordering with a single relative positional encoding.
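A compact NumPy sketch of the channel split follows. The 3/8, 3/8, 2/8 allocation comes from the text; the interleaved pairing of channels, the frequency base of 10000, and the function names are assumptions chosen for illustration, and the head dimension must be a multiple of 16 so that every slice has an even width.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x (n, d) by angles set by positions pos (n,)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) one frequency per pair
    angles = pos[:, None] * freqs[None, :]         # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, coords):
    """Apply 1D RoPE per axis on channel slices of 3/8 (x), 3/8 (y), 2/8 (t)."""
    d = x.shape[-1]
    dx = dy = 3 * d // 8
    parts = [
        rope_1d(x[:, :dx], coords[:, 0]),
        rope_1d(x[:, dx:dx + dy], coords[:, 1]),
        rope_1d(x[:, dx + dy:], coords[:, 2]),
    ]
    return np.concatenate(parts, axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 16)), rng.normal(size=(1, 16))
a, b = np.array([[3, 5, 2]]), np.array([[1, 4, 0]])   # (x, y, t) coordinates
shift = np.array([[7, 7, 7]])

s1 = rope_3d(q, a) @ rope_3d(k, b).T
s2 = rope_3d(q, a + shift) @ rope_3d(k, b + shift).T
print(np.allclose(s1, s2))  # True
```

The final check illustrates the "relative" property: shifting both tokens by the same offset leaves their attention score unchanged, so only the coordinate difference matters.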

How does CogVideoX's Expert AdaLN differ from MMDiT's approach?

Expert AdaLN shares the attention and FFN weights across modalities but uses separate LayerNorm parameters, while MMDiT uses completely separate transformer streams, doubling parameters for comparable results
Expert AdaLN uses more parameters because it has extra normalization layers
Expert AdaLN only works with text, not video

Chapter 4: Text-Video Alignment

Getting a video model to faithfully follow a text prompt is harder than it sounds. The prompt "a cat wearing a tiny hat jumps over a sleeping dog" requires the model to bind attributes (tiny hat → cat, sleeping → dog), maintain object identity across frames, and orchestrate a temporal narrative (jumping is not standing). CogVideoX's alignment strategy has two parts: the text encoder and the fusion mechanism.

T5 Text Encoder

CogVideoX uses the encoder of T5-XXL (about 4.7B parameters; the full encoder-decoder model has about 11B) to encode text prompts. Unlike the CLIP text encoders used in Stable Diffusion (which are trained on short image captions), T5 was pretrained as a general text-to-text language model and handles complex, multi-sentence prompts much better. The text is encoded into a sequence of 226 tokens (padded/truncated to a fixed length) with 4096-dimensional embeddings.

Why T5 over CLIP? CLIP maxes out at 77 tokens and was trained with contrastive loss on (image, caption) pairs — it captures visual concepts well but struggles with compositional language ("a red ball on top of a blue cube"). T5 handles long, structured prompts because it was trained as a general-purpose language model.

Deep Fusion via Concatenation

The text embeddings are not injected via cross-attention (as in Stable Diffusion) or via a conditioning mechanism (as in DALL-E). Instead, they're concatenated directly with the video latent tokens along the sequence dimension. This means text tokens participate in every self-attention layer alongside video tokens — there's no information bottleneck.

In cross-attention designs, text information can only influence video through the K/V projections of dedicated cross-attention layers. In CogVideoX's full-attention design, text tokens and video tokens are peers — each text token can attend to every video token and vice versa, at every layer.

Why concatenation beats cross-attention: Cross-attention creates an asymmetry: video tokens attend to text (via cross-attn), but text doesn't attend to video. CogVideoX's approach is symmetric — text tokens can also "see" the video, which helps the model learn bidirectional alignment. The Expert AdaLN ensures this shared attention doesn't corrupt either modality's representations.

Video Captioning Pipeline

The quality of text-video alignment depends critically on the quality of training captions. Most web videos have short, vague descriptions ("funny cat video"). CogVideoX built a dense captioning pipeline:

Generate short captions using the Panda70M model
Extract key frames and generate dense image captions with CogVLM
Use GPT-4 to summarize all frame-level captions into a coherent video description
Fine-tune a LLaMA-2 on GPT-4 summaries for scalable captioning
Finally, fine-tune an end-to-end model (CogVLM2-Caption) for fast inference

This pipeline produces detailed, temporally-aware captions that describe what happens, not just what's visible in a single frame.

Why does CogVideoX concatenate text and video tokens instead of using cross-attention?

Cross-attention creates asymmetry: video attends to text but text doesn't attend to video. Concatenation + full attention gives symmetric, bidirectional alignment at every layer
Cross-attention is slower than concatenation
Concatenation reduces the model size

Chapter 5: Progressive Training

Training a 5B-parameter video model at 768×1360 resolution from scratch would require astronomical compute. CogVideoX uses a progressive training curriculum that starts cheap and scales up gradually.

Resolution Progression

The model trains through multiple stages, each at increasing resolution:

Stage	Resolution	What it learns
1	256px (short side)	Semantic understanding, basic motion, object categories
2	512px	Mid-frequency details, textures, finer motion
3	768px	High-frequency details, sharp edges, realistic textures
4	Fine-tune	High-quality curated data, aesthetic refinement

At each resolution, the aspect ratio is preserved (the short side is resized to the target, the long side scales proportionally). This means the model sees diverse aspect ratios throughout training — it doesn't just learn square crops.

Multi-Resolution Frame Pack

Training videos come in all lengths: 2 seconds, 5 seconds, 10 seconds. Fixed-duration training wastes data — you'd have to truncate long videos and discard short ones. CogVideoX's Frame Pack strategy (inspired by Patch'n Pack from NaViT) packs videos of different durations and resolutions into the same batch. Each video is patchified and positioned using 3D-RoPE, so the model naturally handles mixed shapes within a single forward pass.

Why progressive training works: Low-resolution training is far cheaper than high-resolution training. Halving the resolution on each axis gives 4× fewer tokens per frame, and because self-attention cost grows quadratically with token count, attention becomes roughly 16× cheaper; the smaller activations also allow larger batches. The model learns semantic concepts (what a dog looks like, how running works) at 256px, then learns to render those concepts in high fidelity at 768px. Starting at high resolution from scratch wastes compute on learning semantics with expensive tokens.
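The scaling argument can be made concrete for the three resolution stages. This is a back-of-the-envelope sketch that assumes a fixed aspect ratio and frame count, so tokens scale with the square of the short side, and that attention cost is quadratic in token count; `relative_cost` is a name introduced here.

```python
def relative_cost(short_side, reference=256):
    """Token count and attention cost relative to the reference resolution."""
    tokens = (short_side / reference) ** 2   # tokens scale with pixel area
    attention = tokens ** 2                  # self-attention is quadratic in tokens
    return tokens, attention

for side in (256, 512, 768):
    tokens, attention = relative_cost(side)
    print(f"{side}px: {tokens:.0f}x tokens, {attention:.0f}x attention cost")
# 256px: 1x tokens, 1x attention cost
# 512px: 4x tokens, 16x attention cost
# 768px: 9x tokens, 81x attention cost
```

Linear layers scale with the token count rather than its square, so the real end-to-end ratio sits somewhere between the two columns.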

Duration Progression

The 3D VAE is also trained progressively: first on 17-frame videos, then fine-tuned on 161-frame videos using context parallelism. The transformer follows a similar schedule — shorter clips first, then longer ones. This prevents the model from being overwhelmed by the quadratic attention cost of long sequences before it has learned basic visual representations.

Why is progressive training more efficient than training at full resolution from the start?

Low-resolution stages are much cheaper (fewer tokens, larger batches), and the model learns semantic concepts first before spending expensive compute on high-frequency details
It uses less GPU memory
It only trains for fewer steps total

Chapter 6: Explicit Uniform Sampling

Diffusion models train by sampling a random timestep t for each data point, adding noise at that level, and learning to denoise. The standard training objective is:

L(θ) = E_{t, x₀, ε} [|| ε − ε_θ(√ᾱ_t x₀ + √(1 − ᾱ_t) ε, t) ||²]

where t is uniformly sampled from [1, T]. In practice, each GPU rank independently samples a random t. With, say, 64 GPUs and T = 1000 timesteps, each batch only covers 64 random timesteps — and due to randomness, some ranges get oversampled while others get missed entirely.
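For readers who want to see the objective in code, here is a minimal NumPy sketch of one training sample: draw t, build the noised input, and score a noise prediction. The linear beta schedule, the toy tensor shape, and the all-zero "model" are assumptions for illustration only, and the loss is averaged over elements rather than summed.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed DDPM-style linear schedule
alpha_bar = np.cumprod(1.0 - betas)      # alpha_bar[t - 1] belongs to timestep t

def noisy_sample(x0, t, eps):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, for t in 1..T."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def eps_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))             # stand-in for a clean latent
eps = rng.normal(size=x0.shape)
t = int(rng.integers(1, T + 1))          # uniform over {1, ..., T}

x_t = noisy_sample(x0, t, eps)
loss = eps_loss(np.zeros_like(eps), eps) # a trivial predictor that always outputs 0
print(t, x_t.shape, loss)
```

In a data-parallel run every rank executes the `rng.integers` line independently, which is where the uneven timestep coverage described above comes from.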

The problem with random sampling

The loss magnitude varies substantially across timesteps: denoising a nearly clean input is a very different task from denoising nearly pure noise, and which end yields the larger loss depends on the prediction target. When the sampled timesteps cluster unevenly, the aggregate loss fluctuates wildly between batches, not because the model is unstable but because the timestep distribution is noisy. These fluctuations slow convergence and make learning rate tuning harder.

The fix: Explicit Uniform Sampling

CogVideoX divides the range [1, T] into n equal intervals (where n is the number of data-parallel ranks). Each rank samples uniformly within its assigned interval. Rank 0 samples from [1, T/n], rank 1 from [T/n + 1, 2T/n], and so on.

Simple but effective: This ensures every batch covers the full timestep range evenly. The loss curve becomes dramatically smoother, and in ablations, the loss at every timestep decreases faster — meaning Explicit Uniform Sampling doesn't just reduce variance, it actually accelerates convergence.
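The interval scheme is short enough to sketch directly. The code below simulates all ranks in one process, assumes one timestep per rank, and uses integer interval boundaries from `np.linspace`; the function names are introduced here, and a real implementation would have each rank compute only its own interval.

```python
import numpy as np

def explicit_uniform_timesteps(num_ranks, T, rng):
    """One timestep per rank, each drawn from that rank's own slice of [1, T]."""
    edges = np.linspace(0, T, num_ranks + 1).astype(int)   # interval boundaries
    # Rank i samples an integer in [edges[i] + 1, edges[i + 1]].
    return np.array([rng.integers(edges[i] + 1, edges[i + 1] + 1)
                     for i in range(num_ranks)])

def covered_intervals(ts, num_ranks, T):
    """How many of the num_ranks intervals contain at least one sample."""
    edges = np.linspace(0, T, num_ranks + 1).astype(int) + 0.5
    counts, _ = np.histogram(ts, bins=edges)
    return int(np.count_nonzero(counts))

rng = np.random.default_rng(0)
num_ranks, T = 64, 1000

explicit = explicit_uniform_timesteps(num_ranks, T, rng)
independent = rng.integers(1, T + 1, size=num_ranks)   # standard per-rank sampling

print(covered_intervals(explicit, num_ranks, T))       # 64
print(covered_intervals(independent, num_ranks, T))    # typically well below 64
```

Each rank's marginal distribution over many steps is still uniform on its interval, so the union over ranks remains uniform on [1, T]; only the per-batch coverage changes.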

What does Explicit Uniform Sampling do differently from standard random timestep sampling?

It divides the timestep range into equal intervals, one per GPU rank, so each rank samples within its own interval, ensuring the batch covers all timesteps uniformly
It always samples the same timestep
It samples more from high-noise timesteps

Chapter 7: Data Pipeline

A video generation model is only as good as its training data. CogVideoX trains on approximately 35 million filtered video clips (averaging ~6 seconds each) plus 2 billion images from LAION-5B and COYO-700M. But raw web video is messy — full of screen recordings, lecture videos, heavily edited content, and clips with no meaningful motion. The data pipeline is CogVideoX's unsung hero.

Video Filtering

The team defined six categories of "negative" video that hurt training:

Editing artifacts — Videos with noticeable post-processing, special effects, or re-editing that distort natural dynamics
Lack of motion connectivity — Videos with hard cuts, spliced segments, or sequences made from static images
Low quality — Blurry, shaky, or poorly lit footage
Lecture type — A person talking with minimal motion (low visual diversity)
Text dominated — Videos with large text overlays or primarily textual content
Noisy screenshots — Screen recordings with poor quality

They manually labeled 20,000 videos across these categories, then trained 6 specialized classifiers (based on Video-LLaMA) to filter the entire dataset. Additionally, optical flow scores and aesthetic scores are computed for all videos — thresholds are dynamically adjusted during training to ensure clips have sufficient motion and visual quality.
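The filtering logic amounts to "no negative label, and both scores above an adjustable threshold". The sketch below shows that rule on three made-up clips. The `Clip` fields, label strings, scores, and threshold values are all hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    name: str
    negative_labels: set     # labels assigned by the (hypothetical) classifiers
    flow_score: float        # optical-flow score, higher means more motion
    aesthetic_score: float

def keep(clip, min_flow, min_aesthetic):
    """Keep clips with no negative label and scores above both thresholds."""
    return (not clip.negative_labels
            and clip.flow_score >= min_flow
            and clip.aesthetic_score >= min_aesthetic)

# Toy inputs, invented purely to exercise the rule.
clips = [
    Clip("surfing", set(), flow_score=0.80, aesthetic_score=0.70),
    Clip("lecture", {"lecture_type"}, flow_score=0.10, aesthetic_score=0.60),
    Clip("slow_pan", set(), flow_score=0.05, aesthetic_score=0.90),
]

kept = [c.name for c in clips if keep(c, min_flow=0.2, min_aesthetic=0.5)]
print(kept)  # ['surfing']
```

Passing the thresholds as arguments rather than constants reflects the point that they are adjusted over the course of training.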

Video Captioning Pipeline

The captioning pipeline described in Chapter 4 deserves emphasis here: going from a noisy web video with a title like "LOL CAT COMPILATION #47" to a dense description like "A grey tabby cat wearing a small red hat leaps over a sleeping golden retriever on a beige carpet, landing softly and turning to look at the camera" is what makes CogVideoX's text-following so strong.

Data quality > data quantity: The filtering pipeline removes roughly 80% of raw web video. The remaining 35M clips with dense captions give better results than training on the full, noisy corpus. This echoes a lesson from image generation: DALL-E 3's breakthrough was largely attributed to its recaptioning pipeline, not architectural changes.

Why does CogVideoX filter out "lecture type" videos from training?

They contain minimal effective motion (mostly a static person talking), which doesn't help the model learn diverse, dynamic video generation
They are too long for the model to process
They violate copyright

Chapter 8: Results

CogVideoX was released in two sizes: 2B and 5B parameters. Both are text-to-video models with additional image-to-video variants. Let's see how they stack up.

Automated Benchmarks

CogVideoX-5B achieves state-of-the-art in 5 out of 7 VBench metrics, and competitive results in the remaining two. The metrics that matter most for video quality are Dynamic Degree (how much motion is in the video) and Dynamic Quality (motion quality without sacrificing visual fidelity). Many competing models "cheat" by generating near-static videos that score high on visual quality but low on dynamism.

Model	Dynamic Degree	Multiple Objects	GPT4o-MT
T2V-Turbo	54.65	24.42	—
AnimateDiff	36.88	22.42	2.62
VideoCrafter-2.0	40.66	25.13	2.68
Gen-2	55.47	19.34	2.62
Pika	46.69	21.89	2.48
CogVideoX-2B	57.68	24.37	3.09
CogVideoX-5B	70.95	24.44	3.36

CogVideoX-5B's Dynamic Degree of 70.95 crushes Gen-2 (55.47) and Pika (46.69) — it generates videos with significantly more motion, which is the whole point of video generation.

Human Evaluation

The team compared CogVideoX-5B against Kling (one of the best closed-source models at the time) across four dimensions:

Aspect	Kling	CogVideoX-5B
Sensory Quality	0.638	0.722
Instruction Following	0.367	0.495
Physics Simulation	0.561	0.667
Cover Quality	0.668	0.712
Total	2.17	2.74

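CogVideoX-5B wins on every dimension, with particularly strong gains in instruction following (+35%) and physics simulation (+19%). The instruction-following gain is consistent with the deep text-video fusion that Expert AdaLN and full attention provide, although the comparison does not isolate that component.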
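The quoted percentages are relative gains over Kling's score, which can be reproduced from the table values. This is just the arithmetic; the dictionary names are introduced here.

```python
kling = {
    "Sensory Quality": 0.638,
    "Instruction Following": 0.367,
    "Physics Simulation": 0.561,
    "Cover Quality": 0.668,
}
cogvideox_5b = {
    "Sensory Quality": 0.722,
    "Instruction Following": 0.495,
    "Physics Simulation": 0.667,
    "Cover Quality": 0.712,
}

# Relative gain = (new - old) / old for each aspect.
gains = {aspect: (cogvideox_5b[aspect] - kling[aspect]) / kling[aspect]
         for aspect in kling}

for aspect, gain in gains.items():
    print(f"{aspect}: {gain:+.1%}")
# Sensory Quality: +13.2%
# Instruction Following: +34.9%
# Physics Simulation: +18.9%
# Cover Quality: +6.6%
```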

Scaling works: CogVideoX-2B is already competitive with most open and closed models. CogVideoX-5B improves on it in every column of the benchmark table above, although VideoCrafter-2.0 still edges it out on Multiple Objects (25.13 vs. 24.44). The architecture is designed to scale further: the authors note that performance consistently improves with model size, data volume, and training compute.

What metric best captures CogVideoX-5B's advantage over competitors?

Dynamic Degree: CogVideoX-5B scores 70.95 vs Gen-2's 55.47, showing it generates videos with substantially more meaningful motion instead of near-static clips
PSNR of the VAE
Number of parameters

Chapter 9: Connections

CogVideoX sits at the intersection of several major research threads in video generation. Here's how it connects to the broader landscape.

Model	Relation to CogVideoX	Key Difference
Sora (OpenAI, 2024)	Same family: DiT backbone for video	Closed-source, reportedly similar architecture but larger scale
Stable Video Diffusion (Stability, 2023)	Uses 2D VAE + temporal fine-tuning	2D VAE causes flicker; CogVideoX's 3D VAE reduces it
DiT (Peebles & Xie, 2023)	CogVideoX's backbone architecture	DiT is for images; CogVideoX extends it to 3D with Expert AdaLN
AnimateDiff (Guo et al., 2023)	Uses separated 2D+1D attention	CogVideoX uses full 3D attention for better motion coherence
Open-Sora (Zheng et al., 2024)	Open-source DiT for video	CogVideoX has better 3D VAE and Expert AdaLN; higher quality
SD3 / MMDiT (Esser et al., 2024)	Dual-stream text-image transformer	MMDiT doubles parameters with two streams; Expert AdaLN shares weights

The open-source impact: CogVideoX was the first commercial-grade open-source video generation model at its scale. The release of the 2B and 5B checkpoints, along with the 3D VAE and captioning model, catalyzed a wave of follow-up work and community experimentation. It demonstrated that open models can compete with closed-source systems like Gen-2 and Pika on human preference.

What CogVideoX got right

3D-first design — treating video as inherently 3D (not 2D + time afterthought) at every level: VAE, attention, positional encoding
Minimal modality-specific parameters — Expert AdaLN adds only normalization parameters per modality, not entire transformer streams
Data engineering as architecture — the captioning pipeline and filtering system are as important as the model design
Progressive training — a practical necessity that also improves final quality

Open questions

Can the 3D VAE be pushed further (16×16×8 compression) with better training recipes?
How does the approach scale to minute-long videos?
Can the Expert AdaLN framework extend to audio-video or video-action generation?

What is CogVideoX's key architectural advantage over Stable Video Diffusion?

CogVideoX uses a 3D VAE that compresses temporally (reducing flicker) and full 3D attention (enabling direct cross-frame object tracking), while SVD uses a 2D VAE with temporal layers added post-hoc
CogVideoX has more parameters
CogVideoX uses CLIP instead of T5
