Stable Video Diffusion

Chapter 0: The Problem

You can generate stunning images with diffusion models. A single prompt produces a photorealistic 1024-pixel scene in seconds. But now try this: generate a video of the same scene. Twenty-five frames, each at 576×1024. Suddenly everything falls apart.

The first frame looks great. The second is plausible. By frame five, the dog has grown a third ear. By frame fourteen, the background has drifted into a different country. The fundamental problem is temporal coherence: making every frame look individually good is hard enough, but making them look good together — maintaining consistent objects, lighting, and motion across time — is a much harder problem.

Why is video so much harder than images? Three reasons:

Exponential complexity. A single 576×1024 image has ~590K pixels. A 25-frame video has ~14.7M pixels, all of which must be jointly consistent. The pixel count grows only linearly with the number of frames, but the space of possible videos grows exponentially with it.
Motion modeling. The model must learn not just what the world looks like, but how it moves. Objects have inertia, cameras pan smoothly, water flows downstream. This requires understanding physics.
Data scarcity. High-quality image datasets contain billions of well-captioned examples. High-quality video datasets? Orders of magnitude smaller, noisier, and poorly captioned.
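To make the scaling concrete, here is a minimal sketch of the arithmetic. The function names and the assumption of 256 intensity levels over 3 color channels are illustrative choices, not something the paper specifies.

```python
import math

HEIGHT, WIDTH = 576, 1024  # SVD's output resolution

def pixels_per_frame(height: int = HEIGHT, width: int = WIDTH) -> int:
    return height * width

def pixels_per_video(num_frames: int) -> int:
    # Grows linearly with the number of frames.
    return num_frames * pixels_per_frame()

def log10_possible_videos(num_frames: int, levels: int = 256, channels: int = 3) -> float:
    # log10 of levels ** (number of values in the clip).
    # The count itself is exponential in the frame count, so we report its logarithm.
    return pixels_per_video(num_frames) * channels * math.log10(levels)

for frames in (1, 14, 25):
    print(frames, pixels_per_video(frames), round(log10_possible_videos(frames)))
```

Doubling the frame count doubles the pixel count but squares the number of possible clips, which is the sense in which the search space explodes.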

The central question: By late 2023, latent diffusion models had been turned into video generators by inserting temporal layers and fine-tuning on small video datasets. But training strategies varied wildly, and nobody had systematically studied what matters most. Is it the architecture? The training procedure? Or the data? Stable Video Diffusion answers this decisively: it's the data.

Why is generating temporally coherent video fundamentally harder than generating single images?

Every frame must be individually high-quality AND jointly consistent across time, which requires the model to understand motion and physics and to maintain object identity across 14-25 frames
Videos have more pixels per frame
Video models use a different architecture

Chapter 1: The Key Insight

Prior to SVD, most video generation research focused on architectural innovations: what kind of temporal layers to add, how to arrange spatial and temporal attention, whether to use 3D convolutions or factored 2D+1D. The implicit assumption was that the model design is what matters most.

SVD flips this assumption. The paper's central finding: systematic data curation is more important than model architecture. A simple architecture trained on well-curated data consistently beats a sophisticated architecture trained on uncurated data.

The paper identifies two key principles:

Three-stage training. Don't train a video model from scratch. Start from a powerful image model (Stage I), pretrain on a large curated video dataset (Stage II), then fine-tune on a small high-quality video dataset (Stage III). Each stage builds on the previous.
Curation over scale. A 2.3M-clip curated dataset outperforms a 10M-clip uncurated dataset. Quality beats quantity. And remarkably, the benefits of good curation in Stage II persist even after Stage III fine-tuning on entirely different data.

The surprising result: When you compare two models that differ only in whether their Stage II pretraining data was curated, the curated model is still clearly better after 50K steps of Stage III fine-tuning on the same high-quality dataset. Data curation doesn't just help during pretraining: its effects survive the entire fine-tuning run. Within the 50K steps studied, the model that learned better motion priors during pretraining does not lose that advantage.

This insight has profound implications. While the research community was racing to invent better temporal attention mechanisms, SVD showed that you could use a relatively standard architecture (from Align Your Latents, Blattmann et al. 2023) and focus your effort on building a better dataset pipeline. The result: state-of-the-art image-to-video generation, competitive text-to-video, and a model that transfers to multi-view 3D synthesis.

What is SVD's core finding about what matters most for video generation quality?

Systematic data curation matters more than model architecture: a simple model on curated data beats a complex model on uncurated data
Larger models always produce better videos
More training steps are the key factor

Chapter 2: Data Curation

The raw internet is full of video, but most of it is terrible training data. Static security camera footage. Slideshows with no motion. Videos with burned-in text overlays. Clips with jarring cuts in the middle. How do you turn 580 million raw video clips into a clean dataset for learning motion?

SVD introduces a systematic curation pipeline with four stages:

Step 1: Cut Detection

Raw videos often contain hidden cuts — scene transitions, fades, jump cuts that don't appear in metadata. SVD applies a cascaded cut detection pipeline at three different FPS levels, revealing roughly 4× more cuts than metadata suggests. After splitting, the average clips-per-video jumps from ~2.65 to ~11.09.
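The idea of cascading a detector over several frame rates can be sketched as follows. This is a toy illustration, not the paper's detector: each frame is reduced to a single hypothetical scalar feature, and the strides, threshold, and `min_gap` merge rule are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

def detect_cuts(signal: Sequence[float], stride: int, threshold: float) -> List[int]:
    """Flag frame i when it differs strongly from the frame `stride` steps earlier."""
    return [
        i for i in range(stride, len(signal), stride)
        if abs(signal[i] - signal[i - stride]) > threshold
    ]

def cascaded_cuts(signal, strides=(1, 2, 4), threshold=0.5, min_gap=4) -> List[int]:
    """Union the detections from every stride, then merge detections closer than min_gap."""
    candidates = sorted({c for s in strides for c in detect_cuts(signal, s, threshold)})
    merged: List[int] = []
    for cut in candidates:
        if not merged or cut - merged[-1] >= min_gap:
            merged.append(cut)
    return merged

def split_into_clips(num_frames: int, cuts: List[int]) -> List[Tuple[int, int]]:
    bounds = [0] + cuts + [num_frames]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]

# Hard cut at frame 5, then a slow fade over frames 10-14.
signal = [0.0] * 5 + [1.0] * 5 + [1.2, 1.4, 1.6, 1.8] + [2.0] * 6
print(detect_cuts(signal, 1, 0.5))                        # full frame rate only
print(split_into_clips(len(signal), cascaded_cuts(signal)))
```

At full frame rate the fade never exceeds the threshold between adjacent frames, so only the hard cut is found. Sampling at a lower rate makes the fade look abrupt, which is why looking at several frame rates recovers more cuts.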

Step 2: Caption Generation

Each clip gets three captions: (1) CoCa image captioner on the middle frame, (2) V-BLIP video captioner for temporal descriptions, and (3) an LLM-based summary merging both. This multi-source approach captures both static scene content and dynamic motion.

Step 3: Annotation

Every clip is annotated with dense optical flow (at 2 FPS), OCR text detection, CLIP embeddings on first/middle/last frames, and aesthetic scores. These annotations become the filtering criteria.

Step 4: Filtering

This is where the magic happens. For each annotation type, SVD systematically removes the bottom 12.5%, 25%, and 50% of clips, trains a model on each subset, and runs human preference studies to find the optimal threshold. The result: the Large Video Dataset (LVD) of 580M clips is filtered down to LVD-F of 152M clips.
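The threshold sweep can be sketched in a few lines. The clip ids and scores below are made up for illustration; in the real pipeline each candidate subset would be used to train a model that is then judged in a human preference study.

```python
from typing import Dict, List

def drop_bottom_fraction(scores: Dict[str, float], fraction: float) -> List[str]:
    """Return the clip ids that remain after removing the lowest-scoring fraction."""
    ranked = sorted(scores, key=scores.get)        # ascending by score
    num_dropped = int(len(ranked) * fraction)
    return sorted(ranked[num_dropped:])

# Hypothetical per-clip optical-flow magnitudes (higher means more motion).
flow_scores = {f"clip_{i}": float(i) for i in range(8)}

# One candidate training subset per cutoff named in the text.
candidate_subsets = {
    fraction: drop_bottom_fraction(flow_scores, fraction)
    for fraction in (0.125, 0.25, 0.5)
}
for fraction, kept in candidate_subsets.items():
    print(f"drop {fraction:.1%}: keep {len(kept)} clips")
```

The same function would be applied once per annotation type (motion, OCR, aesthetics), each with its own score dictionary.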

Key numbers: LVD starts at 580M clips (212 years of content). After filtering: LVD-F has 152M clips (50.6 years). The filtered dataset is 4× smaller but produces clearly better models. For ablations, LVD-10M (9.8M random subset) vs LVD-10M-F (2.3M curated subset) shows the same pattern: the 4× smaller curated set wins on human preference in both visual quality and prompt alignment.
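A quick back-of-the-envelope check on these figures, using only the rounded numbers quoted above (so the results are approximate):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def mean_clip_seconds(total_years: float, num_clips: float) -> float:
    """Average clip length implied by total duration and clip count."""
    return total_years * SECONDS_PER_YEAR / num_clips

lvd_mean = mean_clip_seconds(212.0, 580e6)     # unfiltered LVD
lvd_f_mean = mean_clip_seconds(50.6, 152e6)    # filtered LVD-F
size_ratio = 580e6 / 152e6                     # how much smaller LVD-F is

print(f"LVD:   {lvd_mean:.1f} s per clip")
print(f"LVD-F: {lvd_f_mean:.1f} s per clip")
print(f"LVD is {size_ratio:.1f}x larger than LVD-F")
```

Both datasets come out at roughly 10 to 12 seconds per clip, so filtering mostly removes whole clips rather than shortening them.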

How does SVD determine the optimal filtering threshold for each annotation type (motion, aesthetics, OCR)?

They train separate models on subsets with different filtering cutoffs and use human preference studies (Elo rankings) to select the threshold that produces the best model
They use a fixed percentile cutoff for all metrics
They manually inspect random samples

Chapter 3: Three-Stage Training

SVD identifies three distinct training stages, each serving a different purpose. Think of it as building a house: the foundation (image understanding), the structure (motion understanding), and the finishing (high-quality output).

Stage I: Image Pretraining

Start with Stable Diffusion 2.1 — a powerful text-to-image model already trained on billions of images. This gives the model a strong visual representation: it understands objects, scenes, lighting, texture, and composition. The spatial layers (2D convolutions, spatial attention) come fully trained.

The ablation is decisive: a video model initialized from SD 2.1 clearly outperforms one with randomly initialized spatial weights in both visual quality and prompt alignment.

Stage II: Video Pretraining

Insert temporal layers (temporal convolutions + temporal attention) after every spatial layer, then train on the curated LVD-F dataset. This stage runs at 256×384 resolution with 14 frames using the EDM noise schedule. The model learns its core motion representation here: how objects move, how cameras pan, what physically plausible motion looks like.

Crucially, the full model is fine-tuned (not just temporal layers), and the noise schedule is shifted toward higher noise values — essential for later high-resolution fine-tuning.

Stage III: High-Quality Fine-tuning

Fine-tune on a small (~250K clips), highly curated dataset of visually stunning videos at 576×1024 resolution. The noise schedule shifts further toward higher noise. This stage runs for 50K iterations and produces the final model.
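The three stages can be written down as a small configuration table. This is only a sketch: the `Stage` dataclass and its field names are hypothetical, and any value the text above does not state is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    resolution: Optional[Tuple[int, int]]  # (height, width); None if not stated above
    num_frames: Optional[int]
    iterations: Optional[int]
    init_from: Optional[str]               # name of the stage whose weights are reused

STAGES = [
    Stage("I: image pretraining", "billions of images (Stable Diffusion 2.1)",
          None, None, None, None),
    Stage("II: video pretraining", "LVD-F, 152M curated clips",
          (256, 384), 14, None, "I: image pretraining"),
    Stage("III: high-quality fine-tuning", "~250K high-quality clips",
          (576, 1024), None, 50_000, "II: video pretraining"),
]

# Each stage starts from the weights produced by the one before it.
for previous, current in zip(STAGES, STAGES[1:]):
    assert current.init_from == previous.name
```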

Three-Stage Training Pipeline

Stage I: Image Pretraining

Stable Diffusion 2.1, trained on billions of images. Learns visual representation. Spatial layers fully trained.

↓ transfer spatial weights

Stage II: Video Pretraining

152M curated clips (LVD-F) at 256×384, 14 frames. Insert temporal layers. Full model fine-tuning. Learns motion prior.

↓ transfer full model

Stage III: HQ Fine-tuning

~250K high-quality clips at 576×1024. Shifted noise schedule. 50K iterations. Final model.

Why three stages? Each stage operates at a different scale-quality tradeoff. Stage I: massive scale, image-only, learns what the world looks like. Stage II: large scale, curated video, learns how the world moves. Stage III: small scale, premium quality, learns to produce beautiful output. Skipping Stage II (going directly from images to HQ fine-tuning) produces clearly worse results — the model never learns a general motion prior.

What does each training stage contribute to the final model?

Stage I provides visual understanding from image pretraining, Stage II learns general motion priors from large-scale curated video, Stage III refines output quality through high-resolution fine-tuning
All three stages train the same thing at different resolutions
Stage I trains temporal layers, Stage II trains spatial layers, Stage III combines them

Chapter 4: Architecture

SVD's architecture is deliberately simple — the paper's thesis is that data matters more than architecture. It builds on the Video LDM framework from Align Your Latents (Blattmann et al., 2023) with a few important modifications.

The Base: Latent Diffusion

Like Stable Diffusion, SVD operates in latent space. A pretrained autoencoder compresses each video frame from pixel space (3×H×W) into a latent representation, downsampling by 8× along each spatial dimension (a 576×1024 frame becomes a 72×128 latent with 4 channels). The diffusion model operates on these latents, and the decoder reconstructs pixel frames. The U-Net therefore processes 64× fewer spatial positions per frame than pixel-space diffusion would.
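The shape bookkeeping is easy to check by hand. The sketch below assumes the standard Stable Diffusion autoencoder, which downsamples by a factor of 8 along each spatial dimension and produces 4 latent channels; the function name is illustrative.

```python
from typing import Tuple

def latent_shape(num_frames: int, height: int, width: int,
                 downsample: int = 8, latent_channels: int = 4) -> Tuple[int, int, int, int]:
    """Shape (frames, channels, height, width) of the latent video."""
    assert height % downsample == 0 and width % downsample == 0
    return (num_frames, latent_channels, height // downsample, width // downsample)

frames, height, width = 25, 576, 1024
shape = latent_shape(frames, height, width)

pixel_positions = height * width
latent_positions = shape[2] * shape[3]
pixel_values = frames * 3 * height * width
latent_values = shape[0] * shape[1] * shape[2] * shape[3]

print(shape)                                  # latent video shape
print(pixel_positions // latent_positions)    # reduction in spatial positions
print(pixel_values // latent_values)          # reduction in total values
```

Spatial positions shrink by 64×, while the total number of values shrinks by 48× because the latent has 4 channels instead of 3.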

Temporal Layers

The key modification to turn an image U-Net into a video U-Net: after every spatial convolution block, insert a temporal convolution (1D conv along the time axis). After every spatial attention block, insert a temporal attention block. The temporal convolutions capture local motion patterns (frame-to-frame changes), while temporal attention captures longer-range dependencies (an object that disappears and reappears).
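The split between spatial and temporal processing comes down to which axis a layer is allowed to mix. The NumPy sketch below illustrates that with a fixed smoothing kernel; real temporal convolutions have learned, per-channel weights, and temporal attention is omitted here. All names are illustrative.

```python
import numpy as np

def spatial_view(x: np.ndarray) -> np.ndarray:
    """(B, T, C, H, W) -> (B*T, C, H, W): spatial layers see frames as independent images."""
    b, t, c, h, w = x.shape
    return x.reshape(b * t, c, h, w)

def temporal_conv1d(x: np.ndarray, kernel) -> np.ndarray:
    """Filter along the time axis only, with edge padding so the frame count is preserved."""
    pad = len(kernel) // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    num_frames = x.shape[1]
    out = np.zeros_like(x)
    for offset, weight in enumerate(kernel):
        out += weight * padded[:, offset:offset + num_frames]
    return out

# One video, 5 frames, 2 channels, 3x3 latents; only the middle frame is non-zero.
video = np.zeros((1, 5, 2, 3, 3))
video[:, 2] = 1.0

smoothed = temporal_conv1d(video, kernel=[0.25, 0.5, 0.25])
print(spatial_view(video).shape)
print(smoothed[0, :, 0, 0, 0])  # the impulse spreads to neighboring frames
```

Each pixel location is filtered independently over time, so information moves between frames but never between spatial positions. That is the complement of what the spatial layers do.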

EDM Framework

SVD adopts the EDM (Elucidated Diffusion Model) framework from Karras et al. (2022) instead of the original DDPM formulation. EDM provides a cleaner noise schedule parameterization and better training dynamics. Critically, SVD shifts the noise schedule toward higher noise values for high-resolution training. Neighboring pixels in a high-resolution frame are highly redundant, so a given noise level destroys less of the signal than it would at low resolution; training on noisier inputs during fine-tuning compensates for this.
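In EDM, training noise levels are drawn from a log-normal distribution, so "shifting toward higher noise" amounts to raising the mean of ln σ. The sketch below uses the defaults from Karras et al. (P_mean = -1.2, P_std = 1.2, σ_data = 0.5). The shifted mean of 0.5 is an illustrative assumption, not SVD's actual setting.

```python
import math
import random

def sample_sigma(rng: random.Random, p_mean: float, p_std: float) -> float:
    """Draw a training noise level: ln(sigma) ~ Normal(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))

def median_sigma(p_mean: float) -> float:
    # The median of a log-normal distribution is exp(mean of the log).
    return math.exp(p_mean)

def preconditioning(sigma: float, sigma_data: float = 0.5):
    """EDM scalings for D(x) = c_skip * x + c_out * F(c_in * x, c_noise)."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    c_noise = math.log(sigma) / 4.0
    return c_skip, c_out, c_in, c_noise

rng = random.Random(0)
base = [sample_sigma(rng, -1.2, 1.2) for _ in range(5)]
shifted = [sample_sigma(rng, 0.5, 1.2) for _ in range(5)]  # hypothetical shifted mean
print(median_sigma(-1.2), median_sigma(0.5))
print(preconditioning(0.5))
```

Raising the mean moves the whole distribution of training noise levels upward, so the model spends more of its training on heavily corrupted inputs.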

Micro-Conditioning

SVD conditions the model on frame rate as a micro-conditioning signal, similar to how Stable Diffusion XL conditions on image resolution. This lets the model learn that fast motion (high FPS) and slow motion (low FPS) are different generation modes, not conflicting training signals.

Video U-Net Architecture

The architecture interleaves spatial and temporal layers. Spatial layers process each frame independently; temporal layers mix information across frames.

Why full fine-tuning? Many prior works only trained the newly inserted temporal layers, keeping spatial layers frozen. SVD fine-tunes the entire model. The reasoning: video frames are not independent images. Spatial features need to adapt to support temporal coherence. A face detector trained on individual photos needs adjustment when those photos are adjacent frames — the features need to encode identity-preserving information, not just per-frame aesthetics.

What are the two types of temporal layers inserted into the image U-Net to create the video U-Net?

Temporal convolutions (1D conv along time axis for local motion) after each spatial conv, and temporal attention blocks (for long-range temporal dependencies) after each spatial attention
3D convolutions that replace all 2D convolutions
Recurrent layers (LSTM) inserted between encoder and decoder

Chapter 5: Image-to-Video Generation

Text-to-video is impressive, but image-to-video is arguably more useful: you provide a single image, and the model generates a video that brings it to life. The input image becomes the first frame, and the model hallucinates plausible motion forward in time.

How It Works

SVD converts its text-to-video base model into an image-to-video model with two modifications:

Replace text conditioning with image conditioning. Instead of feeding CLIP text embeddings into the cross-attention layers, feed the CLIP image embedding of the conditioning frame.
Concatenate the conditioning frame. A noise-augmented version of the input image is concatenated channel-wise to the noisy latent input at every denoising step. The same frame is copied across the entire time axis — no masking tricks needed.

Noise Augmentation

The conditioning frame gets noise added to it before concatenation. This is a crucial trick from the cascaded diffusion literature (Ho et al., 2022). Without noise augmentation, the model overfits to the exact pixel values of the conditioning frame. With noise augmentation, it learns to use the frame as a guide rather than a constraint, producing more natural motion.
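Noise augmentation and channel-wise concatenation fit together as in the NumPy sketch below. The function name, the `aug_strength` parameter, and the latent sizes are assumptions for illustration; the released implementation may organize this differently.

```python
import numpy as np

def build_model_input(noisy_latents: np.ndarray, cond_latent: np.ndarray,
                      aug_strength: float, rng: np.random.Generator) -> np.ndarray:
    """noisy_latents: (T, C, h, w); cond_latent: (C, h, w). Returns (T, 2C, h, w)."""
    # Noise augmentation: perturb the conditioning frame before the model sees it.
    noised_cond = cond_latent + aug_strength * rng.standard_normal(cond_latent.shape)
    # Copy the same (noised) frame across the whole time axis.
    repeated = np.broadcast_to(noised_cond, noisy_latents.shape)
    # Concatenate along the channel axis.
    return np.concatenate([noisy_latents, repeated], axis=1)

rng = np.random.default_rng(0)
noisy = rng.standard_normal((14, 4, 72, 128))   # 14 latent frames being denoised
cond = rng.standard_normal((4, 72, 128))        # latent of the input image

model_input = build_model_input(noisy, cond, aug_strength=0.1, rng=rng)
print(model_input.shape)
```

The first half of the channels carries the video being denoised and the second half carries the conditioning frame, so the first convolution of the U-Net needs twice as many input channels as in the text-to-video model.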

Progressive Guidance

Standard classifier-free guidance uses a constant guidance scale across all frames. SVD found this causes problems: too little guidance makes early frames inconsistent with the conditioning image; too much causes oversaturation in later frames. The solution: linearly increase the guidance scale along the frame axis — low guidance for early frames (where the conditioning frame already provides strong signal) and higher guidance for later frames (where the model needs more steering).
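A linearly increasing guidance schedule takes only a few lines. In this sketch the per-frame predictions are plain floats standing in for tensors, and the scale range of 1.0 to 3.0 is an illustrative choice rather than a value taken from the paper.

```python
from typing import List, Sequence

def linear_guidance_scales(num_frames: int, min_scale: float, max_scale: float) -> List[float]:
    """Guidance scale for each frame, rising linearly from min_scale to max_scale."""
    if num_frames == 1:
        return [max_scale]
    step = (max_scale - min_scale) / (num_frames - 1)
    return [min_scale + i * step for i in range(num_frames)]

def apply_guidance(uncond: Sequence[float], cond: Sequence[float],
                   scales: Sequence[float]) -> List[float]:
    """Classifier-free guidance with a per-frame scale: u + w_t * (c - u)."""
    return [u + w * (c - u) for u, c, w in zip(uncond, cond, scales)]

scales = linear_guidance_scales(14, min_scale=1.0, max_scale=3.0)
print([round(s, 2) for s in scales])
```

With a scale of 1.0 the first frame is just the conditional prediction, while later frames are pushed progressively harder toward the conditioning signal.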

14 vs. 25 frames: SVD trains two image-to-video models: one generating 14 frames and one generating 25 frames. Both use 576×1024 resolution. The 25-frame model produces smoother, longer videos that are preferred by human evaluators over commercial systems like GEN-2 and PikaLabs. Frame interpolation (Section 4.4 of the paper) can further increase frame rate by 4×.

Camera Motion Control

SVD's learned motion prior is so strong that specific camera motions can be controlled via lightweight LoRA modules trained on small datasets with specific motion metadata. Three LoRAs are demonstrated: horizontal panning, zooming, and static camera. Each LoRA is trained only on the temporal attention blocks and can be efficiently plugged in at inference time.
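For readers unfamiliar with LoRA, here is a generic low-rank adapter on a single linear layer. It shows the standard LoRA update, W + (alpha / r) · B · A, and is not SVD's camera-control code; the layer size, rank, and alpha are arbitrary.

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer plus a trainable low-rank update."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float,
                 rng: np.random.Generator):
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen base weights
        self.A = 0.01 * rng.standard_normal((rank, d_in))   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, zero at init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self) -> int:
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
base_weight = rng.standard_normal((64, 64))
layer = LoRALinear(base_weight, rank=4, alpha=4.0, rng=rng)
x = rng.standard_normal((2, 64))

print(layer.num_trainable(), base_weight.size)  # adapter vs. full layer parameters
```

Because B starts at zero, the adapted layer initially behaves exactly like the base layer, and the adapter can be added or removed at inference time without touching the base weights.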

Why does SVD add noise to the conditioning image before concatenating it to the model input?

Without noise augmentation, the model overfits to the exact pixel values of the conditioning frame; noise teaches it to use the image as a flexible guide, producing more natural motion
To make training faster
To reduce memory usage during inference

Chapter 6: Multi-View Generation

Here is a surprising application: a video diffusion model can generate multiple consistent views of a 3D object. Think about it — a video of a camera orbiting an object is essentially a multi-view sequence. If the model has learned a strong 3D prior from watching videos of the real world, you can fine-tune it to produce orbital camera paths around objects.

From Video to 3D

SVD's image-to-video model is fine-tuned on multi-view datasets to create SVD-MV. The training data comes from two sources:

Objaverse: 150K curated synthetic 3D objects rendered as 360-degree orbital videos (21 frames per object). The model is additionally conditioned on the elevation angle.
MVImgNet: ~200K casually captured multi-view videos of real household objects.
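An orbital rendering path of the kind described for Objaverse can be sketched as evenly spaced camera positions on a circle. The 21 views come from the text; the elevation, radius, and z-up coordinate convention below are assumptions for illustration.

```python
import math
from typing import List, Tuple

def orbit_cameras(num_views: int = 21, elevation_deg: float = 10.0,
                  radius: float = 1.5) -> List[Tuple[float, float, float]]:
    """Camera positions evenly spaced in azimuth at a fixed elevation (z is up)."""
    elevation = math.radians(elevation_deg)
    cameras = []
    for i in range(num_views):
        azimuth = 2.0 * math.pi * i / num_views
        x = radius * math.cos(elevation) * math.cos(azimuth)
        y = radius * math.cos(elevation) * math.sin(azimuth)
        z = radius * math.sin(elevation)
        cameras.append((x, y, z))
    return cameras

cameras = orbit_cameras()
print(len(cameras), cameras[0])
```

Rendered in order, these views form a short video of the camera circling the object, which is why a video model is a natural starting point. The elevation is the extra scalar the model is conditioned on.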

Why Video Priors Help

The paper runs a clean ablation comparing three initializations for multi-view fine-tuning:

Scratch-MV: Random initialization — no prior
SD2.1-MV: Initialized from image model (Stable Diffusion 2.1) — image prior only
SVD-MV: Initialized from SVD — full video prior

SVD-MV crushes both alternatives on all metrics (PSNR, LPIPS, CLIP-S) on the GSO test dataset. It even outperforms dedicated novel-view synthesis methods like Zero123XL and SyncDreamer, despite training for only ~12K steps (16 hours on 8 A100s) compared to days of training for specialized methods.

The video prior advantage: After just 1K fine-tuning iterations, SVD-MV already has better CLIP similarity and PSNR than the image-prior and no-prior models at convergence. The video pretraining taught the model about 3D consistency and view-dependent appearance — knowledge that transfers directly to multi-view synthesis. This is a compelling argument that video models are not just video generators, but general-purpose 3D-aware models.

Method	LPIPS ↓	PSNR ↑	CLIP-S ↑
SyncDreamer	0.18	15.29	0.88
Zero123XL	0.20	14.51	0.87
Scratch-MV	0.22	14.20	0.76
SD2.1-MV	0.18	15.06	0.83
SVD-MV	0.14	16.83	0.89
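To put the PSNR column in perspective, here is the standard definition together with a rough reading of the gap between SVD-MV and SD2.1-MV. The conversion from a dB gap to an error ratio treats each score as if it came from a single mean squared error, which is a simplification since benchmark PSNR is usually averaged over images.

```python
import math
from typing import Sequence

def psnr(reference: Sequence[float], prediction: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_from_db_gap(gap_db: float) -> float:
    """How many times smaller the MSE is for a model that is gap_db better in PSNR."""
    return 10.0 ** (gap_db / 10.0)

gap = 16.83 - 15.06  # SVD-MV vs. SD2.1-MV, from the table
print(round(psnr([0.0, 0.0], [0.1, 0.1]), 2))
print(round(mse_ratio_from_db_gap(gap), 2))
```

Under that simplification, a gap of under 2 dB corresponds to roughly 1.5× lower squared error, because PSNR is a logarithmic scale.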

Why does initializing from a video model (SVD-MV) dramatically outperform initializing from an image model (SD2.1-MV) for multi-view generation?

Video pretraining teaches the model about 3D consistency and view-dependent appearance through natural videos; this knowledge transfers directly to multi-view synthesis, giving SVD-MV a massive head start
The video model has more parameters
The video model trains for more total steps

Chapter 7: Results

SVD achieves state-of-the-art results across multiple tasks. Let's look at the numbers and what they mean.

Zero-Shot Text-to-Video (UCF-101)

The base model (Stage II output) achieves an FVD of 242.02 on UCF-101, well ahead of all prior methods. For context: CogVideo scored 701.59, the previous Video LDM scored 550.61, Make-A-Video scored 367.23, and the previous best, PYOCO, scored 355.20. SVD's FVD is about 32% lower than the best prior result.
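The size of the improvement is easiest to judge as a relative reduction. The sketch below uses only FVD values quoted in this chapter, including the PYOCO score from the quiz at the end of the chapter.

```python
# Zero-shot UCF-101 FVD scores as quoted in this chapter (lower is better).
prior_fvd = {
    "CogVideo": 701.59,
    "Video LDM": 550.61,
    "Make-A-Video": 367.23,
    "PYOCO": 355.20,
}
SVD_FVD = 242.02

def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline

best_prior = min(prior_fvd, key=prior_fvd.get)
for name, score in sorted(prior_fvd.items(), key=lambda item: item[1]):
    print(f"{name:13s} {score:7.2f}  SVD is {relative_reduction(score, SVD_FVD):.0%} lower")
```

Against the strongest prior entry the reduction is about a third; against the weakest it is about two thirds.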

Image-to-Video

Human preference studies comparing SVD's 25-frame image-to-video model against GEN-2 (Runway) and PikaLabs show clear preference for SVD in visual quality. This is notable because GEN-2 and PikaLabs are commercial products with proprietary training data and compute budgets.

Multi-View Synthesis

SVD-MV achieves the best LPIPS (0.14), PSNR (16.83), and CLIP-S (0.89) on the GSO test set, outperforming Zero123XL and SyncDreamer at a fraction of their compute cost.

FVD Comparison (UCF-101 Zero-Shot)

Lower FVD is better. SVD's base model dramatically outperforms all prior methods on zero-shot text-to-video generation.

Frame Interpolation

SVD can be fine-tuned into a frame interpolation model that predicts 3 intermediate frames between 2 conditioning frames, effectively increasing frame rate by 4×. Remarkably, only ~10K iterations of fine-tuning suffice for good performance, demonstrating the strength of the learned motion prior.
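The frame bookkeeping behind the 4× figure is shown below. This covers only the counting and timing, not the learned interpolation itself, and the 6 FPS input rate is an arbitrary example.

```python
from typing import List

def interpolated_length(num_keyframes: int, inserted_per_gap: int = 3) -> int:
    """Total frames after inserting new frames into every gap between keyframes."""
    return num_keyframes + (num_keyframes - 1) * inserted_per_gap

def frame_times(num_keyframes: int, fps: float, inserted_per_gap: int = 3) -> List[float]:
    """Timestamps in seconds of all output frames; keyframes keep their original times."""
    new_fps = fps * (inserted_per_gap + 1)
    total = interpolated_length(num_keyframes, inserted_per_gap)
    return [i / new_fps for i in range(total)]

print(interpolated_length(2))     # one gap: 2 keyframes plus 3 new frames
print(interpolated_length(14), interpolated_length(25))
print(frame_times(2, fps=6.0))
```

Three new frames per gap turn every original frame interval into four, so the clip duration stays the same while the frame rate quadruples.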

Practical significance: SVD is not just an academic exercise. The 14-frame and 25-frame image-to-video models were released as open source, making SVD one of the first high-quality open video generation models. The code and weights are available at github.com/Stability-AI/generative-models. This democratized video generation research in a way that closed-source models from commercial labs could not.

What is SVD's FVD score on UCF-101 zero-shot text-to-video, and how does it compare to the next best method?

SVD achieves 242.02, about 32% below the previous best of 355.20 (PYOCO), a large improvement on this benchmark
SVD achieves 500, slightly better than Video LDM
SVD achieves 100, the first model to break triple digits

Chapter 8: Data Matters

This chapter is the heart of SVD's scientific contribution. The paper runs a series of careful ablations isolating the effect of data curation from all other variables. The conclusions are stark.

Ablation 1: Curated vs. Uncurated at Small Scale

Train two identical models on LVD-10M (9.8M uncurated clips) vs. LVD-10M-F (2.3M curated clips). Same architecture, same training hyperparameters, same number of steps. Result: the model trained on the 4× smaller curated dataset is preferred by human evaluators in both quality and prompt alignment.

Ablation 2: LVD-F vs. Existing Datasets

Compare LVD-10M-F against WebVid-10M (the most popular research dataset) and InternVid-10M (specifically filtered for high aesthetics). Despite being 4× smaller than both, LVD-10M-F produces the preferred model. SVD's curation strategy beats both hand-curated and scale-focused alternatives.

Ablation 3: Curation Scales

Does curation still help at larger scale? Train on 50M curated vs. 50M uncurated clips. Yes — the curated model is still clearly preferred. And 50M curated beats 10M curated, confirming that scale and curation are complementary, not substitutes.

Ablation 4: Benefits Persist After Fine-tuning

This is the most surprising result. Take three models that differ only in their Stage II initialization: (1) from image model (no video pretraining), (2) from uncurated video pretraining, (3) from curated video pretraining. Fine-tune all three identically on Stage III data for 50K steps. At 10K steps: curated > uncurated > image-only. At 50K steps: same ranking. The advantages of curated pretraining persist through the entire fine-tuning run.

Why do curation benefits persist? The most likely explanation: curated pretraining teaches the model a better internal representation of motion. A model that learned motion from high-quality, well-captioned, properly segmented clips develops richer motion features than one trained on static scenes and jump cuts. These features form the foundation that Stage III builds on. Fine-tuning can improve output quality, but it can't fundamentally restructure the motion representation learned during pretraining.

What is the most surprising finding from SVD's data curation ablations?

The quality advantages from curated pretraining persist even after 50K steps of fine-tuning on the same high-quality dataset: curated pretraining gives the model a lasting advantage in its motion representation
Larger datasets always produce better models
Curation only helps during pretraining, not after fine-tuning

Chapter 9: Connections

SVD sits at a critical juncture in the video generation timeline. Let's map where it came from, what it enabled, and where the field went next.

Predecessors

Work	Year	Key Contribution	Relation to SVD
LDM / Stable Diffusion	2022	Latent diffusion for image generation	SVD's Stage I foundation — the spatial backbone
Align Your Latents (Video LDM)	2023	Insert temporal layers into image LDM	SVD's architectural template — temporal conv + attention
Make-A-Video	2022	Text-to-image → text-to-video without paired data	Similar staged training; SVD adds systematic curation study
Imagen Video	2022	Cascaded pixel-space video diffusion	Pixel-space alternative; SVD's latent approach is more efficient
EDM (Karras et al.)	2022	Elucidated diffusion framework	SVD adopts EDM for cleaner noise schedule + preconditioning

Successors and Influence

Work	Year	How SVD Influenced It
Stable Video Diffusion XT	2024	Extended SVD to longer, higher-quality videos
SV3D	2024	Built on SVD-MV for 3D generation from single images
Sora (OpenAI)	2024	Validated the three-stage approach at massive scale; added DiT backbone
CogVideoX	2024	Open-source video generation using 3D VAE + expert transformer
Video Foundation Models	2024+	SVD's insight that video models learn 3D priors influenced world model research

Key Lessons for the Field

Data curation is underrated. The research community's focus on architecture may be misallocated. SVD showed that data engineering provides larger returns.
Staged training transfers knowledge. Image → video → HQ is a pattern that generalizes: each stage specializes the model further.
Video models are 3D-aware. SVD-MV proved that video diffusion models implicitly learn 3D structure, opening the door to video-as-3D-prior research.
Open models accelerate progress. By releasing weights, SVD enabled hundreds of downstream applications and research projects.

Looking forward: SVD's three-stage recipe became standard practice for video generation. Sora, CogVideoX, and other 2024 models all use variants of image pretraining → video pretraining → high-quality fine-tuning. The specific architectures evolved (DiT replaced U-Net, 3D VAE replaced frame-wise encoding), but the training philosophy SVD established remains foundational. The lesson is timeless: build your data pipeline before you build your model.

What lasting lesson from SVD has influenced nearly all subsequent video generation work?

The three-stage training recipe (image pretrain → video pretrain on curated data → HQ fine-tune) became the standard approach, and the emphasis on data curation over architecture persisted across all major subsequent models
All subsequent models copied SVD's exact U-Net architecture
The LoRA approach for camera control was the main contribution