Scaling latent video diffusion models to large datasets — systematic data curation matters more than architecture for high-quality video generation.
You can generate stunning images with diffusion models. A single prompt produces a photorealistic 1024-pixel scene in seconds. But now try this: generate a video of the same scene. Twenty-five frames, each at 576×1024. Suddenly everything falls apart.
The first frame looks great. The second is plausible. By frame five, the dog has grown a third ear. By frame fourteen, the background has drifted into a different country. The fundamental problem is temporal coherence: making every frame look individually good is hard enough, but making them look good together — maintaining consistent objects, lighting, and motion across time — is a much harder problem.
Why is video so much harder than images? Three reasons:
Pixel count, and with it modeling difficulty, grows linearly with the number of frames, and each frame must be consistent with all the others.
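To make the scale concrete, the arithmetic behind the opening example can be sketched as follows. The helper name `pixel_count` is ours, introduced only for illustration.

```python
def pixel_count(frames: int, height: int, width: int) -> int:
    """Number of pixels in a clip of `frames` frames at height x width."""
    return frames * height * width

# A single 1024x1024 image versus a 25-frame clip at 576x1024.
image_pixels = pixel_count(1, 1024, 1024)
video_pixels = pixel_count(25, 576, 1024)

print(image_pixels)                 # 1048576
print(video_pixels)                 # 14745600
print(video_pixels / image_pixels)  # 14.0625
```

The clip holds about 14 times as many pixels as the image, and unlike 14 independent images, all of them have to agree with each other.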
Prior to SVD, most video generation research focused on architectural innovations: what kind of temporal layers to add, how to arrange spatial and temporal attention, whether to use 3D convolutions or factored 2D+1D. The implicit assumption was that the model design is what matters most.
SVD flips this assumption. The paper's central finding: systematic data curation is more important than model architecture. A simple architecture trained on well-curated data consistently beats a sophisticated architecture trained on uncurated data.
The paper identifies two key principles:
This insight has profound implications. While the research community was racing to invent better temporal attention mechanisms, SVD showed that you could use a relatively standard architecture (from Align Your Latents, Blattmann et al. 2023) and focus your effort on building a better dataset pipeline. The result: state-of-the-art image-to-video generation, competitive text-to-video, and a model that transfers to multi-view 3D synthesis.
The raw internet is full of video, but most of it is terrible training data. Static security camera footage. Slideshows with no motion. Videos with burned-in text overlays. Clips with jarring cuts in the middle. How do you turn 580 million raw video clips into a clean dataset for learning motion?
SVD introduces a systematic curation pipeline with four stages:
Raw videos often contain hidden cuts — scene transitions, fades, jump cuts that don't appear in metadata. SVD applies a cascaded cut detection pipeline at three different FPS levels, revealing roughly 4× more cuts than metadata suggests. After splitting, the average clips-per-video jumps from ~2.65 to ~11.09.
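A toy sketch of why running a detector at several frame rates helps: a hard cut shows up between adjacent frames, while a slow fade only produces a large jump when frames are compared several steps apart. Everything here is an assumption made for illustration (the per-frame scalar signal, the threshold, the strides, and the function names); it is not the detector used in the paper.

```python
from typing import List, Sequence, Tuple

def detect_cuts(signal: Sequence[float], stride: int, threshold: float) -> List[int]:
    """Subsample every `stride`-th frame and flag jumps between consecutive samples."""
    idx = list(range(0, len(signal), stride))
    return [idx[k] for k in range(1, len(idx))
            if abs(signal[idx[k]] - signal[idx[k - 1]]) > threshold]

def cascaded_cuts(signal: Sequence[float],
                  strides: Tuple[int, ...] = (1, 2, 4),
                  threshold: float = 0.6) -> List[int]:
    """Union the detections from all strides, then merge near-duplicates."""
    candidates = sorted({c for s in strides for c in detect_cuts(signal, s, threshold)})
    merged: List[int] = []
    for c in candidates:
        # Detections closer than the largest stride describe the same cut.
        if not merged or c - merged[-1] >= max(strides):
            merged.append(c)
    return merged

def split_into_clips(num_frames: int, cuts: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn cut positions into half-open (start, end) frame ranges."""
    bounds = [0] + list(cuts) + [num_frames]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]

# Hypothetical per-frame feature: a hard cut at frame 5, then a slow fade.
signal = [0.0] * 5 + [1.0] * 5 + [1.25, 1.5, 1.75, 2.0] + [2.0] * 6

print(detect_cuts(signal, stride=1, threshold=0.6))   # [5]
print(cascaded_cuts(signal))                          # [5, 12]
print(split_into_clips(len(signal), cascaded_cuts(signal)))
```

At full frame rate only the hard cut is found; the cascade also catches the fade, so the same video yields three clips instead of two.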
Each clip gets three captions: (1) CoCa image captioner on the middle frame, (2) V-BLIP video captioner for temporal descriptions, and (3) an LLM-based summary merging both. This multi-source approach captures both static scene content and dynamic motion.
Every clip is annotated with dense optical flow (at 2 FPS), OCR text detection, CLIP embeddings on first/middle/last frames, and aesthetic scores. These annotations become the filtering criteria.
This is where the magic happens. For each annotation type, SVD systematically removes the bottom 12.5%, 25%, and 50% of clips, trains a model on each subset, and runs human preference studies to find the optimal threshold. The result: the Large Video Dataset (LVD) of 580M clips is filtered down to LVD-F of 152M clips.
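The threshold sweep can be pictured as a simple rank-and-drop filter applied at each candidate fraction. The sketch below assumes clips are dictionaries carrying a numeric score; the field name `flow` and the function name are hypothetical, and the training and human-preference steps that pick the winning threshold are not modeled.

```python
from typing import Dict, List

def drop_bottom_fraction(clips: List[Dict], score_key: str, fraction: float) -> List[Dict]:
    """Keep every clip except the lowest-scoring `fraction` of the dataset."""
    ranked = sorted(clips, key=lambda clip: clip[score_key])
    n_drop = int(len(ranked) * fraction)
    return ranked[n_drop:]

# Hypothetical annotated clips with an optical-flow score.
clips = [{"id": i, "flow": float(i)} for i in range(16)]

# Candidate subsets for the sweep: drop the bottom 12.5%, 25%, and 50%.
subsets = {f: drop_bottom_fraction(clips, "flow", f) for f in (0.125, 0.25, 0.5)}
for fraction, kept in subsets.items():
    print(fraction, len(kept))

# Overall retention reported for the full pipeline: 152M of 580M clips.
print(round(152 / 580, 3))  # 0.262
```

Each subset would then train its own model, and the fraction whose model wins the preference study becomes the threshold for that annotation type.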
SVD identifies three distinct training stages, each serving a different purpose. Think of it as building a house: the foundation (image understanding), the structure (motion understanding), and the finishing (high-quality output).
Start with Stable Diffusion 2.1 — a powerful text-to-image model already trained on billions of images. This gives the model a strong visual representation: it understands objects, scenes, lighting, texture, and composition. The spatial layers (2D convolutions, spatial attention) come fully trained.
The ablation is decisive: a video model initialized from SD 2.1 clearly outperforms one with randomly initialized spatial weights in both visual quality and prompt alignment.
Insert temporal layers (temporal convolutions + temporal attention) after every spatial layer, then train on the curated LVD-F dataset. This stage runs at 256×384 resolution with 14 frames using the EDM noise schedule. The model learns its core motion representation here: how objects move, how cameras pan, what physically plausible motion looks like.
Crucially, the full model is fine-tuned (not just temporal layers), and the noise schedule is shifted toward higher noise values — essential for later high-resolution fine-tuning.
Fine-tune on a small (~250K clips), highly curated dataset of visually stunning videos at 576×1024 resolution. The noise schedule shifts further toward higher noise. This stage runs for 50K iterations and produces the final model.
SVD's architecture is deliberately simple — the paper's thesis is that data matters more than architecture. It builds on the Video LDM framework from Align Your Latents (Blattmann et al., 2023) with a few important modifications.
Like Stable Diffusion, SVD operates in latent space. A pretrained autoencoder compresses each video frame from pixel space (3×H×W) into a latent representation, downsampling by 8× along each spatial dimension. The diffusion model operates on these latents, and the decoder reconstructs pixel frames. The network therefore processes ~64× fewer spatial positions per frame than pixel-space diffusion would.
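A quick shape calculation shows where the ~64× figure comes from. It assumes a Stable Diffusion style autoencoder with 8× downsampling per spatial side and 4 latent channels; the function name is ours.

```python
from typing import Tuple

def latent_shape(height: int, width: int,
                 downsample: int = 8, latent_channels: int = 4) -> Tuple[int, int, int]:
    """Latent (channels, height, width) for one frame, under the assumptions above."""
    return (latent_channels, height // downsample, width // downsample)

height, width = 576, 1024
c, h, w = latent_shape(height, width)
print((c, h, w))                      # (4, 72, 128)

spatial_ratio = (height * width) / (h * w)
value_ratio = (3 * height * width) / (c * h * w)
print(spatial_ratio)                  # 64.0
print(value_ratio)                    # 48.0
```

Each frame has 64× fewer spatial positions in latent space; counting channels as well (3 in, 4 out), the number of stored values drops by 48×.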
The key modification to turn an image U-Net into a video U-Net: after every spatial convolution block, insert a temporal convolution (1D conv along the time axis). After every spatial attention block, insert a temporal attention block. The temporal convolutions capture local motion patterns (frame-to-frame changes), while temporal attention captures longer-range dependencies (an object that disappears and reappears).
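The split of responsibilities can be sketched in plain Python: a spatial step that only looks within a frame, followed by a temporal step that only looks across frames at a fixed position. The smoothing kernels stand in for learned layers, and the fixed blend weight `alpha` is an illustrative stand-in for however the real network merges the two paths.

```python
from typing import List, Sequence

Frame = List[float]  # one frame, flattened to a list of feature values

def spatial_layer(frame: Frame) -> Frame:
    """Stand-in spatial block: mixes neighbouring positions within ONE frame."""
    n = len(frame)
    return [(frame[max(i - 1, 0)] + frame[i] + frame[min(i + 1, n - 1)]) / 3
            for i in range(n)]

def temporal_conv(video: List[Frame],
                  kernel: Sequence[float] = (0.25, 0.5, 0.25)) -> List[Frame]:
    """1D convolution along time, applied independently at each position (edge-padded)."""
    t = len(video)
    out = []
    for ti in range(t):
        prev_f, cur_f, next_f = video[max(ti - 1, 0)], video[ti], video[min(ti + 1, t - 1)]
        out.append([kernel[0] * p + kernel[1] * c + kernel[2] * n
                    for p, c, n in zip(prev_f, cur_f, next_f)])
    return out

def video_block(video: List[Frame], alpha: float = 0.5) -> List[Frame]:
    """Spatial step per frame, temporal step across frames, then a blend of both."""
    per_frame = [spatial_layer(f) for f in video]   # frames never see each other
    mixed = temporal_conv(per_frame)                # information flows across time
    return [[alpha * s + (1 - alpha) * m for s, m in zip(sf, mf)]
            for sf, mf in zip(per_frame, mixed)]

# Only the middle frame contains any signal.
video = [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
print([spatial_layer(f) for f in video])   # neighbours of the middle frame stay zero
print(temporal_conv([spatial_layer(f) for f in video]))
```

After the spatial step the first and last frames are still all zeros; only the temporal step lets the middle frame's content reach them.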
SVD adopts the EDM (Elucidated Diffusion Model) framework from Karras et al. (2022) instead of the original DDPM formulation. EDM provides a cleaner noise schedule parameterization and better training dynamics. Critically, SVD shifts the noise schedule toward higher noise values for high-resolution training: at higher resolution, neighboring pixels carry more redundant information, so a given noise level destroys less of the signal and the model needs to see noisier inputs during fine-tuning.
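In EDM, training noise levels are drawn from a log-normal distribution, and the network is wrapped in noise-dependent scaling factors. The sketch below follows the formulas in Karras et al. (2022); the base values `p_mean=-1.2`, `p_std=1.2` are that paper's defaults, while the "shifted" values are purely illustrative and not taken from the source.

```python
import math
import random
import statistics

def sample_sigma(rng: random.Random, p_mean: float, p_std: float) -> float:
    """Draw a training noise level with ln(sigma) ~ Normal(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))

def edm_preconditioning(sigma: float, sigma_data: float = 0.5):
    """EDM scaling factors (c_skip, c_out, c_in, c_noise) for noise level sigma."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    c_noise = math.log(sigma) / 4.0
    return c_skip, c_out, c_in, c_noise

rng = random.Random(0)
base = [sample_sigma(rng, p_mean=-1.2, p_std=1.2) for _ in range(2000)]
# Illustrative shift toward higher noise (values are assumptions).
shifted = [sample_sigma(rng, p_mean=1.0, p_std=1.6) for _ in range(2000)]

print(statistics.median(base), statistics.median(shifted))
print(edm_preconditioning(sigma=0.5))
```

Raising `p_mean` moves the whole distribution of training noise levels upward, which is what "shifting the schedule toward higher noise" amounts to in this parameterization.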
SVD conditions the model on frame rate as a micro-conditioning signal, similar to how Stable Diffusion XL conditions on image resolution. This lets the model learn that fast motion (high FPS) and slow motion (low FPS) are different generation modes, not conflicting training signals.
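One common way to feed a scalar such as frame rate into a diffusion U-Net is to embed it sinusoidally and add it to the timestep embedding. The sketch below assumes that mechanism; the text above does not say how the signal is injected, and the function names and embedding size are ours.

```python
import math
from typing import List

def sinusoidal_embedding(value: float, dim: int = 8, max_period: float = 10000.0) -> List[float]:
    """Standard sinusoidal embedding of a scalar: cosines followed by sines."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(value * f) for f in freqs] + [math.sin(value * f) for f in freqs]

def conditioning_vector(timestep: float, fps: float, dim: int = 8) -> List[float]:
    """Add the frame-rate embedding to the timestep embedding (assumed mechanism)."""
    t_emb = sinusoidal_embedding(timestep, dim)
    fps_emb = sinusoidal_embedding(fps, dim)
    return [a + b for a, b in zip(t_emb, fps_emb)]

slow = conditioning_vector(timestep=100.0, fps=6.0)
fast = conditioning_vector(timestep=100.0, fps=24.0)
print(slow != fast)  # True: same timestep, different frame-rate signal
```

Because the frame rate reaches the network as an explicit input, a 6 FPS clip and a 24 FPS clip of the same scene no longer look like contradictory examples of how fast things move.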
The architecture interleaves spatial and temporal layers. Spatial layers process each frame independently; temporal layers mix information across frames.
Text-to-video is impressive, but image-to-video is arguably more useful: you provide a single image, and the model generates a video that brings it to life. The input image becomes the first frame, and the model hallucinates plausible motion forward in time.
SVD converts its text-to-video base model into an image-to-video model with two modifications:
The conditioning frame gets noise added to it before concatenation. This is a crucial trick from the cascaded diffusion literature (Ho et al., 2022). Without noise augmentation, the model overfits to the exact pixel values of the conditioning frame. With noise augmentation, it learns to use the frame as a guide rather than a constraint, producing more natural motion.
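A minimal sketch of noise augmentation, assuming the conditioning frame is a flat list of values and the noise strength is a single scalar chosen per sample. How the strength is sampled and whether it is also passed to the model are details this sketch does not cover.

```python
import random
from typing import List

def noise_augment(frame: List[float], strength: float, rng: random.Random) -> List[float]:
    """Add Gaussian noise of the given strength to the conditioning frame."""
    return [x + strength * rng.gauss(0.0, 1.0) for x in frame]

rng = random.Random(0)
conditioning_frame = [0.2, 0.5, 0.8, 0.5]  # hypothetical latent values

clean = noise_augment(conditioning_frame, strength=0.0, rng=rng)
noisy = noise_augment(conditioning_frame, strength=0.1, rng=rng)

print(clean == conditioning_frame)  # True: zero strength leaves the frame unchanged
print(noisy)
```

With a nonzero strength the model never sees the exact conditioning values during training, so it cannot simply copy them into every output frame.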
Standard classifier-free guidance uses a constant guidance scale across all frames. SVD found this causes problems: too little guidance makes early frames inconsistent with the conditioning image; too much causes oversaturation in later frames. The solution: linearly increase the guidance scale along the frame axis — low guidance for early frames (where the conditioning frame already provides strong signal) and higher guidance for later frames (where the model needs more steering).
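The linear schedule is easy to write down: build one guidance scale per frame, then apply the usual classifier-free guidance formula frame by frame. The endpoint values below are illustrative assumptions, and the per-frame predictions are stand-in lists rather than real network outputs.

```python
from typing import List

def linear_guidance_scales(num_frames: int, w_min: float, w_max: float) -> List[float]:
    """Guidance scale per frame, rising linearly from w_min to w_max."""
    if num_frames == 1:
        return [w_min]
    step = (w_max - w_min) / (num_frames - 1)
    return [w_min + i * step for i in range(num_frames)]

def guided_prediction(cond: List[List[float]], uncond: List[List[float]],
                      scales: List[float]) -> List[List[float]]:
    """Classifier-free guidance with a separate scale for each frame."""
    return [[u + w * (c - u) for c, u in zip(cond_f, uncond_f)]
            for cond_f, uncond_f, w in zip(cond, uncond, scales)]

scales = linear_guidance_scales(num_frames=5, w_min=1.0, w_max=3.0)
print(scales)  # [1.0, 1.5, 2.0, 2.5, 3.0]

# Stand-in predictions: 5 frames, 2 values each.
cond = [[2.0, 2.0]] * 5
uncond = [[1.0, 1.0]] * 5
print(guided_prediction(cond, uncond, scales))
```

The first frame uses the conditional prediction as is (scale 1.0), while the last frame is pushed three times as far from the unconditional prediction.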
SVD's learned motion prior is so strong that specific camera motions can be controlled via lightweight LoRA modules trained on small datasets with specific motion metadata. Three LoRAs are demonstrated: horizontal panning, zooming, and static camera. Each LoRA is trained only on the temporal attention blocks and can be efficiently plugged in at inference time.
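For readers unfamiliar with LoRA, the core idea is a low-rank update added to a frozen weight matrix: W' = W + (alpha / r) · B · A. The sketch below uses the standard LoRA formulation with tiny hand-written matrices; the shapes, values, and helper names are illustrative, and nothing here is specific to SVD's attention blocks.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float) -> Matrix:
    """Return weight + (alpha / rank) * B @ A, leaving the base weight untouched."""
    rank = len(lora_a)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

base_weight = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
lora_a = [[1.0, 2.0]]                    # rank 1: shape (1, 2)
lora_b = [[0.5], [0.0]]                  # shape (2, 1)

print(merge_lora(base_weight, lora_b, lora_a, alpha=1.0))  # [[1.5, 1.0], [0.0, 1.0]]
```

Only `lora_a` and `lora_b` are trained, which is why a camera-motion LoRA can be learned from a small dataset and swapped in or out at inference time.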
Here is a surprising application: a video diffusion model can generate multiple consistent views of a 3D object. Think about it — a video of a camera orbiting an object is essentially a multi-view sequence. If the model has learned a strong 3D prior from watching videos of the real world, you can fine-tune it to produce orbital camera paths around objects.
SVD's image-to-video model is fine-tuned on multi-view datasets to create SVD-MV. The training data comes from two sources:
The paper runs a clean ablation comparing three initializations for multi-view fine-tuning:
SVD-MV crushes both alternatives on all metrics (PSNR, LPIPS, CLIP-S) on the GSO test dataset. It even outperforms dedicated novel-view synthesis methods like Zero123XL and SyncDreamer, despite training for only ~12K steps (16 hours on 8 A100s) compared to days of training for specialized methods.
| Method | LPIPS ↓ | PSNR ↑ | CLIP-S ↑ |
|---|---|---|---|
| SyncDreamer | 0.18 | 15.29 | 0.88 |
| Zero123XL | 0.20 | 14.51 | 0.87 |
| Scratch-MV | 0.22 | 14.20 | 0.76 |
| SD2.1-MV | 0.18 | 15.06 | 0.83 |
| SVD-MV | 0.14 | 16.83 | 0.89 |
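
To help read the PSNR column, here is the standard definition as a small function. It assumes pixel values in [0, 1] and flat lists of values; the evaluation code behind the table is not shown in the source and may differ in such details.

```python
import math
from typing import List

def psnr(pred: List[float], target: List[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

target = [0.6, 0.6, 0.6, 0.6]
pred = [0.5, 0.5, 0.5, 0.5]   # off by 0.1 everywhere
print(psnr(pred, target))     # approximately 20 dB
print(psnr(target, target))   # inf
```

Under these assumptions, a PSNR near 17 dB corresponds to a root-mean-square error of about 0.14 on the [0, 1] scale, so the absolute numbers in the table reflect how hard single-image novel-view synthesis is; the comparison between rows is what matters.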
SVD achieves state-of-the-art results across multiple tasks. Let's look at the numbers and what they mean.
The base model (Stage II output) achieves an FVD of 242.02 on UCF-101, well ahead of the prior methods listed here. For context: CogVideo scored 701.59, Make-A-Video scored 367.23, and the previous Video LDM scored 550.61. SVD's score is roughly a third lower than the best of these (Make-A-Video) and less than half of Video LDM's.
Human preference studies comparing SVD's 25-frame image-to-video model against GEN-2 (Runway) and PikaLabs show clear preference for SVD in visual quality. This is notable because GEN-2 and PikaLabs are commercial products with proprietary training data and compute budgets.
SVD-MV achieves the best LPIPS (0.14), PSNR (16.83), and CLIP-S (0.89) on the GSO test set, outperforming Zero123XL and SyncDreamer at a fraction of their compute cost.
Lower FVD is better. SVD's base model dramatically outperforms all prior methods on zero-shot text-to-video generation.
SVD can be fine-tuned into a frame interpolation model that predicts 3 intermediate frames between 2 conditioning frames, effectively increasing frame rate by 4×. Remarkably, only ~10K iterations of fine-tuning suffice for good performance, demonstrating the strength of the learned motion prior.
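The 4× figure follows from counting frames: every gap between two neighbouring frames gains three new ones. A short check, with a function name of our choosing:

```python
def interpolated_length(num_frames: int, inserted_per_gap: int = 3) -> int:
    """Frame count after inserting `inserted_per_gap` frames into every gap."""
    return num_frames + inserted_per_gap * (num_frames - 1)

for n in (2, 14, 25):
    out = interpolated_length(n)
    print(n, out, round(out / n, 2))
# 2 -> 5, 14 -> 53, 25 -> 97
```

The clip covers the same time span with four frame intervals where there used to be one, so the frame rate is 4× higher even though the total frame count is 4N - 3 rather than exactly 4N.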
This chapter is the heart of SVD's scientific contribution. The paper runs a series of careful ablations isolating the effect of data curation from all other variables. The conclusions are stark.
Train two identical models on LVD-10M (9.8M uncurated clips) vs. LVD-10M-F (2.3M curated clips). Same architecture, same training hyperparameters, same number of steps. Result: the model trained on the 4× smaller curated dataset is preferred by human evaluators in both quality and prompt alignment.
Compare LVD-10M-F against WebVid-10M (the most popular research dataset) and InternVid-10M (specifically filtered for high aesthetics). Despite being 4× smaller than both, LVD-10M-F produces the preferred model. SVD's curation strategy beats both hand-curated and scale-focused alternatives.
Does curation still help at larger scale? Train on 50M curated vs. 50M uncurated clips. Yes — the curated model is still clearly preferred. And 50M curated beats 10M curated, confirming that scale and curation are complementary, not substitutes.
This is the most surprising result. Take three models that differ only in the weights they bring into Stage III: (1) from the image model (no video pretraining), (2) from uncurated video pretraining, (3) from curated video pretraining. Fine-tune all three identically on Stage III data for 50K steps. At 10K steps: curated > uncurated > image-only. At 50K steps: same ranking. The advantage of curated pretraining persists through the entire fine-tuning run instead of washing out.
Comparing models initialized from different pretraining data during Stage III fine-tuning shows that the quality differences persist even after extensive fine-tuning on the same HQ data.
SVD sits at a critical juncture in the video generation timeline. Let's map where it came from, what it enabled, and where the field went next.
| Work | Year | Key Contribution | Relation to SVD |
|---|---|---|---|
| LDM / Stable Diffusion | 2022 | Latent diffusion for image generation | SVD's Stage I foundation — the spatial backbone |
| Align Your Latents (Video LDM) | 2023 | Insert temporal layers into image LDM | SVD's architectural template — temporal conv + attention |
| Make-A-Video | 2022 | Text-to-image → text-to-video without paired data | Similar staged training; SVD adds systematic curation study |
| Imagen Video | 2022 | Cascaded pixel-space video diffusion | Pixel-space alternative; SVD's latent approach is more efficient |
| EDM (Karras et al.) | 2022 | Elucidated diffusion framework | SVD adopts EDM for cleaner noise schedule + preconditioning |

| Work | Year | How SVD Influenced It |
|---|---|---|
| Stable Video Diffusion XT | 2024 | Extended SVD to longer, higher-quality videos |
| SV3D | 2024 | Built on SVD-MV for 3D generation from single images |
| Sora (OpenAI) | 2024 | Validated the three-stage approach at massive scale; added DiT backbone |
| CogVideoX | 2024 | Open-source video generation using 3D VAE + expert transformer |
| Video Foundation Models | 2024+ | SVD's insight that video models learn 3D priors influenced world model research |