A single image is a frozen moment. Video is a living story — full of motion, rhythm, and temporal structure. How do we teach neural networks to see through time?
You already know how to classify images. A CNN looks at a 3×H×W tensor (three color channels, height, width) and outputs a label: "cat," "truck," "airplane." But the real world isn't frozen. A person isn't just standing — they're running, jumping, throwing. To recognize actions, you need to see how things change over time.
A video is just a sequence of images — frames — stacked along a new axis: time. So instead of a 3D tensor (3×H×W), a video clip is a 4D tensor: T×3×H×W, where T is the number of frames.
That extra dimension — time — is what makes video understanding both powerful and brutally expensive.
Videos are recorded at roughly 30 frames per second. A single minute of uncompressed HD video (1920×1080) weighs about 11 GB. Even standard definition (640×480) is ~1.7 GB per minute. You cannot feed raw, full-length video into a neural network: the input alone would exhaust GPU memory, before counting any activations.
One HD frame: 1920 × 1080 × 3 bytes = 6.2 MB. At 30 fps, one second = 186 MB. One minute = 11.2 GB. Compare to a single ImageNet image: 224 × 224 × 3 = 150 KB. A one-minute video is roughly 75,000× larger than one ImageNet image.
The solution: train on short clips at low frame rate and low spatial resolution. A typical setup: T=16 frames sampled at 5 fps, with H=W=112. That's 3.2 seconds of content in only 588 KB — manageable.
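The arithmetic above is easy to check. The sketch below assumes one byte per color channel and no compression; the helper names are made up for illustration.

```python
BYTES_PER_PIXEL = 3  # assumption: one byte per RGB channel, uncompressed

def frame_bytes(height, width):
    """Size of one uncompressed RGB frame in bytes."""
    return height * width * BYTES_PER_PIXEL

def video_bytes(height, width, fps, seconds):
    """Size of an uncompressed video of the given duration."""
    return frame_bytes(height, width) * fps * seconds

hd_frame = frame_bytes(1080, 1920)             # 6,220,800 bytes
hd_minute = video_bytes(1080, 1920, 30, 60)    # 11,197,440,000 bytes
imagenet_image = frame_bytes(224, 224)         # 150,528 bytes
clip = 16 * frame_bytes(112, 112)              # 602,112 bytes

print(f"HD frame:  {hd_frame / 1e6:.1f} MB")
print(f"HD minute: {hd_minute / 1e9:.1f} GB")
print(f"Ratio to one ImageNet image: {hd_minute / imagenet_image:,.0f}x")
print(f"16x112x112 clip: {clip / 1024:.0f} KiB covering {16 / 5:.1f} s at 5 fps")
```

Note that the 588 KB figure for the clip is in binary kilobytes (602,112 / 1024), while the MB and GB figures are decimal.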
During training, you sample random short clips from long videos. Each clip gets a label (the action class for that segment). During testing, you run the model on multiple clips from the same video and average the predictions. This "clip-then-aggregate" strategy is universal in video classification.
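Here is a minimal sketch of clip-then-aggregate using only frame indices and probability lists. The function names, the stride of 6 (which turns 30 fps into 5 fps), and the evenly spaced test clips are illustrative assumptions, not a prescribed recipe.

```python
import random

def sample_clip_indices(num_frames, clip_len=16, stride=6, rng=random):
    """Pick a random training clip: clip_len frame indices spaced `stride` apart."""
    span = (clip_len - 1) * stride + 1           # frames covered by one clip
    if num_frames < span:
        raise ValueError("video is shorter than one clip")
    start = rng.randrange(num_frames - span + 1)
    return [start + i * stride for i in range(clip_len)]

def eval_clip_starts(num_frames, num_clips, clip_len=16, stride=6):
    """Evenly spaced clip start indices covering the whole video at test time."""
    span = (clip_len - 1) * stride + 1
    last = num_frames - span
    if num_clips == 1:
        return [last // 2]
    return [round(i * last / (num_clips - 1)) for i in range(num_clips)]

def average_predictions(clip_probs):
    """Average per-clip class probabilities into one video-level prediction."""
    n = len(clip_probs)
    return [sum(p[c] for p in clip_probs) / n for c in range(len(clip_probs[0]))]

rng = random.Random(0)
idx = sample_clip_indices(num_frames=300, rng=rng)   # one random training clip
# Hypothetical per-clip outputs of a 2-class model on three test clips.
video_probs = average_predictions([[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]])
print(idx[0], idx[-1], video_probs)
```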
Given a clip of T frames, how should a neural network process the temporal dimension? Should it look at frames independently? Should it fuse them early, late, or gradually? Should it use 2D convolutions, 3D convolutions, or attention? Every architecture in this lecture answers this question differently.
Given a video clip (T × 3 × H × W), predict an action label such as "running," "swimming," or "playing guitar." This is the video analog of image classification, but the label depends on temporal patterns, not just appearance.
| Property | Image Classification | Video Classification |
|---|---|---|
| Input shape | 3 × H × W | T × 3 × H × W |
| Recognizes | Objects (dog, car, tree) | Actions (running, jumping, eating) |
| Key signal | Spatial features | Spatial + temporal features |
| Compute cost | ~4 GFLOPs (ResNet-50) | ~100+ GFLOPs (typical video model) |
| Data size | ~150 KB per image | ~600 KB per clip (low-res) |
Here's a humbling fact: a single-frame CNN — one that classifies each frame independently and averages the predictions — is often a very strong baseline for video classification. On many benchmarks, it performs within a few percentage points of complex temporal models. This tells us that many "action recognition" datasets can be solved mostly by appearance (a person holding a tennis racket is probably playing tennis). True temporal reasoning is harder to benchmark than it sounds.
The simplest approach to video: ignore time entirely. Run a 2D CNN on each frame independently, average the class probabilities, and call it a day. This is the single-frame CNN baseline, and as we just noted, it's embarrassingly competitive.
But if you want to do better, you need to fuse temporal information somehow. The question is when in the network to combine frames. This gives us three strategies, first explored systematically by Karpathy et al. (2014).
Run a 2D CNN on each frame independently to extract per-frame features. Then combine the features at the very end, either by flattening them and feeding them to an MLP, or by average-pooling across space and time.
The advantage: you reuse a standard pretrained 2D CNN (like ResNet). The problem: the CNN never compares frames directly. It can't detect low-level motion patterns because each frame is processed in isolation until the very end.
Stack all T frames along the channel dimension to create a single "fat" image of shape (3T) × H × W. Feed this directly into a standard 2D CNN. The very first convolution layer now has access to all temporal information.
The advantage: the first layer can compare pixels across frames and detect motion. The problem: one convolutional layer may not be enough temporal processing. After the first layer, the network is a regular 2D CNN with no temporal dimension left.
Use 3D convolutions and 3D pooling throughout the network. Each layer operates on a 4D tensor (C × T × H × W), gradually reducing the temporal dimension alongside the spatial dimensions. This is "slow" fusion because temporal information is integrated progressively, layer by layer.
Late fusion: spatial receptive field grows slowly through the network; temporal receptive field jumps to the full clip only at the very end. Early fusion: temporal receptive field covers the full clip immediately at layer 1; spatial grows slowly. Slow fusion (3D CNN): both spatial and temporal receptive fields grow gradually. Each layer sees a slightly larger neighborhood in both space and time. This is the most balanced approach.
Consider a tiny architecture: 2 conv layers + global pool. Input: 3 × 20 × 64 × 64. Conv filter size: 3×3 (2D) or 3×3×3 (3D). Pool: 4×4 (2D) or 4×4×4 (3D).
Late fusion: After Conv1 (3×3), temporal RF = 1 (no temporal mixing). After Pool (4×4), temporal RF = 1. After Conv2 (3×3), temporal RF = 1. After GlobalAvgPool, temporal RF = 20. The network only sees time at the very end.
Early fusion: After Conv1, temporal RF = 20 (all frames stacked in channels). But spatial RF = 3. After that, the temporal dimension is gone — it's a flat 2D feature map.
3D CNN: After Conv3D (3×3×3), temporal RF = 3, spatial RF = 3. After Pool3D (4×4×4), temporal RF = 6. After Conv3D (3×3×3), temporal RF = 14. Both dimensions grow together.
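The receptive-field numbers above follow from a simple recurrence: each layer adds (kernel − 1) × (current spacing between outputs) to the receptive field. A small sketch, assuming each pooling layer uses a stride equal to its kernel size (the text does not state the stride):

```python
def receptive_fields(layers):
    """Receptive field after each layer along one axis.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    rf, jump = 1, 1      # jump = distance in input samples between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Temporal axis of the tiny 3D CNN: Conv3D(3), Pool3D(4, stride 4), Conv3D(3)
temporal_3d = receptive_fields([(3, 1), (4, 4), (3, 1)])
# Late fusion: every layer has temporal extent 1 until the global pool
temporal_late = receptive_fields([(1, 1), (1, 1), (1, 1)])

print(temporal_3d)    # [3, 6, 14]
print(temporal_late)  # [1, 1, 1]
```

The same function gives the spatial receptive field of any of the three designs, since all of them use 3×3 convolutions and 4×4 pooling in space.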
Karpathy et al. (2014) tested these strategies on Sports-1M, a dataset of 1 million YouTube videos across 487 sports. The results were surprising:
| Model | Top-5 Accuracy |
|---|---|
| Single Frame | 77.7% |
| Early Fusion | 76.8% |
| Late Fusion | 78.7% |
| 3D CNN (Slow Fusion) | 80.2% |
| C3D (2015) | 84.4% |
The single-frame model was shockingly competitive. Early fusion actually hurt performance compared to single-frame, likely because one layer of temporal processing isn't enough and the collapsed features confuse the rest of the network. It took C3D in 2015 — a deeper 3D CNN — to show the real potential of slow temporal fusion.
Temporal modeling matters, but it's hard to do well. Simple approaches like late fusion and early fusion barely beat (or even lose to) the single-frame baseline. You need to process time gradually and deeply. Shallow temporal processing is almost worse than none at all.
Humans don't just see shapes — we see motion. Classic neuroscience experiments by Johansson (1973) showed that people can recognize actions from nothing but moving dots placed on joints. No texture, no color, no background — just motion. If you see a pattern of dots moving in a walking rhythm, you instantly know it's a person walking.
This insight motivates a completely different approach to video understanding: instead of trying to learn motion from raw frames, compute motion explicitly and feed it as a separate input.
Optical flow is a displacement field F between consecutive frames $I_t$ and $I_{t+1}$. For each pixel (x, y) in frame t, it gives a vector (dx, dy) indicating where that pixel moves in frame t+1. The constraint is brightness constancy: $I_{t+1}(x + dx, y + dy) \approx I_t(x, y)$. Optical flow is typically stored as two channels: horizontal displacement (dx) and vertical displacement (dy).
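To make the constraint concrete, the toy example below builds a frame, moves everything by a known (dx, dy), and checks that the constraint holds exactly. It is only a sanity check of the definition under an assumed global translation; it does not estimate flow, and real flow varies per pixel.

```python
def translate(frame, dx, dy, fill=0):
    """Shift every pixel of a 2D frame by (dx, dy); uncovered pixels get `fill`."""
    h, w = len(frame), len(frame[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                out[ny][nx] = frame[y][x]
    return out

# Frame t: a bright 2x2 blob on a dark 6x6 background.
frame_t = [[0] * 6 for _ in range(6)]
for y in (1, 2):
    for x in (1, 2):
        frame_t[y][x] = 255

dx, dy = 2, 1                          # ground-truth flow, identical for every pixel
frame_t1 = translate(frame_t, dx, dy)

# Brightness constancy: I_{t+1}(x + dx, y + dy) == I_t(x, y) wherever the
# displaced pixel stays inside the frame.
consistent = all(
    frame_t1[y + dy][x + dx] == frame_t[y][x]
    for y in range(6 - dy) for x in range(6 - dx)
)
print(consistent)  # True
```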
Flow highlights motion while suppressing static background. A person running on grass: the RGB frame shows green everywhere (grass and person), but the optical flow shows motion only where the person's limbs move. Flow is a natural complement to appearance.
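Simonyan and Zisserman (2014) proposed running two separate CNNs in parallel: a spatial stream that classifies a single RGB frame, and a temporal stream that classifies a stack of optical flow fields computed over consecutive frames. The two streams' class scores are fused at the end.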
The temporal stream treats flow like an image — the first convolution layer has the full temporal stack (early fusion of flow). But since flow already encodes motion explicitly, this single-layer fusion is much more effective than early fusion on raw RGB.
Task: distinguish "playing tennis" from "playing badminton." The spatial stream sees: person, racket, court. Both sports look similar in a still frame. The temporal stream sees: tennis has wide, sweeping arm motions; badminton has quick wrist flicks. The motion patterns are completely different, even though the appearance is similar. Fusing both streams gives the correct answer when neither alone is sufficient.
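The averaging variant of fusion is simple enough to sketch in a few lines. The logits below are invented to mirror the tennis/badminton story; they are not outputs of any real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_average(spatial_logits, temporal_logits):
    """Late fusion: average the two streams' class probabilities."""
    ps, pt = softmax(spatial_logits), softmax(temporal_logits)
    return [(a + b) / 2 for a, b in zip(ps, pt)]

classes = ["tennis", "badminton"]
spatial = [0.1, 0.0]      # hypothetical: appearance is nearly ambiguous
temporal = [2.0, -1.0]    # hypothetical: motion strongly suggests tennis

fused = fuse_average(spatial, temporal)
prediction = classes[fused.index(max(fused))]
print(prediction)  # tennis
```

The SVM fusion variant would instead train a classifier on the stacked per-stream scores, which requires fitted weights and is omitted here.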
| Model | UCF-101 Accuracy |
|---|---|
| 3D CNN (Slow Fusion) | 65.4% |
| Spatial stream only (RGB) | 73.0% |
| Temporal stream only (Flow) | 83.7% |
| Two-stream (average fusion) | 86.9% |
| Two-stream (SVM fusion) | 88.0% |
Look at those numbers: the temporal stream alone (83.7%) crushes the spatial stream alone (73.0%). On UCF-101, motion information is more discriminative than appearance. And the two-stream combination (88.0%) shows that appearance and motion provide complementary signals. This is one of the most important insights in video understanding.
Computing optical flow is expensive. Classical methods (Farneback, TV-L1) require iterative optimization per frame pair. Pre-computing flow for a dataset can take longer than training the model itself. Modern methods like FlowNet learn to predict flow with a neural network, but the extra computation and storage remain a significant practical burden. Two-stream networks essentially require pre-processing the entire dataset.
Two-stream networks, like clip-based 3D CNNs, only see short windows of time (~2-5 seconds). For longer-term temporal structure — "first the person picks up a ball, then throws it" — you can feed per-clip CNN features into a recurrent network (LSTM) that processes the sequence of clips.
This CNN+LSTM approach (Donahue et al., 2015) works, but RNNs are slow for long sequences: each step depends on the previous hidden state, so computation can't be parallelized across time. We'll see how transformers solve this problem in Chapter 8.
Ballas et al. (2016) proposed a hybrid: replace the fully-connected recurrence in a vanilla RNN with 2D convolution. The hidden state $h_t^L$ at layer L and time t is computed as $h_t^L = \tanh(W_h * h_{t-1}^L + W_x * h_t^{L-1})$, where $*$ denotes 2D convolution. This preserves spatial structure while adding temporal recurrence at every layer, combining the infinite temporal extent of RNNs with the spatial locality of CNNs.
Let's build a deeper understanding of what 3D convolution actually does — and why it's fundamentally better than early fusion for temporal modeling.
In early fusion, the first convolution has a weight tensor of shape Cout × Cin × T × Kh × Kw. It slides over spatial positions (x, y) but not over time. The filter spans the full temporal extent in one shot. The output is a 2D feature map: Cout × H × W. Time has been collapsed.
This is a problem. If a blue-to-orange color transition occurs at time t=3, the filter needs different weights than if the same transition occurs at t=10. The network lacks temporal shift-invariance — it must learn separate patterns for the same motion at different times.
A 3D convolution has a weight tensor of shape Cout × Cin × Kt × Kh × Kw, where Kt is a small temporal kernel size (typically 3). Crucially, it slides over all three dimensions: time, height, and width.
The output retains the temporal dimension. A 3D filter that detects "hand moving left" will fire wherever that motion occurs in the clip — early, middle, or late — just as a 2D filter detects "vertical edge" wherever it appears in an image.
This is the key advantage of 3D convolution. A 2D conv with early fusion treats time position 1 differently from position 10 — different weights for the same pattern. A 3D conv shares weights across time, just as 2D conv shares weights across space. A "hand waving" filter fires at any time, just as an "edge" filter fires at any spatial location. The first-layer 3D filters learn interpretable spatiotemporal patterns — you can visualize them as short video clips.
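A one-dimensional toy makes the weight-sharing argument tangible. The sketch below keeps only the time axis: a short temporal filter slides over a signal, and the same "increase" event produces the same response wherever it happens. The signals and filters are made up for illustration.

```python
def temporal_conv(signal, kernel):
    """Valid 1D cross-correlation along time: one output per kernel position."""
    k = len(kernel)
    return [
        sum(kernel[i] * signal[t + i] for i in range(k))
        for t in range(len(signal) - k + 1)
    ]

def step_at(t_step, length=10):
    """Signal that jumps from 0 to 1 at time index `t_step`."""
    return [0] * t_step + [1] * (length - t_step)

kernel = [-1, 0, 1]     # small temporal filter: responds to an increase

early = temporal_conv(step_at(3), kernel)
late = temporal_conv(step_at(7), kernel)
print(early)  # [0, 1, 1, 0, 0, 0, 0, 0]
print(late)   # [0, 0, 0, 0, 0, 1, 1, 0]

# An early-fusion style filter spans the whole clip, so it yields one number
# and responds differently to the same event at different times.
full_clip_filter = [-1] * 3 + [1] * 7
print(temporal_conv(step_at(3), full_clip_filter))  # [7]
print(temporal_conv(step_at(7), full_clip_filter))  # [3]
```

The sliding filter's output for the late event is just the early output shifted by four steps, which is exactly the temporal analog of a 2D edge filter firing wherever the edge appears.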
Tran et al. (2015) built C3D, a straightforward extension of VGG-style design to 3D. The recipe: use 3×3×3 convolutions everywhere and 2×2×2 pooling (except Pool1 which is 1×2×2 to avoid collapsing the temporal dimension too early).
| Layer | Output Size (C × T × H × W) | MFLOPs |
|---|---|---|
| Input | 3 × 16 × 112 × 112 | - |
| Conv1 (3×3×3) + Pool1 (1×2×2) | 64 × 16 × 56 × 56 | 1,040 |
| Conv2 (3×3×3) + Pool2 (2×2×2) | 128 × 8 × 28 × 28 | 11,100 |
| Conv3a, Conv3b + Pool3 | 256 × 4 × 14 × 14 | 16,650 |
| Conv4a, Conv4b + Pool4 | 512 × 2 × 7 × 7 | 8,320 |
| Conv5a, Conv5b + Pool5 | 512 × 1 × 3 × 3 | 1,380 |
| FC6, FC7, FC8 | 4096, 4096, C | 1,000 |
C3D pretrained on Sports-1M became a popular video feature extractor — the video equivalent of using ImageNet-pretrained ResNet features for images. Many downstream tasks (video retrieval, action detection, video captioning) used C3D features as their starting point.
C3D requires 39.5 GFLOPs per clip. Compare: AlexNet is 0.7 GFLOPs, VGG-16 is 13.6 GFLOPs. C3D is 2.9× more expensive than VGG, and VGG was already considered heavy. The extra temporal kernel dimension multiplies the compute at every layer. This cost motivated the search for more efficient architectures like (2+1)D convolutions and TSM (Chapter 7).
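You can reproduce the convolutional part of that budget with a few lines. This sketch counts multiply-adds, assumes 'same' padding and stride 1, and evaluates each convolution at the resolution it actually runs at (before the pooling layer that follows it, so the shapes differ from the post-pool sizes in the table). The fully connected layers are left out, so the total lands a little under the quoted 39.5.

```python
def conv3d_macs(c_in, c_out, t, h, w, k=3):
    """Multiply-adds of a k x k x k 3D conv with 'same' padding and stride 1."""
    return c_out * t * h * w * c_in * k ** 3

# (name, C_in, C_out, T, H, W), with T, H, W taken at each conv's own output.
c3d_convs = [
    ("conv1", 3, 64, 16, 112, 112),
    ("conv2", 64, 128, 16, 56, 56),
    ("conv3a", 128, 256, 8, 28, 28),
    ("conv3b", 256, 256, 8, 28, 28),
    ("conv4a", 256, 512, 4, 14, 14),
    ("conv4b", 512, 512, 4, 14, 14),
    ("conv5a", 512, 512, 2, 7, 7),
    ("conv5b", 512, 512, 2, 7, 7),
]

per_layer = {name: conv3d_macs(*shape) for name, *shape in c3d_convs}
total = sum(per_layer.values())

for name, macs in per_layer.items():
    print(f"{name:7s} {macs / 1e6:10,.0f} M")
print(f"conv total: {total / 1e9:.1f} G multiply-adds")
```

Most of the cost sits in conv2 through conv4, where channel counts are already large but the feature maps have not yet shrunk much.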
A 2D conv layer: Cin=64, Cout=128, kernel 3×3. Parameters: 128 × 64 × 3 × 3 = 73,728.
The equivalent 3D conv: Cin=64, Cout=128, kernel 3×3×3. Parameters: 128 × 64 × 3 × 3 × 3 = 221,184. That's 3× more parameters, and the FLOPs increase by an additional factor because the output has a temporal dimension too.
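The parameter counts generalize to any kernel shape. A tiny helper (the name is hypothetical, and biases are ignored) makes the comparison explicit:

```python
def conv_params(c_in, c_out, *kernel):
    """Weight count of a conv layer (bias ignored) for any kernel shape."""
    n = c_in * c_out
    for k in kernel:
        n *= k
    return n

p2d = conv_params(64, 128, 3, 3)       # 73,728
p3d = conv_params(64, 128, 3, 3, 3)    # 221,184
print(p3d // p2d)                      # 3
```

The ratio is exactly the temporal kernel size, so a 5×3×3 kernel would cost 5× the parameters of its 2D counterpart.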
C3D showed that 3D CNNs work, but it was trained from scratch on video. Meanwhile, decades of work had produced excellent 2D CNN architectures (Inception, ResNet) with ImageNet-pretrained weights that encode rich spatial features. Could we reuse that knowledge for video?
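Carreira and Zisserman (2017) proposed an elegant trick: inflate any 2D CNN into a 3D CNN by replacing every 2D operation with its 3D counterpart. Each Kh×Kw convolution kernel becomes Kt×Kh×Kw, and each 2D pooling layer becomes 3D pooling. To initialize the 3D kernel, copy the pretrained 2D weights Kt times along the temporal axis and divide them by Kt.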
Why divide by Kt? Consider what happens if you feed a "constant" video (every frame is the same image) through the inflated 3D network. The 3D convolution sums over the temporal kernel: Kt copies of the same 2D filter, each multiplied by the same input, summed. If each copy has the same weight as the original 2D filter, the output would be Kt times too large. Dividing by Kt ensures the inflated network produces exactly the same output as the original 2D network on static inputs.
2D conv filter: W2D = [[1, 0], [0, 1]] (a 2×2 filter). Input image patch: [[a, b], [c, d]]. 2D output = a + d.
Inflate to 3D with Kt = 3. W3D[t] = W2D/3 for each t. Input: 3 identical frames. 3D output = 3 × (a + d)/3 = a + d. Same result!
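The same check can be run numerically. This sketch evaluates a filter at a single position using plain nested lists; the patch values standing in for a, b, c, d are arbitrary.

```python
def conv2d_at(patch, kernel):
    """Response of a 2D filter on one same-sized patch (sum of products)."""
    return sum(
        kernel[i][j] * patch[i][j]
        for i in range(len(kernel)) for j in range(len(kernel[0]))
    )

def inflate(kernel_2d, k_t):
    """Copy a 2D kernel k_t times along time and divide each copy by k_t."""
    return [[[w / k_t for w in row] for row in kernel_2d] for _ in range(k_t)]

def conv3d_at(clip, kernel_3d):
    """Response of a 3D filter on a clip of len(kernel_3d) same-sized patches."""
    return sum(conv2d_at(frame, k) for frame, k in zip(clip, kernel_3d))

w2d = [[1, 0], [0, 1]]
patch = [[2.0, 5.0], [7.0, 4.0]]        # arbitrary values for a, b, c, d
w3d = inflate(w2d, k_t=3)
static_clip = [patch, patch, patch]     # a "constant" video: identical frames

out_2d = conv2d_at(patch, w2d)          # a + d
out_3d = conv3d_at(static_clip, w3d)    # 3 * (a + d) / 3
print(out_2d, out_3d)
```

Up to floating-point rounding the two outputs agree, which is the property that lets the inflated network start from the 2D network's behavior on static inputs.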
Now fine-tune on real video: the temporal weights start equal but gradually diverge. The filter learns to detect changes across frames — exactly what motion detection requires.
The genius of inflation is that you start from a warm start. The inflated network already knows how to detect edges, textures, objects, and scenes from ImageNet pretraining. It just doesn't know about motion yet. Fine-tuning on video data teaches the temporal kernels to diverge from their copies and detect temporal patterns. You get the best of both worlds: ImageNet's spatial features + learned temporal features.
Carreira and Zisserman inflated the Inception-v1 architecture. They compared several approaches on the Kinetics-400 dataset:
| Model | Pretrained On | Top-1 Accuracy |
|---|---|---|
| Per-frame CNN (2D Inception) | ImageNet | 63.3% |
| CNN + LSTM | ImageNet | 62.2% |
| Two-Stream CNN | ImageNet | 65.6% |
| 3D CNN (from scratch) | None | 53.9% |
| 3D CNN (inflated) | ImageNet (inflated) | 57.9% |
| I3D (RGB only) | ImageNet (inflated) | 68.4% |
| I3D (Two-stream) | ImageNet (inflated) | 74.2% |
A 3D CNN trained from scratch on Kinetics: 53.9%. The same architecture with inflated ImageNet weights: 57.9%. That's a 4 percentage point boost just from better initialization. And the full I3D with inflated weights and more training: 68.4%. Pretraining matters enormously — video datasets are smaller than ImageNet, so starting from good spatial features is crucial.
The original Inception block has parallel branches: 1×1 conv, 3×3 conv, 5×5 conv, and 3×3 max pool. In I3D, these become: 1×1×1 conv, 3×3×3 conv, 5×5×5 conv, and 3×3×3 max pool. The 1×1 bottleneck convolutions become 1×1×1 — they don't mix temporal information, only reduce channel dimensions. The 3×3×3 convolutions handle spatiotemporal feature extraction. Each branch sees a different spatiotemporal window.
I3D inherits the 3D convolution cost problem. An inflated Inception is cheaper than C3D because Inception uses 1×1 bottleneck layers, but it's still significantly more expensive than a 2D model. Running I3D in two-stream mode (RGB + flow) doubles the cost again, plus you need to pre-compute optical flow. The next two chapters address efficiency.
Here's an insight from neuroscience: the primate visual system processes visual input with two types of cells. Parvocellular cells (P-cells) are slow, detailed, and color-sensitive: they build rich representations of what an object looks like. Magnocellular cells (M-cells) are fast, low-resolution, and motion-sensitive: they rapidly detect temporal changes. The brain runs two parallel streams at different temporal resolutions.
Feichtenhofer et al. (2019) applied this principle directly to network architecture.
A video architecture with two parallel 3D CNN pathways:
Slow pathway: operates at low frame rate (e.g., 4 fps). Uses more channels. Captures rich spatial semantics — what objects are present, their detailed appearance.
Fast pathway: operates at high frame rate (e.g., 32 fps, 8× that of the slow pathway). Uses far fewer channels: a fraction β of the slow pathway's width, typically β = 1/8. (The paper reserves α for the frame-rate ratio, α = 8.) Captures fine-grained temporal motion. Lightweight because fewer channels means fewer FLOPs.
The genius: the fast pathway only adds about 20% compute because it uses so few channels. But it sees 8× more frames, giving it fine temporal resolution where it matters most: motion detection.
The two pathways aren't independent. Lateral connections feed information from the fast pathway into the slow pathway at multiple resolutions. This lets the slow pathway incorporate motion information without needing to process high-frame-rate input itself.
Input video: 64 frames at 30 fps (~2 seconds). Slow pathway samples every 8th frame: 8 frames, 64 channels. Fast pathway uses all frames at β = 1/8 of the channels: 64 frames, 8 channels.
Slow pathway FLOPs per 3D conv (3×3×3, 64→64): 8 × H' × W' × 64 × 64 × 27 = X.
Fast pathway FLOPs (3×3×3, 8→8): 64 × H' × W' × 8 × 8 × 27 = $X \times (64/8) \times (8/64)^2 = X \times 8 \times (1/64) = X/8$.
The fast pathway is 8× cheaper per layer despite processing 8× more frames. Channel count dominates compute.
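Here is the same comparison as code. The spatial size H' × W' is identical in both pathways, so it is passed as a placeholder factor of 1; the function name is made up for this sketch.

```python
def conv3d_cost(frames, channels, spatial=1, k=3):
    """Multiply-adds of a k^3 conv with C_in = C_out = channels.

    `spatial` stands in for H' * W', which is the same in both pathways.
    """
    return frames * spatial * channels * channels * k ** 3

slow = conv3d_cost(frames=8, channels=64)
fast = conv3d_cost(frames=64, channels=8)

print(slow // fast)            # 8
print(fast / (slow + fast))    # fast pathway's share of this layer's cost
```

Cost is linear in the number of frames but quadratic in the number of channels, which is why trading channels for frames is such a good deal.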
You might think: "just feed more frames into a single 3D CNN." But doubling the frame rate doubles FLOPs linearly, and the network wastes capacity processing detailed spatial features at high temporal resolution. SlowFast's insight is that appearance needs high channels but few frames, while motion needs many frames but few channels. Splitting these responsibilities saves compute while improving accuracy.
SlowFast with a ResNet-101 backbone and non-local attention blocks achieved 79.8% top-1 on Kinetics-400, a significant jump from I3D's 74.2%. The non-local blocks (spatio-temporal self-attention inserted between 3D conv layers) help capture long-range dependencies that local 3D convolutions miss.
Given features C × T × H × W, project to queries, keys, and values via 1×1×1 conv. Compute attention over all T×H×W positions. Each position attends to every other spatiotemporal position. Added as residual blocks into existing 3D CNNs. Equivalent to self-attention on a flattened sequence of T×H×W tokens. Cost: $O((THW)^2)$, expensive but effective.
3D convolutions are powerful but expensive. Researchers have found clever ways to approximate their behavior at a fraction of the cost.
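Instead of a single 3D convolution with kernel Kt×Kh×Kw, decompose it into two sequential operations: a 2D spatial convolution with kernel 1×Kh×Kw that maps Cin channels to M intermediate channels, followed by a 1D temporal convolution with kernel Kt×1×1 that maps M channels to Cout.
This factorization doubles the number of nonlinearities (a ReLU after each sub-convolution). When M is chosen to match, it has roughly the same parameter count as the full 3D convolution. Tran et al. (2018) showed R(2+1)D matched or exceeded full 3D ResNets on Kinetics.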
Full 3D conv: Cin=64, Cout=64, kernel 3×3×3. Params: 64 × 64 × 27 = 110,592.
(2+1)D with M=64: Spatial (64×64×9) + Temporal (64×64×3) = 36,864 + 12,288 = 49,152. That's 55% fewer parameters for the same receptive field. The extra ReLU between spatial and temporal convolution adds representational power for free.
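The intermediate width M is a free design choice, and the parameter count depends directly on it. The sketch below reproduces the numbers above and also solves for the M that would make the factorized block match the full 3D parameter count; that formula is derived here by setting the two counts equal, and biases are ignored.

```python
def params_3d(c_in, c_out, k_t=3, k_s=3):
    """Weights in a full k_t x k_s x k_s 3D convolution (no bias)."""
    return c_in * c_out * k_t * k_s * k_s

def params_2plus1d(c_in, c_out, mid, k_t=3, k_s=3):
    """Weights in a 1 x k_s x k_s spatial conv followed by a k_t x 1 x 1 temporal conv."""
    return c_in * mid * k_s * k_s + mid * c_out * k_t

def matching_mid(c_in, c_out, k_t=3, k_s=3):
    """Intermediate width M at which (2+1)D matches the 3D parameter count."""
    return (k_t * k_s * k_s * c_in * c_out) // (k_s * k_s * c_in + k_t * c_out)

full = params_3d(64, 64)                    # 110,592
narrow = params_2plus1d(64, 64, mid=64)     # 49,152
m = matching_mid(64, 64)                    # 144
matched = params_2plus1d(64, 64, mid=m)     # 110,592

print(full, narrow, m, matched)
```

So M = 64 buys a cheaper layer, while M = 144 spends the same parameter budget as the 3D convolution and keeps the extra nonlinearity.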
Lin et al. (2019) asked a radical question: what if we could capture temporal information using zero additional parameters and zero extra FLOPs?
Given a feature tensor of shape C × T × H × W, TSM shifts a portion of the channels along the time axis. Specifically: shift the first C/8 channels forward by one frame, shift the next C/8 channels backward by one frame, and leave the remaining 3C/4 channels unchanged. This creates temporal mixing between adjacent frames at zero cost — it's just a memory move, not a computation.
When you insert TSM before a standard 2D convolution, the 2D conv now processes a mixture of features from the current frame, the previous frame, and the next frame. The spatial convolution implicitly becomes a spatiotemporal operation — without any 3D weights.
TSM is profound because it means any 2D CNN (ResNet, MobileNet, EfficientNet) can be converted to a video model by simply inserting channel shifts before each residual block. No architecture changes. No new parameters. No new FLOPs. A TSM-ResNet-50 achieves comparable accuracy to 3D ResNets on Kinetics while running at the same speed as a 2D ResNet-50. This makes it especially attractive for deployment on mobile devices and edge hardware.
Frame t-1 has a ball on the left. Frame t has the ball in the center. Frame t+1 has the ball on the right. After TSM, the feature map at time t contains: some channels from t-1 (ball-left features), most channels from t (ball-center features), and some channels from t+1 (ball-right features).
When the 2D conv processes this mixed feature map, it sees ball-left, ball-center, and ball-right information simultaneously. A filter that responds to "ball moving right" will fire — exactly like a temporal convolution would, but without 3D weights.
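The shift itself is only indexing. In the sketch below each entry x[c][t] stands in for a whole H × W feature map, and frames shifted in from outside the clip are filled with zeros, which is an assumption about boundary handling.

```python
def temporal_shift(x, fold_div=8):
    """Shift part of the channels along time, zero-padding at clip boundaries.

    `x` is indexed as x[c][t]; each entry stands in for an H x W feature map.
    """
    c, t = len(x), len(x[0])
    fold = c // fold_div
    out = [[0] * t for _ in range(c)]
    for ch in range(c):
        for i in range(t):
            if ch < fold:                   # shifted forward: frame t sees t-1
                out[ch][i] = x[ch][i - 1] if i > 0 else 0
            elif ch < 2 * fold:             # shifted backward: frame t sees t+1
                out[ch][i] = x[ch][i + 1] if i < t - 1 else 0
            else:                           # remaining channels stay in place
                out[ch][i] = x[ch][i]
    return out

# 8 channels, 4 frames; the value 10*c + t makes every entry's origin visible.
x = [[10 * c + t for t in range(4)] for c in range(8)]
y = temporal_shift(x)

print(y[0])  # [0, 0, 1, 2]      channel 0 now holds the previous frame
print(y[1])  # [11, 12, 13, 0]   channel 1 now holds the next frame
print(y[2])  # [20, 21, 22, 23]  untouched
```

A 2D convolution applied at frame t to `y` now mixes channel 0 from t−1, channel 1 from t+1, and the rest from t, with no new weights.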
| Model | GFLOPs | Kinetics-400 Top-1 | Extra Params vs 2D |
|---|---|---|---|
| ResNet-50 (2D, per-frame) | ~4 | ~63% | 0 |
| C3D | 39.5 | ~65% | All new |
| I3D (Inception) | ~108 | 71.1% | Inflated |
| R(2+1)D-34 | ~152 | 72.0% | Factorized |
| TSM-ResNet-50 | ~4 × T | ~74% | 0 (zero) |
| SlowFast R-101+NL | ~234 | 79.8% | Dual path |
TSM only shifts channels by one frame. Its temporal receptive field grows linearly with depth (one more frame per ResNet block), not exponentially like 3D pooling. For actions requiring very long temporal reasoning, TSM may underperform heavier models. Also, TSM doesn't help the first layer — there are no features to shift yet. For maximum accuracy, you still want 3D convolutions or transformers. TSM shines when compute budget is tight.
Self-attention doesn't care about local neighborhoods. It compares every position to every other position. For video, that means every patch in every frame can attend to every patch in every other frame. In principle, this gives a global receptive field from layer one, with no gradual buildup needed.
But there's a problem: a video clip with T=8, H=14, W=14 (after patch embedding) has T × H × W = 1,568 tokens. Full self-attention is $O(N^2)$, so the attention matrix has $1{,}568^2 \approx 2.5$ million entries. For longer clips or higher resolution, this explodes.
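Bertasius et al. (2021) proposed Divided Space-Time Attention. Instead of full spatiotemporal attention, each transformer block applies two types of attention in sequence: temporal attention, where each patch attends to the patches at the same spatial location in the other frames, then spatial attention, where each patch attends to all patches within its own frame.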
Consider a patch at position (h=5, w=7) in a clip with T=8 frames. In temporal attention, this patch attends to the 8 patches at (5, 7) across all frames — comparing "what does this same location look like over time?" This captures motion at that position. In spatial attention, the patch at frame t=3 attends to all 14×14=196 patches within frame 3 — standard image self-attention. By alternating, the model builds spatiotemporal representations without the quadratic blowup.
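Counting query-key pairs shows how much the factorization saves. This sketch ignores the classification token and any constant factors, so treat the ratio as a rough guide rather than a measured speedup.

```python
def joint_attention_pairs(t, h, w):
    """Query-key pairs when every token attends to every token."""
    n = t * h * w
    return n * n

def divided_attention_pairs(t, h, w):
    """Query-key pairs for one temporal pass plus one spatial pass."""
    n = t * h * w
    temporal = n * t        # each token attends to the same location in all T frames
    spatial = n * h * w     # each token attends to all H*W patches in its own frame
    return temporal + spatial

joint = joint_attention_pairs(8, 14, 14)        # 2,458,624
divided = divided_attention_pairs(8, 14, 14)    # 319,872
print(f"{joint / divided:.1f}x fewer pairs")
```

Per token, the cost drops from T·H·W comparisons to T + H·W, and the gap widens as clips get longer.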
Arnab et al. (2021) systematically compared four ways to factorize video transformers:
| Model | Strategy | Description |
|---|---|---|
| ViViT-1 | Spatio-temporal tokens | Full attention over all T×H×W tokens. Most accurate, most expensive. |
| ViViT-2 | Factorized encoder | Spatial transformer first, then temporal transformer on CLS tokens. |
| ViViT-3 | Factorized self-attention | TimeSformer-style: alternate spatial and temporal attention in each block. |
| ViViT-4 | Factorized dot-product | Split the attention heads: half attend over space, half over time, and their outputs are concatenated. |
ViViT introduces tubelet embedding: instead of embedding individual 2D patches, embed 3D tubes (t × h × w) from the video volume. A tubelet of size 2×16×16 means each token represents 2 frames and 16×16 pixels. This reduces the number of tokens (fewer = cheaper attention) while giving each token temporal context from the start. Think of it as 3D patch embedding — the video equivalent of ViT's 2D patch embedding.
Tong et al. (2022) extended Masked Autoencoders (MAE) to video. The key insight: video is highly redundant. Consecutive frames are nearly identical. You can mask a very high ratio of tubes (90-95%) and still reconstruct the video, because adjacent visible tubes provide strong clues.
A self-supervised pretraining method for video transformers. Divide the video into spatiotemporal tubes. Randomly mask 90-95% of them. Feed the visible tubes into a ViT encoder. Reconstruct the masked tubes with a lightweight decoder. The encoder learns rich spatiotemporal representations without any labeled data.
VideoMAE V2 (Wang et al., 2023) scaled this approach to ViT-giant with 1 billion parameters, achieving 90.0% top-1 on Kinetics-400, among the best reported results at the time of publication. The high masking ratio makes pretraining efficient: the encoder only processes 5-10% of tokens, dramatically reducing FLOPs and memory.
Input: 16 frames, 14×14 spatial grid, tubelet size 2×16×16. Total tubes: (16/2) × 14 × 14 = 1,568. At 90% masking: only 157 visible tubes are processed by the encoder. The encoder's self-attention cost drops from $O(1568^2)$ to $O(157^2)$, a 100× reduction. This makes it practical to pretrain very large models on massive video datasets.
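The token bookkeeping can be sketched as follows. It assumes a 224×224 input (which gives the 14×14 grid with 16×16 patches) and samples visible tokens uniformly at random as a simplification; the actual masking pattern used in the paper may be more structured.

```python
import random

def num_tubelets(frames, height, width, tube=(2, 16, 16)):
    """Token count after non-overlapping tubelet embedding."""
    tt, th, tw = tube
    return (frames // tt) * (height // th) * (width // tw)

def random_visible(n_tokens, mask_ratio, rng):
    """Indices of tokens kept visible under uniform random masking."""
    n_visible = round(n_tokens * (1 - mask_ratio))
    return sorted(rng.sample(range(n_tokens), n_visible))

n = num_tubelets(16, 224, 224)                         # 8 * 14 * 14 = 1,568
visible = random_visible(n, 0.90, random.Random(0))    # 157 token indices
attention_saving = (n / len(visible)) ** 2             # roughly 100x

print(n, len(visible), round(attention_saving, 1))
```

Only the `visible` tokens are fed to the encoder; the decoder later receives placeholder tokens at the masked positions and is asked to reconstruct them.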
| Model | Year | Kinetics-400 Top-1 | Architecture |
|---|---|---|---|
| Per-frame CNN | 2014 | 63.3% | 2D Inception |
| Two-Stream I3D | 2017 | 74.2% | Inflated Inception |
| SlowFast R-101+NL | 2019 | 79.8% | Dual-path 3D ResNet |
| MViTv2-L | 2022 | 86.1% | Multiscale ViT |
| VideoMAE V2-g | 2023 | 90.0% | ViT-giant + MAE |
The progression is clear: 63% → 74% → 80% → 86% → 90%. Each major jump came from a paradigm shift: 2D CNNs → 3D CNNs → efficient 3D architectures → vision transformers → self-supervised pretraining. The current best (VideoMAE V2) combines ViT's global attention with masked autoencoder pretraining at enormous scale. But notice: even the 2017 I3D was already 74%. The last 16 percentage points took six more years of research.
Let's bring everything together.
Every architecture in this lecture makes a specific trade-off along three axes:
| Architecture | Temporal Modeling | Compute Cost | Key Innovation |
|---|---|---|---|
| Single Frame | None (avg at test) | Very low | Baseline |
| Two-Stream | Optical flow stream | Medium + flow | Explicit motion input |
| C3D | 3D conv throughout | High (39.5G) | VGG-style 3D |
| I3D | Inflated 3D conv | High (108G) | ImageNet weight transfer |
| SlowFast | Dual temporal rate | High (234G) | Slow + fast pathways |
| TSM | Channel shifting | Very low (= 2D) | Zero-cost temporal mixing |
| TimeSformer | Factorized attention | Medium-high | Divided space-time attn |
| VideoMAE V2 | Full attention + MAE | High (pretraining) | 90% masking ratio |
Compute-limited? Use TSM: it turns any 2D CNN into a video model at zero extra cost. Need maximum accuracy? VideoMAE V2 with a ViT-giant backbone. Want a good middle ground? SlowFast or MViTv2. Need interpretable motion features? Two-stream with optical flow. The right answer depends on your deployment environment, dataset size, and latency requirements.
2014: Karpathy's fusion experiments show single-frame is embarrassingly strong. 2014: Simonyan's two-stream network proves motion matters. 2015: C3D shows 3D convolutions can learn temporal features end-to-end. 2017: I3D shows ImageNet pretraining transfers to video. 2019: SlowFast shows asymmetric temporal resolution is better. 2019: TSM shows you can do temporal modeling for free. 2021: TimeSformer/ViViT bring transformers to video. 2022-23: VideoMAE shows self-supervised pretraining at scale dominates everything else.
Video doesn't exist in a vacuum. Real videos have audio — speech, music, ambient sounds — that provides powerful complementary signals. A person's lips move in sync with their voice. Musical instruments produce characteristic sounds. Dogs bark and birds chirp. Multimodal video understanding fuses visual and audio streams for richer scene comprehension.
The McGurk effect demonstrates how deeply intertwined vision and audio are in human perception. When you hear "ba" while watching lips say "ga," you perceive "da": your brain fuses the conflicting signals into a third percept. Neural networks can similarly learn to fuse audio and video.
Given a video with a mixture of sounds (two people speaking, multiple instruments playing), separate the audio into individual sources using the visual signal as guidance. If you can see a person's lips moving, you can isolate their voice from the mixture. If you can see a guitar being strummed, you can separate the guitar audio from the violin playing simultaneously.
Gao and Grauman (2021) demonstrated this with VisualVoice: given a video of two people speaking simultaneously, the model separates their voices by using each person's face and lip movements to guide the audio separation. The visual modality acts as a "query" that tells the audio model whose voice to extract.
For long-form videos (minutes to hours), processing every frame is impractical even with efficient models. Several strategies address this:
Salient clip sampling (Korbar et al., 2019): Instead of uniform sampling, predict which clips are most informative and only process those. Use a lightweight "preview" model to score candidate clips.
Audio as preview (Gao et al., 2020): Use audio features to decide which video frames to process. Audio is orders of magnitude cheaper to process than video. If the audio suggests "someone is speaking," process the corresponding frames; if "silence," skip them.
Mobile architectures (MoViNets, Kondratyuk et al., 2021): Stream-based architectures that process one frame at a time using causal temporal convolutions, maintaining a fixed-size hidden state. No need to buffer entire clips.
Video understanding extends well beyond action classification:
| Task | Input | Output |
|---|---|---|
| Temporal action localization | Long untrimmed video | Start/end times + labels for each action |
| Spatio-temporal detection | Video sequence | Bounding boxes + actions per person per frame |
| Video captioning | Video clip | Natural language description |
| Video question answering | Video + question text | Natural language answer |
| Egocentric understanding | First-person video | Activity/object/interaction recognition |
The latest frontier: combining video encoders with large language models. Systems like Video-LLaVA (Lin et al., 2024) and VideoLLaMA 3 (Zhang et al., 2025) embed video frames using a ViT encoder, project the visual tokens into the LLM's embedding space, and let the LLM reason about the video using its language capabilities. This enables open-ended video understanding: "What is happening in this video?" "Why did the person look surprised?" "What will happen next?"
Video understanding has followed a clear arc: hand-crafted motion (optical flow) → learned 3D features (C3D/I3D) → efficient temporal modeling (SlowFast/TSM) → global attention (transformers) → self-supervised pretraining (VideoMAE) → multimodal reasoning (Video-LLMs). Each step increased both the temporal reasoning capacity and the generality of the representations. The field is converging toward unified models that see, hear, and reason about video in natural language.
| Concept | Key Insight |
|---|---|
| Temporal modeling | Late fusion is lazy; slow fusion (3D conv) builds temporal features gradually |
| Motion representation | Optical flow provides explicit motion; 3D conv learns implicit motion |
| Weight reuse | Inflate 2D ImageNet weights to 3D — warm start beats cold start |
| Asymmetric design | Appearance needs channels, motion needs frames (SlowFast) |
| Zero-cost temporal | Channel shifting (TSM) gives temporal modeling for free |
| Attention factorization | Separate spatial and temporal attention avoids $O(N^2)$ blowup |
| Self-supervised scaling | 90% masking + large ViTs = state-of-the-art without labels |
Lecture connections: This lecture builds on CNN architectures (Lecture 5), training and optimization (Lecture 6), and detection/segmentation (Lecture 9). The transformer-based approaches build on the Vision Transformer (ViT) covered in Lecture 8.
Karpathy et al., "Large-scale Video Classification with Convolutional Neural Networks," CVPR 2014.
Simonyan & Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," NeurIPS 2014.
Tran et al., "Learning Spatiotemporal Features with 3D Convolutional Networks," ICCV 2015. (C3D)
Carreira & Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," CVPR 2017. (I3D)
Feichtenhofer et al., "SlowFast Networks for Video Recognition," ICCV 2019.
Lin et al., "TSM: Temporal Shift Module for Efficient Video Understanding," ICCV 2019.
Bertasius et al., "Is Space-Time Attention All You Need for Video Understanding?" ICML 2021. (TimeSformer)
Tong et al., "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training," NeurIPS 2022.