CS 231n — Video Understanding: From Frames to Temporal Reasoning

Roadmap

What You'll Master

01 From Images to Video
02 Simple Approaches & Fusion
03 Two-Stream Networks
04 3D Convolutions & C3D
05 Inflated 3D Networks (I3D)
06 SlowFast Networks
07 Efficient Video Models
08 Video Transformers
09 Architecture Comparison
10 Multimodal & Connections

Chapter 01

From Images to Video

You already know how to classify images. A CNN looks at a 3×H×W tensor (three color channels, height, width) and outputs a label: "cat," "truck," "airplane." But the real world isn't frozen. A person isn't just standing — they're running, jumping, throwing. To recognize actions, you need to see how things change over time.

A video is just a sequence of images — frames — stacked along a new axis: time. So instead of a 3D tensor (3×H×W), a video clip is a 4D tensor: T×3×H×W, where T is the number of frames.

Video as a 4D Tensor

Video clip: T × 3 × H × W
T = number of frames, 3 = RGB channels, H × W = spatial resolution

That extra dimension — time — is what makes video understanding both powerful and brutally expensive.

The Scale Problem

Videos are recorded at roughly 30 frames per second. A single minute of uncompressed HD video (1920×1080) weighs about 11 GB. Even standard definition (640×480) is ~1.7 GB per minute. Feeding raw, full-resolution video into a neural network is not practical: the activations for even a few seconds of input would exhaust GPU memory.

Worked Example — Why Videos Are Huge

One HD frame: 1920 × 1080 × 3 bytes = 6.2 MB. At 30 fps, one second = 186 MB. One minute = 11.2 GB. Compare to a single ImageNet image: 224 × 224 × 3 = 150 KB. A one-minute video is roughly 75,000× larger than one ImageNet image.

The solution: train on short clips at low frame rate and low spatial resolution. A typical setup: T=16 frames sampled at 5 fps, with H=W=112. That's 3.2 seconds of content in only 588 KB — manageable.
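The arithmetic above can be checked in a few lines of Python. This is a minimal sketch that assumes one byte per channel value and no compression; the helper name raw_video_bytes is ours for illustration and does not come from any library.

```python
def raw_video_bytes(num_frames, height, width, channels=3, bytes_per_value=1):
    """Size of an uncompressed clip, assuming one byte per channel value."""
    return num_frames * height * width * channels * bytes_per_value

# One minute of HD video at 30 fps.
hd_minute = raw_video_bytes(30 * 60, 1080, 1920)
# One ImageNet-sized image.
imagenet_image = raw_video_bytes(1, 224, 224)
# A typical training clip: 16 frames at 112 x 112.
train_clip = raw_video_bytes(16, 112, 112)

print(f"HD minute: {hd_minute / 1e9:.1f} GB")
print(f"Image:     {imagenet_image / 1e3:.1f} KB")
print(f"Clip:      {train_clip / 1024:.0f} KiB")
print(f"Ratio:     {hd_minute / imagenet_image:,.0f}x")
```

Note that the clip figure comes out as 588 only when a kilobyte is taken as 1,024 bytes; in units of 1,000 bytes the same clip is about 602 KB.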

Training on Clips

During training, you sample random short clips from long videos. Each clip gets a label (the action class for that segment). During testing, you run the model on multiple clips from the same video and average the predictions. This "clip-then-aggregate" strategy is universal in video classification.
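The test-time half of this strategy can be sketched as follows. The evenly spaced sampling scheme, the function names, and the per-clip probabilities are all assumptions made for illustration; real pipelines differ in how many clips they take and where.

```python
def clip_start_indices(num_frames, clip_len, num_clips):
    """Evenly spaced start indices for test-time clips (an assumed scheme)."""
    last_start = num_frames - clip_len
    if last_start < 0:
        raise ValueError("video is shorter than one clip")
    if num_clips == 1:
        return [last_start // 2]
    return [round(i * last_start / (num_clips - 1)) for i in range(num_clips)]

def average_predictions(clip_probs):
    """Average per-clip class probabilities into one video-level prediction."""
    num_clips = len(clip_probs)
    num_classes = len(clip_probs[0])
    return [sum(p[c] for p in clip_probs) / num_clips for c in range(num_classes)]

# Where to cut five 16-frame clips out of a 300-frame video.
starts = clip_start_indices(num_frames=300, clip_len=16, num_clips=5)

# Hypothetical model outputs for three clips of a 3-class problem.
clip_probs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
video_probs = average_predictions(clip_probs)
predicted_class = max(range(len(video_probs)), key=video_probs.__getitem__)
print(starts, video_probs, predicted_class)
```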

The Central Question

Given a clip of T frames, how should a neural network process the temporal dimension? Should it look at frames independently? Should it fuse them early, late, or gradually? Should it use 2D convolutions, 3D convolutions, or attention? Every architecture in this lecture answers this question differently.

Definition

Video Classification

Given a video clip (T × 3 × H × W), predict an action label such as "running," "swimming," or "playing guitar." This is the video analog of image classification, but the label depends on temporal patterns, not just appearance.

Images vs. Videos: What Changes

Property	Image Classification	Video Classification
Input shape	3 × H × W	T × 3 × H × W
Recognizes	Objects (dog, car, tree)	Actions (running, jumping, eating)
Key signal	Spatial features	Spatial + temporal features
Compute cost	~4 GFLOPs (ResNet-50)	~100+ GFLOPs (typical video model)
Data size	~150 KB per image	~600 KB per clip (low-res)

The Embarrassing Baseline

Here's a humbling fact: a single-frame CNN — one that classifies each frame independently and averages the predictions — is often a very strong baseline for video classification. On many benchmarks, it performs within a few percentage points of complex temporal models. This tells us that many "action recognition" datasets can be solved mostly by appearance (a person holding a tennis racket is probably playing tennis). True temporal reasoning is harder to benchmark than it sounds.

Chapter 02

Simple Approaches: Temporal Fusion

The simplest approach to video: ignore time entirely. Run a 2D CNN on each frame independently, average the class probabilities, and call it a day. This is the single-frame CNN baseline, and as we just noted, it's embarrassingly competitive.

But if you want to do better, you need to fuse temporal information somehow. The question is when in the network to combine frames. This gives us three strategies, first explored systematically by Karpathy et al. (2014).

Late Fusion

Run a 2D CNN on each frame independently to extract per-frame features. Then combine the features at the very end, either by flattening and feeding to an MLP, or by average-pooling across space and time.

Late Fusion Pipeline

Input: T × 3 × H × W
↓ 2D CNN on each frame independently
Frame features: T × D × H' × W'
↓ Average pool over space and time
Clip features: D
↓ Linear classifier
Class scores: C

The advantage: you reuse a standard pretrained 2D CNN (like ResNet). The problem: the CNN never compares frames directly. It can't detect low-level motion patterns because each frame is processed in isolation until the very end.

Early Fusion

Stack all T frames along the channel dimension to create a single "fat" image of shape (3T) × H × W. Feed this directly into a standard 2D CNN. The very first convolution layer now has access to all temporal information.

Early Fusion Pipeline

Input: T × 3 × H × W → Reshape to (3T) × H × W
↓ First Conv2D: (3T) × H × W → D × H × W
↓ Rest of standard 2D CNN
Class scores: C

All temporal info collapses in the first layer — one shot to capture motion.

The advantage: the first layer can compare pixels across frames and detect motion. The problem: one convolutional layer may not be enough temporal processing. After the first layer, the network is a regular 2D CNN with no temporal dimension left.
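The tensor bookkeeping behind late and early fusion can be sketched with NumPy. No real CNN is run here: the per-frame feature maps are random placeholders, and D, Hp, and Wp are assumed values chosen only to make the shapes concrete.

```python
import numpy as np

T, H, W = 16, 112, 112
clip = np.zeros((T, 3, H, W), dtype=np.float32)

# Early fusion: fold time into channels, giving one (3T)-channel "image"
# that a standard 2D CNN can consume.
early_input = clip.reshape(T * 3, H, W)

# Late fusion: pretend a 2D CNN already produced per-frame feature maps
# of shape T x D x H' x W' (placeholder values, not real features).
D, Hp, Wp = 512, 4, 4
frame_features = np.random.rand(T, D, Hp, Wp).astype(np.float32)
clip_features = frame_features.mean(axis=(0, 2, 3))  # pool over time and space

print(early_input.shape)    # (48, 112, 112)
print(clip_features.shape)  # (512,)
```

In the early-fusion layout, channel index 3t + c holds color channel c of frame t, so the first convolution sees every frame at once but nothing after it knows which channels belonged to which frame.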

Slow Fusion (3D CNN)

Use 3D convolutions and 3D pooling throughout the network. Each layer operates on a 4D tensor (D × T × H × W), gradually reducing the temporal dimension alongside the spatial dimensions. This is "slow" fusion because temporal information is integrated progressively, layer by layer.

Slow Fusion (3D CNN) Pipeline

Input: 3 × T × H × W
↓ Conv3D (3×3×3): D₁ × T × H × W
↓ Pool3D (4×4×4): D₁ × T/4 × H/4 × W/4
↓ Conv3D (3×3×3): D₂ × T/4 × H/4 × W/4
↓ GlobalAvgPool
Class scores: C

How Receptive Fields Grow

Late fusion: spatial receptive field grows slowly through the network; temporal receptive field jumps to the full clip only at the very end. Early fusion: temporal receptive field covers the full clip immediately at layer 1; spatial grows slowly. Slow fusion (3D CNN): both spatial and temporal receptive fields grow gradually. Each layer sees a slightly larger neighborhood in both space and time. This is the most balanced approach.

Worked Example — Receptive Fields at Each Layer

Consider a tiny architecture: 2 conv layers + global pool. Input: 3 × 20 × 64 × 64. Conv filter size: 3×3 (2D) or 3×3×3 (3D). Pool: 4×4 (2D) or 4×4×4 (3D).

Late fusion: After Conv1 (3×3), temporal RF = 1 (no temporal mixing). After Pool (4×4), temporal RF = 1. After Conv2 (3×3), temporal RF = 1. After GlobalAvgPool, temporal RF = 20. The network only sees time at the very end.

Early fusion: After Conv1, temporal RF = 20 (all frames stacked in channels). But spatial RF = 3. After that, the temporal dimension is gone — it's a flat 2D feature map.

3D CNN: After Conv3D (3×3×3), temporal RF = 3, spatial RF = 3. After Pool3D (4×4×4), temporal RF = 6. After Conv3D (3×3×3), temporal RF = 14. Both dimensions grow together.
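The receptive field numbers in this example follow a standard recurrence: each layer adds (kernel − 1) × jump, where jump is the product of all earlier strides. A small sketch along one axis, assuming stride-1 convolutions and a pooling stride equal to the pool size:

```python
def receptive_fields(layers):
    """Receptive field after each layer along one axis.

    layers: list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # jump = spacing, in input samples, between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Temporal axis of the tiny 3D CNN: Conv3D(3), Pool3D(4, stride 4), Conv3D(3).
temporal_3d = receptive_fields([(3, 1), (4, 4), (3, 1)])
# Temporal axis of late fusion: every 2D layer has temporal kernel size 1.
temporal_late = receptive_fields([(1, 1), (1, 1), (1, 1)])

print(temporal_3d)    # [3, 6, 14]
print(temporal_late)  # [1, 1, 1]
```

The same function applies to the spatial axes, since the tiny architecture uses the same kernel and pool sizes in space as in time.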

Results on Sports-1M

Karpathy et al. (2014) tested these strategies on Sports-1M, a dataset of 1 million YouTube videos across 487 sports. The results were surprising:

Model	Top-5 Accuracy
Single Frame	77.7%
Early Fusion	76.8%
Late Fusion	78.7%
3D CNN (Slow Fusion)	80.2%
C3D (2015)	84.4%

The single-frame model was shockingly competitive. Early fusion actually hurt performance compared to single-frame, likely because one layer of temporal processing isn't enough and the collapsed features confuse the rest of the network. It took C3D in 2015 — a deeper 3D CNN — to show the real potential of slow temporal fusion.

The Lesson from 2014

Temporal modeling matters, but it's hard to do well. Simple approaches like late fusion and early fusion barely beat (or even lose to) the single-frame baseline. You need to process time gradually and deeply. Shallow temporal processing is almost worse than none at all.

Chapter 03

Two-Stream Networks

Humans don't just see shapes — we see motion. Classic neuroscience experiments by Johansson (1973) showed that people can recognize actions from nothing but moving dots placed on joints. No texture, no color, no background — just motion. If you see a pattern of dots moving in a walking rhythm, you instantly know it's a person walking.

This insight motivates a completely different approach to video understanding: instead of trying to learn motion from raw frames, compute motion explicitly and feed it as a separate input.

Optical Flow

Definition

Optical Flow

A displacement field F between consecutive frames I_t and I_{t+1}. For each pixel (x, y) in frame t, optical flow gives a vector (dx, dy) indicating where that pixel moves to in frame t+1. The constraint: I_{t+1}(x + dx, y + dy) ≈ I_t(x, y). Optical flow is typically stored as two channels: horizontal displacement (dx) and vertical displacement (dy).

Optical Flow Constraint

F(x, y) = (dx, dy) such that I_{t+1}(x + dx, y + dy) ≈ I_t(x, y)
Each pixel gets a 2D motion vector. The full field is an H × W × 2 tensor.

Flow highlights motion while suppressing static background. A person running on grass: the RGB frame shows green everywhere (grass and person), but the optical flow shows motion only where the person's limbs move. Flow is a natural complement to appearance.
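The brightness-constancy constraint can be illustrated on a toy frame. This sketch does not estimate flow; it builds frame t+1 by applying a known, assumed displacement to every pixel of frame t and then checks that the constraint holds. The helper shift_image and the pixel values are illustrative only.

```python
def shift_image(image, dx, dy, fill=0):
    """Move every pixel by (dx, dy); positions with no source pixel get `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                out[ny][nx] = image[y][x]
    return out

frame_t = [
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 7, 6, 0, 0],
    [0, 0, 0, 0, 0],
]
dx, dy = 2, 1  # assumed ground-truth flow, identical for every pixel
frame_t1 = shift_image(frame_t, dx, dy)

# Brightness constancy: I_{t+1}(x + dx, y + dy) == I_t(x, y) wherever the
# displaced pixel stays inside the frame (this loop assumes dx, dy >= 0).
h, w = len(frame_t), len(frame_t[0])
constancy_holds = all(
    frame_t1[y + dy][x + dx] == frame_t[y][x]
    for y in range(h - dy)
    for x in range(w - dx)
)
print(constancy_holds)
```

Real flow fields vary per pixel and the equality is only approximate, because of lighting changes, occlusion, and noise.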

The Two-Stream Architecture

Simonyan and Zisserman (2014) proposed running two separate CNNs in parallel:

Two-Stream Network

Spatial stream: A standard 2D CNN that takes a single RGB frame (3 × H × W) and learns appearance features: "there's a person," "there's a racket," "this looks like a swimming pool."
Temporal stream: A 2D CNN that takes a stack of optical flow images ([2×(T−1)] × H × W) and learns motion features: "something is moving left-to-right," "arms are swinging."
Fusion: Average the class probabilities from both streams (or train an SVM on their concatenated features).

The temporal stream treats flow like an image — the first convolution layer has the full temporal stack (early fusion of flow). But since flow already encodes motion explicitly, this single-layer fusion is much more effective than early fusion on raw RGB.
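Average fusion of the two streams is simple enough to sketch directly. The class scores below are hypothetical and stand in for the outputs of the two CNNs; only the fusion step itself is shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_stream(spatial_scores, temporal_scores):
    """Average the class probabilities of the spatial and temporal streams."""
    p_spatial = softmax(spatial_scores)
    p_temporal = softmax(temporal_scores)
    return [(a + b) / 2 for a, b in zip(p_spatial, p_temporal)]

classes = ["tennis", "badminton"]
spatial_scores = [0.1, 0.0]   # hypothetical: appearance is nearly ambiguous
temporal_scores = [0.0, 2.0]  # hypothetical: motion clearly favors badminton

fused = fuse_two_stream(spatial_scores, temporal_scores)
prediction = classes[fused.index(max(fused))]
print(fused, prediction)
```

Here the spatial stream alone would narrowly pick the first class, while the fused prediction follows the more confident temporal stream.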

Worked Example — Two-Stream Feature Separation

Task: distinguish "playing tennis" from "playing badminton." The spatial stream sees: person, racket, court. Both sports look similar in a still frame. The temporal stream sees: tennis has wide, sweeping arm motions; badminton has quick wrist flicks. The motion patterns are completely different, even though the appearance is similar. Fusing both streams gives the correct answer when neither alone is sufficient.

Results: Motion Matters

Model	UCF-101 Accuracy
3D CNN (slow fusion)	65.4%
Spatial stream only (RGB)	73.0%
Temporal stream only (Flow)	83.7%
Two-stream (average fusion)	86.9%
Two-stream (SVM fusion)	88.0%

Flow Beats Appearance

Look at those numbers: the temporal stream alone (83.7%) crushes the spatial stream alone (73.0%). Motion information is more discriminative than appearance for action recognition. And the two-stream combination (88.0%) shows that appearance and motion provide complementary signals. This is one of the most important insights in video understanding.

The Cost of Optical Flow

Computing optical flow is expensive. Classical methods (Farneback, TV-L1) require iterative optimization per frame pair. Pre-computing flow for a dataset can take longer than training the model itself. Modern methods like FlowNet learn to predict flow with a neural network, but the extra computation and storage remain a significant practical burden. Two-stream networks essentially require pre-processing the entire dataset.

Long-Term Temporal Modeling with RNNs

Two-stream networks, like clip-based 3D CNNs, only see short windows of time (~2-5 seconds). For longer-term temporal structure — "first the person picks up a ball, then throws it" — you can feed per-clip CNN features into a recurrent network (LSTM) that processes the sequence of clips.

This CNN+LSTM approach (Donahue et al., 2015) works, but RNNs are slow for long sequences because each step depends on the previous hidden state, so the computation cannot be parallelized across time. We'll see how transformers solve this problem in Chapter 8.

Recurrent Convolutional Networks

Ballas et al. (2016) proposed a hybrid: replace the fully-connected recurrence in a vanilla RNN with 2D convolution. The hidden state h_t^L at layer L and time t is computed as: h_t^L = tanh(W_h * h_{t-1}^L + W_x * h_t^{L-1}), where * denotes 2D convolution. This preserves spatial structure while adding temporal recurrence at every layer, combining the infinite temporal extent of RNNs with the spatial locality of CNNs.

Chapter 04

3D Convolutions & C3D

Let's build a deeper understanding of what 3D convolution actually does — and why it's fundamentally better than early fusion for temporal modeling.

2D Conv (Early Fusion) vs 3D Conv

In early fusion, the first convolution has a weight tensor of shape C_out × C_in × T × K_h × K_w. It slides over spatial positions (x, y) but not over time. The filter spans the full temporal extent in one shot. The output is a 2D feature map: C_out × H × W. Time has been collapsed.

2D Conv (Early Fusion)

Weight: C_out × C_in × T × K_h × K_w
Input: C_in × T × H × W → Output: C_out × H × W

Slides over (x, y) only. Time is collapsed into channels. No temporal shift-invariance.

This is a problem. If a blue-to-orange color transition occurs at time t=3, the filter needs different weights than if the same transition occurs at t=10. The network lacks temporal shift-invariance — it must learn separate patterns for the same motion at different times.

A 3D convolution has a weight tensor of shape C_out × C_in × K_t × K_h × K_w, where K_t is a small temporal kernel size (typically 3). Crucially, it slides over all three dimensions: time, height, and width.

3D Convolution

Weight: C_out × C_in × K_t × K_h × K_w
Input: C_in × T × H × W → Output: C_out × T × H × W

Slides over (t, x, y). Temporal dimension is PRESERVED. Shift-invariant in time.

The output retains the temporal dimension. A 3D filter that detects "hand moving left" will fire wherever that motion occurs in the clip — early, middle, or late — just as a 2D filter detects "vertical edge" wherever it appears in an image.
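Weight sharing over time can be seen in a one-dimensional analog. The sketch below tracks a single pixel across 16 frames and applies one small temporal filter; the signal, the filter values, and the function names are illustrative assumptions, not part of any published model.

```python
def temporal_conv(signal, kernel):
    """Valid 1D cross-correlation along time, as a conv layer computes it."""
    k = len(kernel)
    return [
        sum(kernel[i] * signal[t + i] for i in range(k))
        for t in range(len(signal) - k + 1)
    ]

def step_signal(length, onset):
    """A pixel that switches from 0 to 1 at frame `onset`."""
    return [1.0 if t >= onset else 0.0 for t in range(length)]

kernel = [-1.0, 0.0, 1.0]  # responds to an increase over time

early = temporal_conv(step_signal(16, onset=3), kernel)
late = temporal_conv(step_signal(16, onset=10), kernel)

# The same weights fire in both clips; only the location of the peak moves.
peak_early = early.index(max(early))
peak_late = late.index(max(late))
print(peak_early, peak_late)
```

Moving the event by 7 frames moves the response by 7 frames and leaves its strength unchanged. An early-fusion filter spanning all 16 frames would need separate weights for each onset time.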

Temporal Shift-Invariance

This is the key advantage of 3D convolution. A 2D conv with early fusion treats time position 1 differently from position 10 — different weights for the same pattern. A 3D conv shares weights across time, just as 2D conv shares weights across space. A "hand waving" filter fires at any time, just as an "edge" filter fires at any spatial location. The first-layer 3D filters learn interpretable spatiotemporal patterns — you can visualize them as short video clips.

C3D: The VGG of 3D CNNs

Tran et al. (2015) built C3D, a straightforward extension of VGG-style design to 3D. The recipe: use 3×3×3 convolutions everywhere and 2×2×2 pooling (except Pool1 which is 1×2×2 to avoid collapsing the temporal dimension too early).

Layer	Output Size (C × T × H × W)	MFLOPs
Input	3 × 16 × 112 × 112	-
Conv1 (3×3×3) + Pool1 (1×2×2)	64 × 16 × 56 × 56	1,040
Conv2 (3×3×3) + Pool2 (2×2×2)	128 × 8 × 28 × 28	11,100
Conv3a, Conv3b + Pool3	256 × 4 × 14 × 14	16,650
Conv4a, Conv4b + Pool4	512 × 2 × 7 × 7	8,320
Conv5a, Conv5b + Pool5	512 × 1 × 3 × 3	1,380
FC6, FC7, FC8	4096, 4096, C	1,000

C3D pretrained on Sports-1M became a popular video feature extractor — the video equivalent of using ImageNet-pretrained ResNet features for images. Many downstream tasks (video retrieval, action detection, video captioning) used C3D features as their starting point.

3D Conv Is Expensive

C3D requires 39.5 GFLOPs per clip. Compare: AlexNet is 0.7 GFLOPs, VGG-16 is 13.6 GFLOPs. C3D is about 2.9× as expensive as VGG-16, and VGG was already considered heavy. The extra temporal kernel dimension multiplies the compute at every layer. This cost motivated the search for more efficient architectures like (2+1)D convolutions and TSM (Chapter 7).
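The convolution rows of the C3D table can be reproduced with a short script. This sketch assumes the table counts one multiply-accumulate as one FLOP and that every convolution is stride 1 with same padding, so output size equals input size. It covers only the convolution layers; the fully-connected row is not recomputed here.

```python
def conv3d_macs(c_in, c_out, t, h, w, kernel=(3, 3, 3)):
    """Multiply-accumulates for a stride-1, same-padded 3D convolution.

    (t, h, w) is the output size, which equals the input size here.
    """
    k_t, k_h, k_w = kernel
    return t * h * w * c_out * c_in * k_t * k_h * k_w

# (name, c_in, c_out, t, h, w) at each conv's input resolution,
# read off the C3D layer table.
c3d_convs = [
    ("conv1", 3, 64, 16, 112, 112),
    ("conv2", 64, 128, 16, 56, 56),
    ("conv3a", 128, 256, 8, 28, 28),
    ("conv3b", 256, 256, 8, 28, 28),
    ("conv4a", 256, 512, 4, 14, 14),
    ("conv4b", 512, 512, 4, 14, 14),
    ("conv5a", 512, 512, 2, 7, 7),
    ("conv5b", 512, 512, 2, 7, 7),
]
macs = {name: conv3d_macs(*shape) for name, *shape in c3d_convs}
total_conv_gmacs = sum(macs.values()) / 1e9

for name, value in macs.items():
    print(f"{name}: {value / 1e9:.2f} G")
print(f"all conv layers: {total_conv_gmacs:.1f} G")
```

Most of the cost sits in conv2 and conv3b, where the channel count has grown but the feature maps are still large in all three dimensions.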

Worked Example — Parameter Count of 3D vs 2D Conv

A 2D conv layer: C_in=64, C_out=128, kernel 3×3. Parameters: 128 × 64 × 3 × 3 = 73,728.

The equivalent 3D conv: C_in=64, C_out=128, kernel 3×3×3. Parameters: 128 × 64 × 3 × 3 × 3 = 221,184. That's 3× as many parameters, and the FLOPs increase by an additional factor because the output has a temporal dimension too.

Chapter 05

Inflated 3D Networks (I3D)

C3D showed that 3D CNNs work, but it was trained from scratch on video. Meanwhile, years of work had produced excellent 2D CNN architectures (Inception, ResNet) with ImageNet-pretrained weights that encode rich spatial features. Could we reuse that knowledge for video?

Carreira and Zisserman (2017) proposed an elegant trick: inflate any 2D CNN into a 3D CNN by replacing every 2D operation with its 3D counterpart.

The Inflation Recipe

2D-to-3D Inflation

Architecture: Take any 2D CNN. Replace every K_h×K_w convolution with a K_t×K_h×K_w convolution. Replace every 2D pooling with 3D pooling.
Weight initialization: For each 2D filter of shape C_in×K_h×K_w, create a 3D filter by duplicating the 2D filter K_t times along the temporal axis, then dividing by K_t.

Weight Inflation

W_3D[:, :, t, :, :] = W_2D[:, :, :, :] / K_t for t = 0, ..., K_t−1

Copy the 2D filter along time, divide by K_t to preserve output magnitude.

Why divide by K_t? Consider what happens if you feed a "constant" video (every frame is the same image) through the inflated 3D network. The 3D convolution sums over the temporal kernel: K_t copies of the same 2D filter, each multiplied by the same input, summed. If each copy has the same weight as the original 2D filter, the output would be K_t times too large. Dividing by K_t ensures the inflated network produces exactly the same output as the original 2D network on static inputs.

Worked Example — Inflation Preserves Output

2D conv filter: W_2D = [[1, 0], [0, 1]] (a 2×2 filter). Input image patch: [[a, b], [c, d]]. 2D output = a + d.

Inflate to 3D with K_t = 3. W_3D[t] = W_2D/3 for each t. Input: 3 identical frames. 3D output = 3 × (a + d)/3 = a + d. Same result!

Now fine-tune on real video: the temporal weights start equal but gradually diverge. The filter learns to detect changes across frames — exactly what motion detection requires.
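The inflation rule and the static-video check from this example can be sketched in NumPy. The function name inflate_conv_weight and the patch values are illustrative assumptions; the weight layout (C_out, C_in, K_h, K_w) follows the convention used in the formula above.

```python
import numpy as np

def inflate_conv_weight(w2d, k_t):
    """Inflate a 2D conv weight (C_out, C_in, K_h, K_w) to
    (C_out, C_in, K_t, K_h, K_w): copy along time, then divide by K_t."""
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], k_t, axis=2)
    return w3d / k_t

# The 2x2 filter from the worked example, with one input and one output channel.
w2d = np.array([[[[1.0, 0.0], [0.0, 1.0]]]])
w3d = inflate_conv_weight(w2d, k_t=3)

# One 2x2 patch [[a, b], [c, d]] and a static "video" made of 3 copies of it.
patch = np.array([[2.0, 5.0], [7.0, 11.0]])
static_clip = np.stack([patch] * 3)        # shape (3, 2, 2)

out_2d = np.sum(w2d[0, 0] * patch)         # a + d
out_3d = np.sum(w3d[0, 0] * static_clip)   # sum over time, height, width
print(out_2d, out_3d)
```

Without the division by K_t, out_3d would come out K_t times larger than out_2d, which is the mismatch the rescaling is there to prevent.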

Bootstrapping from ImageNet

The genius of inflation is that you start from a warm start. The inflated network already knows how to detect edges, textures, objects, and scenes from ImageNet pretraining. It just doesn't know about motion yet. Fine-tuning on video data teaches the temporal kernels to diverge from their copies and detect temporal patterns. You get the best of both worlds: ImageNet's spatial features + learned temporal features.

I3D Results

Carreira and Zisserman inflated the Inception-v1 architecture. They compared several approaches on the Kinetics-400 dataset:

Model	Pretrained On	Top-1 Accuracy
Per-frame CNN (2D Inception)	ImageNet	62.2%
CNN + LSTM	ImageNet	63.3%
Two-Stream CNN	ImageNet	65.6%
3D CNN (C3D-style)	None (from scratch)	56.1%
I3D (RGB only)	None (from scratch)	68.4%
I3D (RGB only)	ImageNet (inflated)	71.1%
I3D (Two-stream)	ImageNet (inflated)	74.2%

Worked Example — The Inflation Boost

I3D (RGB only) trained from scratch on Kinetics: 68.4%. The same architecture initialized with inflated ImageNet weights: 71.1%. That's a 2.7 percentage point boost just from better initialization. Adding the optical flow stream lifts the two-stream I3D to 74.2%. Pretraining matters: video datasets contain fewer labeled examples than ImageNet, so starting from good spatial features is valuable.

Inception Block: Original vs. Inflated

The original Inception block has parallel branches: 1×1 conv, 3×3 conv, 5×5 conv, and 3×3 max pool. In I3D, these become: 1×1×1 conv, 3×3×3 conv, 5×5×5 conv, and 3×3×3 max pool. The 1×1 bottleneck convolutions become 1×1×1 — they don't mix temporal information, only reduce channel dimensions. The 3×3×3 convolutions handle spatiotemporal feature extraction. Each branch sees a different spatiotemporal window.

Still Expensive

I3D inherits the 3D convolution cost problem. An inflated Inception is cheaper than C3D because Inception uses 1×1 bottleneck layers, but it's still significantly more expensive than a 2D model. Running I3D in two-stream mode (RGB + flow) doubles the cost again, plus you need to pre-compute optical flow. The next two chapters address efficiency.

Chapter 06

SlowFast Networks

Here's an insight from neuroscience: the primate visual system processes visual input with two types of cells. Parvocellular cells (P-cells) are slow, detailed, and color-sensitive; they build rich representations of what an object looks like. Magnocellular cells (M-cells) are fast, low-resolution, and motion-sensitive; they rapidly detect temporal changes. In effect, the brain runs two parallel streams at different temporal resolutions.

Feichtenhofer et al. (2019) applied this principle directly to network architecture.

Dual Pathways

Definition

SlowFast Network

A video architecture with two parallel 3D CNN pathways:
Slow pathway: operates at low frame rate (e.g., 4 fps). Uses more channels. Captures rich spatial semantics — what objects are present, their detailed appearance.
Fast pathway: operates at high frame rate (e.g., 32 fps, 8× faster). Uses far fewer channels (a fraction β of the slow pathway's channels, typically β = 1/8). Captures fine-grained temporal motion. Lightweight because fewer channels means fewer FLOPs.

SlowFast Design

Slow pathway: T_slow frames, C channels → ~80% of total compute
Fast pathway: T_fast = τ × T_slow frames, β × C channels
Typical: τ = 8 (8× more frames), β = 1/8 (8× fewer channels)

Per 3D conv layer, Fast pathway FLOPs ≈ τ × β² × Slow FLOPs = 8 × (1/64) × Slow = 12.5% of Slow. Across the whole network, the fast pathway accounts for roughly 20% of total compute.

The genius: the fast pathway accounts for only about 20% of total compute because it uses so few channels. But it sees 8× more frames, giving it fine temporal resolution where it matters most: motion detection.

Lateral Connections

The two pathways aren't independent. Lateral connections feed information from the fast pathway into the slow pathway at multiple resolutions. This lets the slow pathway incorporate motion information without needing to process high-frame-rate input itself.

Lateral Connection Options

Time-to-channel: Reshape the fast pathway's T_fast × C_fast features into T_slow × (τ × C_fast) by folding each group of τ consecutive time steps into the channel dimension. Then concatenate with slow features.
Time strided convolution: Apply a 3D conv with stride τ in time to the fast features, reducing T_fast to T_slow. Then concatenate.

Worked Example — SlowFast Compute Breakdown

Input video: 64 frames at 30 fps (~2 seconds). Slow pathway samples every 8th frame: 8 frames, 64 channels. Fast pathway uses all frames with β = 1/8 of the channels: 64 frames, 8 channels.

Slow pathway FLOPs per 3D conv (3×3×3, 64→64): 8 × H' × W' × 64 × 64 × 27 = X.

Fast pathway FLOPs (3×3×3, 8→8): 64 × H' × W' × 8 × 8 × 27 = X × (64/8) × (8/64)² = X × 8 × (1/64) = X/8.

The fast pathway is 8× cheaper per layer despite processing 8× more frames. Channel count dominates compute.
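The per-layer ratio can be verified numerically. This sketch drops the shared H' × W' factor because it cancels in the ratio; the function name conv3d_cost is an illustrative choice, and the variables tau and beta follow the τ and β of the design summary.

```python
def conv3d_cost(frames, channels_in, channels_out, kernel_volume=27):
    """Relative cost of one 3D conv; the shared H' x W' factor is dropped."""
    return frames * channels_in * channels_out * kernel_volume

tau = 8        # fast pathway sees tau times more frames
beta = 1 / 8   # fast pathway uses a fraction beta of the channels

slow_frames, slow_channels = 8, 64
fast_frames = tau * slow_frames
fast_channels = int(beta * slow_channels)

slow_cost = conv3d_cost(slow_frames, slow_channels, slow_channels)
fast_cost = conv3d_cost(fast_frames, fast_channels, fast_channels)
ratio = fast_cost / slow_cost   # equals tau * beta**2

print(fast_frames, fast_channels, ratio)
```

Cost is linear in the number of frames but quadratic in channel width, since both input and output channels shrink, which is why an 8× increase in frames is more than paid for by an 8× reduction in channels.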

Why Not Just Use More Frames?

You might think: "just feed more frames into a single 3D CNN." But doubling the frame rate doubles the FLOPs, and the network wastes capacity processing detailed spatial features at high temporal resolution. SlowFast's insight is that appearance needs many channels but few frames, while motion needs many frames but few channels. Splitting these responsibilities saves compute while improving accuracy.

SlowFast Results

SlowFast with a ResNet-101 backbone and non-local attention blocks achieved 79.8% top-1 on Kinetics-400, a significant jump from I3D's 74.2%. The non-local blocks (spatio-temporal self-attention inserted between 3D conv layers) help capture long-range dependencies that local 3D convolutions miss.

Definition

Non-Local Block (Spatio-Temporal Self-Attention)

Given features C × T × H × W, project to queries, keys, and values via 1×1×1 conv. Compute attention over all T×H×W positions. Each position attends to every other spatiotemporal position. Added as residual blocks into existing 3D CNNs. Equivalent to self-attention on a flattened sequence of T×H×W tokens. Cost: O((THW)²) — expensive but effective.

Chapter 07

Efficient Video Models

3D convolutions are powerful but expensive. Researchers have found clever ways to approximate their behavior at a fraction of the cost.

(2+1)D Convolutions

Instead of a single 3D convolution with kernel K_t×K_h×K_w, decompose it into two sequential operations:

(2+1)D Decomposition

3D Conv: C_out × C_in × K_t × K_h × K_w
↓ Decompose into:
1. Spatial 2D Conv: M × C_in × 1 × K_h × K_w (process space)
2. Temporal 1D Conv: C_out × M × K_t × 1 × 1 (process time)

M is an intermediate channel count chosen to match the parameter count of the original 3D conv.

This factorization doubles the number of nonlinearities (a ReLU after each sub-convolution) while having roughly the same parameter count. Tran et al. (2018) showed R(2+1)D matched or exceeded full 3D ResNets on Kinetics.

Worked Example — Parameter Savings

Full 3D conv: C_in=64, C_out=64, kernel 3×3×3. Params: 64 × 64 × 27 = 110,592.

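(2+1)D with M=64: Spatial (64×64×9) + Temporal (64×64×3) = 36,864 + 12,288 = 49,152. That's about 56% fewer parameters for the same receptive field. To match the 3D parameter count instead, as R(2+1)D does, M would be 110,592 / (64×9 + 64×3) = 144. Either way, the extra ReLU between the spatial and temporal convolutions adds a nonlinearity at no parameter cost.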
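The parameter counts for the full 3D convolution and its (2+1)D factorization are easy to compute for any layer width. The function names below are illustrative, and biases are ignored as in the worked example. The matching_m helper solves for the intermediate width M at which the factorized block has as many parameters as the 3D convolution.

```python
def params_3d(c_in, c_out, k_t=3, k_s=3):
    """Weights in a full k_t x k_s x k_s 3D convolution (no bias)."""
    return c_out * c_in * k_t * k_s * k_s

def params_2plus1d(c_in, c_out, m, k_t=3, k_s=3):
    """Weights in the factorized block with M intermediate channels."""
    spatial = m * c_in * k_s * k_s   # 1 x k_s x k_s conv into M channels
    temporal = c_out * m * k_t       # k_t x 1 x 1 conv out of M channels
    return spatial + temporal

def matching_m(c_in, c_out, k_t=3, k_s=3):
    """Largest M whose (2+1)D block has no more weights than the 3D conv."""
    return params_3d(c_in, c_out, k_t, k_s) // (c_in * k_s * k_s + c_out * k_t)

full = params_3d(64, 64)
factorized = params_2plus1d(64, 64, m=64)
m_match = matching_m(64, 64)
print(full, factorized, m_match, params_2plus1d(64, 64, m=m_match))
```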

Temporal Shift Module (TSM)

Lin et al. (2019) asked a radical question: what if we could capture temporal information using zero additional parameters and zero extra FLOPs?

Definition

Temporal Shift Module (TSM)

Given a feature tensor of shape C × T × H × W, TSM shifts a portion of the channels along the time axis. Specifically: shift the first C/8 channels forward by one frame, shift the next C/8 channels backward by one frame, and leave the remaining 3C/4 channels unchanged. This creates temporal mixing between adjacent frames at zero cost — it's just a memory move, not a computation.

TSM Operation

Given features X of shape C × T × H × W:
X[0:C/8, t, :, :] ← X[0:C/8, t-1, :, :]   (shift forward in time: frame t receives past info)
X[C/8:C/4, t, :, :] ← X[C/8:C/4, t+1, :, :]   (shift backward in time: frame t receives future info)
X[C/4:, t, :, :] ← X[C/4:, t, :, :]   (no change: current frame)

When you insert TSM before a standard 2D convolution, the 2D conv now processes a mixture of features from the current frame, the previous frame, and the next frame. The spatial convolution implicitly becomes a spatiotemporal operation — without any 3D weights.
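The shift itself takes only a few lines of NumPy. This is a minimal sketch of the operation as described here, with zero padding at the clip boundaries (an assumption about edge handling). The function name and the fold_div parameter are our own choices and are not meant to mirror any particular released implementation.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Shift 1/fold_div of the channels each way along time (zero padded).

    x has shape (C, T, H, W).
    """
    c = x.shape[0]
    fold = c // fold_div
    out = np.zeros_like(x)
    # Channels [0, fold): frame t receives features from frame t-1 (past).
    out[:fold, 1:] = x[:fold, :-1]
    # Channels [fold, 2*fold): frame t receives features from frame t+1 (future).
    out[fold:2 * fold, :-1] = x[fold:2 * fold, 1:]
    # Remaining channels stay in place.
    out[2 * fold:] = x[2 * fold:]
    return out

# Toy input: every value equals its frame index, so the shift is easy to read.
C, T, H, W = 8, 4, 1, 1
frame_ids = np.arange(T, dtype=np.float32).reshape(1, T, 1, 1)
x = np.broadcast_to(frame_ids, (C, T, H, W)).copy()
y = temporal_shift(x)

print(y[0, :, 0, 0])  # channel fed from the past
print(y[1, :, 0, 0])  # channel fed from the future
print(y[2, :, 0, 0])  # untouched channel
```

The function contains no multiplications and no learned weights, only array copies.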

The Elegance of TSM

TSM is profound because it means any 2D CNN (ResNet, MobileNet, EfficientNet) can be converted to a video model by simply inserting channel shifts before each residual block. No architecture changes. No new parameters. No new FLOPs. A TSM-ResNet-50 achieves comparable accuracy to 3D ResNets on Kinetics while running at the same speed as a 2D ResNet-50. This makes it especially attractive for deployment on mobile devices and edge hardware.

Worked Example — How TSM Creates Temporal Features

Frame t-1 has a ball on the left. Frame t has the ball in the center. Frame t+1 has the ball on the right. After TSM, the feature map at time t contains: some channels from t-1 (ball-left features), most channels from t (ball-center features), and some channels from t+1 (ball-right features).

When the 2D conv processes this mixed feature map, it sees ball-left, ball-center, and ball-right information simultaneously. A filter that responds to "ball moving right" will fire — exactly like a temporal convolution would, but without 3D weights.

Efficiency Comparison

Model	GFLOPs	Kinetics-400 Top-1	Extra Params vs 2D
ResNet-50 (2D, per-frame)	~4	~63%	0
C3D	39.5	~65%	All new
I3D (Inception)	~108	71.1%	Inflated
R(2+1)D-34	~152	72.0%	Factorized
TSM-ResNet-50	~4 × T	~74%	0 (zero)
SlowFast R-101+NL	~234	79.8%	Dual path

No Free Lunch

TSM only shifts channels by one frame. Its temporal receptive field grows linearly with depth (one more frame in each direction per shifted block), not geometrically as with strided 3D pooling. For actions requiring very long temporal reasoning, TSM may underperform heavier models. Also, TSM doesn't help the first layer, since there are no features to shift yet. For maximum accuracy, you still want 3D convolutions or transformers. TSM shines when compute budget is tight.

Chapter 08

Video Transformers

Self-attention doesn't care about local neighborhoods. It compares every position to every other position. For video, that means every patch in every frame can attend to every patch in every other frame. In principle, this gives a global receptive field from layer one, with no gradual buildup needed.

But there's a problem: a video clip with T=8, H=14, W=14 (after patch embedding) has T × H × W = 1,568 tokens. Full self-attention is O(N²), so the attention matrix has 1,568² ≈ 2.5 million entries. For longer clips or higher resolution, this explodes.

TimeSformer: Factorized Attention

Bertasius et al. (2021) proposed Divided Space-Time Attention: instead of full spatiotemporal attention, alternate between two types of attention in each transformer block:

TimeSformer: Divided Space-Time Attention

Temporal attention: Each spatial position (h, w) attends only across time (comparing the same patch location across all T frames). Complexity: O(T²) per position, O(T² × H × W) total.
Spatial attention: Each time step t attends only across space (standard ViT attention within one frame). Complexity: O((HW)²) per frame, O((HW)² × T) total.

Attention Complexity Comparison

Full spatiotemporal: O((T × H × W)²) = O(T² × H² × W²)
Factorized (TimeSformer): O(T² × H × W + T × (H × W)²)

With T=8, H=W=14: Full ≈ 2.5M query-key pairs, Factorized ≈ 12.5K + 307K ≈ 320K. That's roughly a 7.7× reduction.
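Both complexity expressions can be evaluated for any clip size. This sketch counts query-key pairs for one attention layer and ignores heads, the embedding dimension, and any classification token; the function names are illustrative.

```python
def full_attention_pairs(t, h, w):
    """Query-key pairs for joint space-time attention over T*H*W tokens."""
    n = t * h * w
    return n * n

def divided_attention_pairs(t, h, w):
    """Query-key pairs for one temporal plus one spatial attention step."""
    temporal = (t * t) * (h * w)   # each of H*W locations attends over T frames
    spatial = t * (h * w) ** 2     # each of T frames attends over H*W patches
    return temporal + spatial

T, H, W = 8, 14, 14
full = full_attention_pairs(T, H, W)
divided = divided_attention_pairs(T, H, W)
print(full, divided, round(full / divided, 1))
```

The ratio simplifies to THW / (T + HW). With a fixed spatial grid, that means the saving grows as clips get longer.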

Worked Example — TimeSformer Attention

Consider a patch at position (h=5, w=7) in a clip with T=8 frames. In temporal attention, this patch attends to the 8 patches at (5, 7) across all frames — comparing "what does this same location look like over time?" This captures motion at that position. In spatial attention, the patch at frame t=3 attends to all 14×14=196 patches within frame 3 — standard image self-attention. By alternating, the model builds spatiotemporal representations without the quadratic blowup.

ViViT: Four Factorization Strategies

Arnab et al. (2021) systematically compared four ways to factorize video transformers:

Model	Strategy	Description
ViViT-1	Spatio-temporal tokens	Full attention over all T×H×W tokens. Most accurate, most expensive.
ViViT-2	Factorized encoder	Spatial transformer first, then temporal transformer on CLS tokens.
ViViT-3	Factorized self-attention	TimeSformer-style: alternate spatial and temporal attention in each block.
ViViT-4	Factorized dot-product	Split the attention heads: half attend over space, half over time, and their outputs are concatenated.

Tubelet Embedding

ViViT introduces tubelet embedding: instead of embedding individual 2D patches, embed 3D tubes (t × h × w) from the video volume. A tubelet of size 2×16×16 means each token represents 2 frames and 16×16 pixels. This reduces the number of tokens (fewer = cheaper attention) while giving each token temporal context from the start. Think of it as 3D patch embedding — the video equivalent of ViT's 2D patch embedding.

VideoMAE: Self-Supervised Pretraining

Tong et al. (2022) extended Masked Autoencoders (MAE) to video. The key insight: video is highly redundant. Consecutive frames are nearly identical. You can mask a very high ratio of tubes (90-95%) and still reconstruct the video, because adjacent visible tubes provide strong clues.

Definition

VideoMAE

A self-supervised pretraining method for video transformers. Divide the video into spatiotemporal tubes. Randomly mask 90-95% of them. Feed the visible tubes into a ViT encoder. Reconstruct the masked tubes with a lightweight decoder. The encoder learns rich spatiotemporal representations without any labeled data.

VideoMAE V2 (Wang et al., 2023) scaled this approach to ViT-giant with 1 billion parameters, achieving 90.0% top-1 on Kinetics-400, the state of the art for single models at the time of publication. The high masking ratio makes pretraining efficient: the encoder only processes 5-10% of tokens, dramatically reducing FLOPs and memory.

Worked Example — VideoMAE Masking Efficiency

Input: 16 frames, 14×14 spatial grid, tubelet size 2×16×16. Total tubes: (16/2) × 14 × 14 = 1,568. At 90% masking: only 157 visible tubes are processed by the encoder. The encoder's self-attention cost drops from O(1568²) to O(157²) — a 100× reduction. This makes it practical to pretrain very large models on massive video datasets.
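The arithmetic above can be reproduced with a short sketch. It assumes uniformly random masking over tube indices, which is a simplification: the published method shares the mask across time (tube masking). The function names and the fixed seed are illustrative choices.

```python
import random


def sample_visible_tubes(num_tubes, mask_ratio, seed=0):
    """Randomly mask a fraction of tubes and return the sorted visible indices."""
    rng = random.Random(seed)
    num_masked = int(num_tubes * mask_ratio)
    masked = set(rng.sample(range(num_tubes), num_masked))
    return [i for i in range(num_tubes) if i not in masked]


def attention_cost_ratio(num_tubes, num_visible):
    """How many times cheaper quadratic self-attention is on visible tubes only."""
    return (num_tubes / num_visible) ** 2


if __name__ == "__main__":
    total = (16 // 2) * 14 * 14          # 1,568 tubes
    visible = sample_visible_tubes(total, mask_ratio=0.9)
    print(len(visible), "visible tubes")
    print(round(attention_cost_ratio(total, len(visible)), 1), "x cheaper attention")
```

Only the visible indices would be passed to the encoder; the decoder receives them together with placeholder tokens for the masked positions.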

The State of the Art

Model	Year	Kinetics-400 Top-1	Architecture
Per-frame CNN	2014	63.3%	2D Inception
Two-Stream I3D	2017	74.2%	Inflated Inception
SlowFast R-101+NL	2019	79.8%	Dual-path 3D ResNet
MViTv2-L	2022	86.1%	Multiscale ViT
VideoMAE V2-g	2023	90.0%	ViT-giant + MAE

The Transformer Takeover

The progression is clear: 63% → 74% → 80% → 86% → 90%. Each major jump came from a paradigm shift: 2D CNNs → 3D CNNs → efficient 3D architectures → vision transformers → self-supervised pretraining. The strongest entry in the table (VideoMAE V2) combines ViT's global attention with masked autoencoder pretraining at enormous scale. But notice: even the 2017 I3D was already at 74%. The last 16 percentage points took six more years of research.

Chapter 09

Architecture Comparison

Let's bring everything together by comparing how each video architecture processes temporal information, what it costs, and what its key innovation is.

The Design Space

Every architecture in this lecture makes a specific trade-off along three axes:

Architecture	Temporal Modeling	Compute Cost	Key Innovation
Single Frame	None (avg at test)	Very low	Baseline
Two-Stream	Optical flow stream	Medium + flow	Explicit motion input
C3D	3D conv throughout	High (39.5 GFLOPs)	VGG-style 3D
I3D	Inflated 3D conv	High (108 GFLOPs)	ImageNet weight transfer
SlowFast	Dual temporal rate	High (234 GFLOPs)	Slow + fast pathways
TSM	Channel shifting	Very low (= 2D)	Zero-cost temporal mixing
TimeSformer	Factorized attention	Medium-high	Divided space-time attn
VideoMAE V2	Full attention + MAE	High (pretraining)	90% masking ratio

Choosing an Architecture

Compute-limited? Use TSM: it turns any 2D CNN into a video model with no extra parameters or FLOPs. Need maximum accuracy? VideoMAE V2 with a ViT-giant backbone. Want a good middle ground? SlowFast or MViTv2. Need interpretable motion features? Two-stream with optical flow. The right answer depends on your deployment environment, dataset size, and latency requirements.

Historical Arc

2014: Karpathy et al.'s fusion experiments show that a single-frame baseline is embarrassingly strong.
2014: Simonyan and Zisserman's two-stream network proves motion matters.
2015: C3D shows 3D convolutions can learn temporal features end-to-end.
2017: I3D shows ImageNet pretraining transfers to video.
2019: SlowFast shows asymmetric temporal resolution is better.
2019: TSM shows you can do temporal modeling for free.
2021: TimeSformer/ViViT bring transformers to video.
2022-23: VideoMAE shows self-supervised pretraining at scale tops the Kinetics-400 results.

Chapter 10

Multimodal Video & Connections

Video doesn't exist in a vacuum. Real videos have audio — speech, music, ambient sounds — that provides powerful complementary signals. A person's lips move in sync with their voice. Musical instruments produce characteristic sounds. Dogs bark and birds chirp. Multimodal video understanding fuses visual and audio streams for richer scene comprehension.

Audio-Visual Fusion

The McGurk effect demonstrates how deeply intertwined vision and audio are in human perception. When you hear "ba" while watching lips say "ga," you perceive "da": your brain fuses the conflicting signals into a third percept. Neural networks can similarly learn to fuse audio and video.

Definition

Audio-Visual Source Separation

Given a video with a mixture of sounds (two people speaking, multiple instruments playing), separate the audio into individual sources using the visual signal as guidance. If you can see a person's lips moving, you can isolate their voice from the mixture. If you can see a guitar being strummed, you can separate the guitar audio from the violin playing simultaneously.

Gao and Grauman (2021) demonstrated this with VisualVoice: given a video of two people speaking simultaneously, the model separates their voices by using each person's face and lip movements to guide the audio separation. The visual modality acts as a "query" that tells the audio model whose voice to extract.

Efficient Video Understanding

For long-form videos (minutes to hours), processing every frame is impractical even with efficient models. Several strategies address this:

Approaches to Long Video

Salient clip sampling (Korbar et al., 2019): Instead of uniform sampling, predict which clips are most informative and only process those. Use a lightweight "preview" model to score candidate clips.

Audio as preview (Gao et al., 2020): Use audio features to decide which video frames to process. Audio is 1000× cheaper to process than video. If the audio suggests "someone is speaking," process the corresponding frames; if "silence," skip them.

Mobile architectures (MoViNets, Kondratyuk et al., 2021): Stream-based architectures that process one frame at a time using causal temporal convolutions, maintaining a fixed-size hidden state. No need to buffer entire clips.
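The first two strategies share one pattern: score every candidate clip with something cheap, then run the expensive model only on the top few. A minimal sketch of the selection step follows; the scores are made-up numbers standing in for the output of a lightweight preview model (visual or audio), and `select_salient_clips` is a hypothetical helper.

```python
def select_salient_clips(scores, k):
    """Return indices of the k highest-scoring clips, in temporal order."""
    if k <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical saliency scores for five candidate clips from one long video.
    preview_scores = [0.1, 0.9, 0.3, 0.8, 0.2]
    print(select_salient_clips(preview_scores, k=2))
```

The expensive classifier then runs only on the selected clips, and its predictions are averaged as in the usual clip-then-aggregate setup.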

Beyond Classification

Video understanding extends well beyond action classification:

Task	Input	Output
Temporal action localization	Long untrimmed video	Start/end times + labels for each action
Spatio-temporal detection	Video sequence	Bounding boxes + actions per person per frame
Video captioning	Video clip	Natural language description
Video question answering	Video + question text	Natural language answer
Egocentric understanding	First-person video	Activity/object/interaction recognition

Video + LLMs

The latest frontier: combining video encoders with large language models. Systems like Video-LLaVA (Lin et al., 2024) and VideoLLaMA 3 (Zhang et al., 2025) embed video frames using a ViT encoder, project the visual tokens into the LLM's embedding space, and let the LLM reason about the video using its language capabilities. This enables open-ended video understanding: "What is happening in this video?" "Why did the person look surprised?" "What will happen next?"
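The projection step can be sketched in a few lines. This assumes a single linear layer for illustration; actual systems may use a different projector (for example an MLP), and the tiny dimensions and weights below are made up.

```python
def project_visual_tokens(tokens, weight, bias):
    """Map each visual token (length d_vis) to the LLM width (length d_llm).

    weight is a d_vis x d_llm matrix as nested lists; bias has length d_llm.
    """
    d_llm = len(bias)
    projected = []
    for tok in tokens:
        out = [
            bias[j] + sum(tok[i] * weight[i][j] for i in range(len(tok)))
            for j in range(d_llm)
        ]
        projected.append(out)
    return projected


if __name__ == "__main__":
    visual_tokens = [[1.0, 2.0], [0.0, 1.0]]        # 2 tokens, d_vis = 2
    weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]     # d_vis x d_llm with d_llm = 3
    bias = [0.0, 0.0, 0.5]
    print(project_visual_tokens(visual_tokens, weight, bias))
```

The projected tokens have the same width as the LLM's text embeddings, so they can be placed in the input sequence alongside the embedded question text.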

The Trajectory

Video understanding has followed a clear arc: hand-crafted motion (optical flow) → learned 3D features (C3D/I3D) → efficient temporal modeling (SlowFast/TSM) → global attention (transformers) → self-supervised pretraining (VideoMAE) → multimodal reasoning (Video-LLMs). Each step increased both the temporal reasoning capacity and the generality of the representations. The field is converging toward unified models that see, hear, and reason about video in natural language.

Key Takeaways

Concept	Key Insight
Temporal modeling	Late fusion is lazy; slow fusion (3D conv) builds temporal features gradually
Motion representation	Optical flow provides explicit motion; 3D conv learns implicit motion
Weight reuse	Inflate 2D ImageNet weights to 3D — warm start beats cold start
Asymmetric design	Appearance needs channels, motion needs frames (SlowFast)
Zero-cost temporal	Channel shifting (TSM) gives temporal modeling for free
Attention factorization	Separate spatial and temporal attention avoids the O((T×H×W)²) blowup of full attention
Self-supervised scaling	90% masking + large ViTs = state-of-the-art with label-free pretraining
