Computer Vision · Video Understanding

Action Recognition
From Absolute Zero

How does a computer watch a video and say "that's playing baseball" — when all it sees is pixels changing over time?

Prerequisites: Basic CNN intuition + curiosity about video. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Is This Hard?

You see a video of someone swinging a bat. Instantly you know: playing baseball. But how? The video is just a grid of pixels changing over time. There is no label embedded in the photons. Your brain is doing something extraordinary — recognizing a pattern that spans both space (the shape of a bat, a person's pose) and time (the swing motion, the follow-through).

Now give this task to a computer. A video is a tensor of shape [T, 3, H, W] — T frames, each with 3 color channels, at height H and width W. A 10-second clip at 30 fps is 300 frames × 3 × 224 × 224 = 45 million numbers. Somewhere in those 45 million numbers is the information "playing baseball." The computer must find it.
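To make the size of the problem concrete, here is a minimal sketch of the arithmetic above. The helper name `video_numel` is made up for illustration; the numbers are the ones from the paragraph.

```python
def video_numel(seconds, fps, channels=3, height=224, width=224):
    """Return (frame count, total scalar values) for a [T, C, H, W] clip."""
    frames = seconds * fps
    return frames, frames * channels * height * width


frames, total = video_numel(seconds=10, fps=30)
print(frames, total)  # 300 45158400 (about 45 million)
```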

The core challenge: Images have spatial patterns (edges, textures, objects). Video adds a whole new dimension — temporal patterns (motion, rhythm, sequence). "Waving" and "pointing" can look identical in a single frame. Only motion distinguishes them.
Same Pose, Different Action

Two stick figures with the same pose in the middle frame. Click Play to see the motion — one is waving, the other is pointing. A single frame cannot tell them apart.

This is why action recognition is fundamentally harder than image classification. ImageNet asks "what object is this?" — a spatial question. Kinetics asks "what activity is happening?" — a spatiotemporal question. You need to understand both what things look like and how they move.

Why can't a single image frame always determine the action?

Chapter 1: Single Frame — The Naive Baseline

The simplest approach: take each frame, run it through an image classifier (a CNN like ResNet), and average the predictions. For every frame independently: input [3, 224, 224] → ResNet → softmax over K action classes.

This works surprisingly often. If you see a swimming pool, you can guess "swimming." If you see a tennis court with a racket, you can guess "playing tennis." These are scene-biased actions — the background gives it away.

Video
[T, 3, 224, 224]
↓ sample N frames
Each Frame
[3, 224, 224] → ResNet → [K] logits
↓ average N score vectors
Prediction
argmax of [K] averaged scores
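The pipeline above can be sketched in a few lines of plain Python. This is only an illustration: the logits are hypothetical stand-ins for what a ResNet would output, and the helper names are assumptions, not part of any library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's K logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_frame_scores(per_frame_logits):
    """Average the per-frame softmax vectors, then take the argmax."""
    probs = [softmax(logits) for logits in per_frame_logits]
    k = len(probs[0])
    avg = [sum(p[i] for p in probs) / len(probs) for i in range(k)]
    return avg, max(range(k), key=avg.__getitem__)


# Hypothetical logits for N=3 sampled frames and K=2 classes
# (index 0 = "waving", index 1 = "pointing").
frame_logits = [[2.0, 1.0], [0.5, 1.5], [3.0, 0.0]]
scores, pred = average_frame_scores(frame_logits)
print(pred)  # 0: two of the three frames lean towards class 0
```

Notice that the order of the frames never enters the computation. Shuffle them and the prediction is identical, which is exactly why this baseline cannot see motion.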

But this fails catastrophically for temporal actions. Waving vs. pointing. Opening a door vs. closing a door. Picking something up vs. putting it down. The per-frame appearance is nearly identical — only the direction of motion over time tells them apart.

The dataset test: On UCF-101, per-frame CNNs get ~73% accuracy. Impressive — but a lot of UCF-101 is scene-biased (playing guitar with a guitar visible, skiing on snow). On Something-Something v2, which requires temporal reasoning ("pushing something from left to right" vs. "right to left"), per-frame CNNs crash to ~20%. The gap reveals what's missing: motion understanding.
Per-Frame Classification

Watch a CNN classify individual frames. The prediction flickers — some frames look ambiguous. Click Run to see per-frame confidence scores for two actions.

| Dataset | Per-Frame CNN | Temporal Model | Gap (percentage points) |
|---|---|---|---|
| UCF-101 (scene-biased) | ~73% | ~95% | 22 |
| Something-Something (temporal) | ~20% | ~65% | 45 |
| Kinetics-400 (mixed) | ~62% | ~79% | 17 |

When does single-frame classification work well?

Chapter 2: Two-Stream Networks

In 2014, Simonyan & Zisserman had a beautiful insight: the human visual cortex processes appearance and motion in separate pathways (the ventral and dorsal streams). What if we give the CNN two separate inputs — one for what things look like, and one for how they move?

Optical flow is a field that captures pixel motion between frames. At each pixel, it stores a 2D vector (dx, dy) — how far and in which direction that pixel moved. Stack L consecutive flow frames and you get a motion "image" of shape [2L, 224, 224].

Spatial Stream
RGB frame [3, 224, 224] → CNN → [K] scores
 
Temporal Stream
Stacked flow [2L, 224, 224] → CNN → [K] scores
↓ late fusion: average
Final Prediction
argmax of averaged [K] scores

Why stacked flow? A single flow frame captures instantaneous motion, but actions have temporal extent. Stacking L=10 flow frames (so 20 channels — 10 horizontal, 10 vertical) gives the temporal CNN a window of motion to reason over. The input acts like a "motion image" that a standard 2D CNN can read.
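Here is a minimal NumPy sketch of how L flow fields could be stacked into one "motion image". The random arrays stand in for real optical flow, and the channel order (all horizontal components first, then all vertical) is an assumption made to match the description above; implementations differ on this ordering.

```python
import numpy as np


def stack_flow(flows):
    """Stack L flow fields of shape [2, H, W] into one [2L, H, W] input.

    Channels 0..L-1 hold the horizontal (dx) components,
    channels L..2L-1 hold the vertical (dy) components.
    """
    flows = np.stack(flows)          # [L, 2, H, W]
    dx = flows[:, 0]                 # [L, H, W]
    dy = flows[:, 1]                 # [L, H, W]
    return np.concatenate([dx, dy], axis=0)


L, H, W = 10, 224, 224
rng = np.random.default_rng(0)
# Placeholder flow fields; a real pipeline would compute these from frame pairs.
flows = [rng.standard_normal((2, H, W)).astype(np.float32) for _ in range(L)]
stacked = stack_flow(flows)
print(stacked.shape)  # (20, 224, 224)
```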

Late fusion is simple but powerful. Each stream produces a K-dimensional score vector. Late fusion just averages them: final = 0.5 × spatial + 0.5 × temporal. This works because the streams capture complementary information: appearance ("there's a ball and a bat") and motion ("swinging motion"). On UCF-101, spatial alone gets ~73%, temporal alone gets ~83%, and the fused model gets ~87% with this averaging (~88% when an SVM is trained on the stacked scores instead).
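The fusion rule is small enough to write out. A minimal sketch, with hypothetical score vectors for a two-class problem ("waving" vs. "pointing"); the function name and the `spatial_weight` parameter are illustrative assumptions:

```python
def late_fusion(spatial, temporal, spatial_weight=0.5):
    """Weighted average of two K-dimensional score vectors."""
    w = spatial_weight
    return [w * s + (1.0 - w) * t for s, t in zip(spatial, temporal)]


# Hypothetical scores: the appearance stream cannot decide,
# the motion stream is confident the action is "waving".
spatial_scores = [0.5, 0.5]
temporal_scores = [0.9, 0.1]

fused = late_fusion(spatial_scores, temporal_scores)
print([round(x, 3) for x in fused])  # [0.7, 0.3]
```

Setting `spatial_weight=1.0` recovers the single-frame baseline, and `0.0` uses motion only.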
Two-Stream Fusion

Adjust the fusion weight between spatial (appearance) and temporal (motion) streams. Watch how confidence changes for a temporal action like "waving."

The cost: optical flow must be precomputed for every video. Traditional methods (TV-L1) take ~0.06s per frame pair on a GPU. For Kinetics-400 with 300K videos averaging 250 frames each, that's ~1250 GPU-hours just for flow extraction — before any training. This bottleneck motivated the move to methods that learn motion features directly.
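The estimate above is simple back-of-the-envelope arithmetic. A sketch, assuming roughly one flow computation per frame (strictly, a video with n frames has n − 1 consecutive pairs); the function name is made up for illustration:

```python
def flow_gpu_hours(num_videos, frames_per_video, seconds_per_pair):
    """Rough GPU-hours needed to precompute optical flow for a dataset."""
    pairs = num_videos * frames_per_video   # approximation: one pair per frame
    return pairs * seconds_per_pair / 3600.0


hours = flow_gpu_hours(num_videos=300_000, frames_per_video=250,
                       seconds_per_pair=0.06)
print(round(hours))  # 1250
```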

What does the temporal stream's input (stacked optical flow) represent?

Chapter 3: 3D Convolutions — C3D and I3D

Two-stream networks need precomputed optical flow. What if the network could learn to extract motion features by itself? The idea: extend 2D convolutions into 3D. A 2D conv kernel is [k, k] — it slides over height and width. A 3D conv kernel is [k, k, k] — it slides over time, height, and width simultaneously.

2D conv: kernel [C_out, C_in, k_h, k_w]  →  output [C_out, H', W']
3D conv: kernel [C_out, C_in, k_t, k_h, k_w]  →  output [C_out, T', H', W']
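Where do T', H' and W' come from? Each axis follows the usual convolution size rule. The sketch below applies it to an illustrative 16-frame, 112 × 112 input; the helper names and the example sizes are assumptions chosen for the demonstration.

```python
def conv_out_len(size, kernel, stride=1, padding=0):
    """Output length along one axis of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_out_shape(c_out, t, h, w, kernel=(3, 3, 3),
                     stride=(1, 1, 1), padding=(0, 0, 0)):
    """Shape [C_out, T', H', W'] produced by a 3D convolution."""
    dims = [conv_out_len(n, k, s, p)
            for n, k, s, p in zip((t, h, w), kernel, stride, padding)]
    return (c_out, *dims)


# No padding: every axis shrinks by k - 1.
print(conv3d_out_shape(64, 16, 112, 112))                     # (64, 14, 110, 110)
# Padding of 1 with a 3x3x3 kernel keeps T, H and W unchanged.
print(conv3d_out_shape(64, 16, 112, 112, padding=(1, 1, 1)))  # (64, 16, 112, 112)
```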

C3D (Tran et al., 2015) was the first major architecture: 8 layers of 3D convolutions with [3, 3, 3] kernels, processing 16-frame clips of shape [3, 16, 112, 112]. Each 3D conv layer captures local spatiotemporal patterns — an edge that moves, a texture that changes, a limb that rotates.

Why 3×3×3? Tran et al. fixed the spatial size at 3×3 and varied the kernel's temporal depth: 1 (equivalent to running a 2D conv on each frame separately), 3, 5, and 7, plus variants whose depth increased or decreased across layers. Depth 3 at every layer performed best. The takeaway is that modeling motion and appearance together at every layer works better than treating frames independently.

I3D (Carreira & Zisserman, 2017) had a smarter idea: inflate a pretrained 2D network into 3D. Take an Inception-v1 trained on ImageNet. Every 2D conv kernel [k, k] becomes [k, k, k] by repeating the weights along the time axis and dividing by k. Every 2D pooling layer gets a temporal dimension. The result: a 3D network that starts with strong spatial features from ImageNet and only needs to learn the temporal part.

Video Clip
[3, 16, 224, 224]
↓ 3D conv layers
Feature Maps
[C, T', H', W'] — shrink in all 3 dims
↓ global average pool
Feature Vector
[1024]
↓ linear classifier
Prediction
[K] class scores
2D vs 3D Convolution

A 2D kernel slides over space only. A 3D kernel slides over space and time. Watch the orange kernel move through a video volume.

The inflation trick in detail: a 2D kernel W of shape [C_out, C_in, k, k] becomes W_3d of shape [C_out, C_in, k, k, k]. Each temporal slice gets W / k. This ensures the 3D kernel produces the same output as the 2D kernel when applied to a static (repeated) image — so the pretrained features transfer perfectly. Then fine-tuning on video teaches the temporal dimension.
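The claim that an inflated kernel reproduces the 2D output on a static video can be checked numerically. The sketch below uses naive NumPy loops instead of a deep-learning framework, random weights instead of real pretrained ones, and made-up helper names, so treat it as a sanity check of the idea rather than anyone's actual implementation.

```python
import numpy as np


def inflate_kernel(w2d, kt):
    """[C_out, C_in, k, k] -> [C_out, C_in, kt, k, k]; each time slice is w2d / kt."""
    return np.repeat(w2d[:, :, None, :, :], kt, axis=2) / kt


def conv2d(x, w):
    """Valid cross-correlation. x: [C_in, H, W], w: [C_out, C_in, k, k]."""
    c_out, _, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + k, j:j + k]                  # [C_in, k, k]
            out[:, i, j] = np.tensordot(w, patch, axes=3)   # sum over C_in, k, k
    return out


def conv3d(x, w):
    """Valid cross-correlation. x: [C_in, T, H, W], w: [C_out, C_in, kt, k, k]."""
    c_out, _, kt, k, _ = w.shape
    t_out = x.shape[1] - kt + 1
    h_out, w_out = x.shape[2] - k + 1, x.shape[3] - k + 1
    out = np.zeros((c_out, t_out, h_out, w_out))
    for t in range(t_out):
        for i in range(h_out):
            for j in range(w_out):
                patch = x[:, t:t + kt, i:i + k, j:j + k]    # [C_in, kt, k, k]
                out[:, t, i, j] = np.tensordot(w, patch, axes=4)
    return out


rng = np.random.default_rng(0)
w2d = rng.standard_normal((4, 3, 3, 3))          # stand-in for pretrained weights
image = rng.standard_normal((3, 8, 8))
video = np.repeat(image[:, None], 5, axis=1)     # static video: [3, 5, 8, 8]

out2d = conv2d(image, w2d)                       # [4, 6, 6]
out3d = conv3d(video, inflate_kernel(w2d, 3))    # [4, 3, 6, 6]
same = all(np.allclose(out3d[:, t], out2d) for t in range(out3d.shape[1]))
print(same)  # True
```

Every temporal slice of the 3D output matches the 2D output because the kt copies of W / kt sum back to W when all frames are identical.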

What is the key advantage of I3D's "inflation" approach over training a 3D CNN from scratch?

Chapter 4: SlowFast Networks

Here's a biological insight: in the primate visual system, roughly 80% of retinal ganglion cells are P-cells, which respond slowly but carry color and fine spatial detail, while about 15–20% are M-cells, which respond to fast temporal changes but carry little color or spatial detail. Feichtenhofer et al. (2019) turned this into an architecture with two pathways operating at different temporal resolutions.

The Slow pathway sees few frames but has high channel capacity (e.g., 64 channels). With a temporal stride of τ=16 it keeps 4 frames from a 64-frame raw clip, which is roughly 2 frames per second on 30-fps video. It captures what things look like in fine spatial detail. The Fast pathway uses a stride of τ/α=2, so it keeps 32 frames from the same clip (α=8 times as many), but it is lightweight (e.g., 8 channels, which is β=1/8 of the Slow path). It captures how things move with fine temporal resolution.

Slow Pathway (4 frames, stride 16)
[3, 4, 224, 224] → ResNet (64ch) → rich spatial features

Fast Pathway (32 frames, stride 2)
[3, 32, 224, 224] → ResNet (8ch) → fine temporal features
↓ lateral connections (time-strided conv)
Fusion → Classifier
Concatenate features → [K] class scores
The asymmetry is the insight. The Fast pathway has β=1/8 the channels of the Slow pathway, making it very lightweight (~20% of total compute). Yet it processes 8× more frames. This means SlowFast gets fine temporal resolution almost for free. The lateral connections fuse Fast features into Slow via time-strided 3D convolutions that downsample 32 frames to 4.

Lateral connections: at each ResNet stage, the Fast pathway's feature map (e.g., [8, 32, 56, 56]) is transformed to match the Slow pathway's temporal dimension via a 3D conv with kernel [5, 1, 1] and stride [8, 1, 1] in time. This produces [8, 4, 56, 56], which gets concatenated channel-wise with the Slow features [64, 4, 56, 56] to give [72, 4, 56, 56].
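The shape bookkeeping for one lateral connection can be checked with the standard convolution size rule. This sketch assumes a temporal padding of 2 (half the kernel size), which the text above does not state, and keeps the Fast channel count unchanged as in the example; the function names are illustrative.

```python
def conv_out_len(size, kernel, stride, padding):
    """Output length along one axis of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def lateral_fuse_shape(fast_shape, slow_shape,
                       kernel_t=5, stride_t=8, padding_t=2):
    """Shape after time-strided conv on Fast features and channel concat with Slow.

    Shapes are [C, T, H, W]. padding_t=2 is an assumption (kernel_t // 2).
    """
    c_fast, t_fast, h, w = fast_shape
    c_slow, t_slow, h_slow, w_slow = slow_shape
    t_lateral = conv_out_len(t_fast, kernel_t, stride_t, padding_t)
    if (t_lateral, h, w) != (t_slow, h_slow, w_slow):
        raise ValueError("Fast features do not align with Slow features")
    return (c_slow + c_fast, t_slow, h, w)


fused = lateral_fuse_shape(fast_shape=(8, 32, 56, 56),
                           slow_shape=(64, 4, 56, 56))
print(fused)  # (72, 4, 56, 56)
```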

SlowFast Frame Sampling

Drag the time slider to scrub through a video. The Slow pathway samples every 8th frame (4 frames in total). The Fast pathway samples every frame (all 32). Notice how the Slow path sees "keyframes" while the Fast path sees smooth motion.

| Component | Slow | Fast |
|---|---|---|
| Temporal stride | τ = 16 | τ/α = 2 |
| Frames kept (T = 64 raw frames) | T/τ = 4 | T/(τ/α) = 32 |
| Channels | 64 | 8 (β = 1/8) |
| Compute share | ~80% | ~20% |
| Captures | Spatial detail, semantics | Motion, temporal patterns |
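The two sampling patterns are easy to generate. A minimal sketch, assuming a 64-frame raw clip with τ=16 and α=8; the function name is made up for illustration.

```python
def slowfast_indices(raw_frames=64, tau=16, alpha=8):
    """Frame indices seen by the Slow (stride tau) and Fast (stride tau/alpha) pathways."""
    slow = list(range(0, raw_frames, tau))
    fast = list(range(0, raw_frames, tau // alpha))
    return slow, fast


slow, fast = slowfast_indices()
print(slow)        # [0, 16, 32, 48]
print(len(fast))   # 32
print(fast[:5])    # [0, 2, 4, 6, 8]
```

With these strides every Slow frame is also a Fast frame, so the two pathways stay time-aligned.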

Results: SlowFast R-101 (with non-local blocks) achieves 79.8% top-1 on Kinetics-400, outperforming I3D (74.7%) and single-pathway R-101 (76.5%). The argument for the dual-pathway design is that asymmetric temporal sampling captures information a single frame rate misses, which simply widening one pathway would not add.

Why does the Fast pathway use far fewer channels than the Slow pathway?

Chapter 5: Video Transformers

CNNs capture local patterns with their fixed-size kernels. But some actions require long-range reasoning — the setup at frame 10 relates to the payoff at frame 90. Transformers, with their global self-attention, are a natural fit. The challenge: video has too many tokens for full attention.

Tokenization: just like ViT splits an image into patches, video transformers split a video into tubelets — 3D patches spanning time, height, and width. A video [T, 3, H, W] with tubelet size [t, p, p] produces (T/t) × (H/p) × (W/p) tokens, each a flattened vector of size t × p × p × 3 projected to dimension D.

N_tokens = (T/t) × (H/p) × (W/p)

For T=16, H=W=224, t=2, p=16: that's 8 × 14 × 14 = 1568 tokens. Full self-attention compares every token with every other, so each head computes 1568² ≈ 2.5M attention scores per layer. Manageable, but the count grows quadratically with video length.
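To see where the token count and token size come from, here is a minimal NumPy sketch that cuts a video into non-overlapping tubelets with reshapes. It stops before the learned projection to dimension D, and the function name is an illustrative assumption.

```python
import numpy as np


def tubelet_tokens(video, t, p):
    """Cut a [T, C, H, W] video into flattened tubelets of size t x p x p x C.

    Returns an array of shape [N_tokens, t * p * p * C].
    Assumes T, H and W are divisible by t, p and p.
    """
    T, C, H, W = video.shape
    x = video.reshape(T // t, t, C, H // p, p, W // p, p)
    x = x.transpose(0, 3, 5, 1, 4, 6, 2)     # [T/t, H/p, W/p, t, p, p, C]
    return x.reshape(-1, t * p * p * C)


video = np.zeros((16, 3, 224, 224), dtype=np.float32)   # placeholder clip
tokens = tubelet_tokens(video, t=2, p=16)
print(tokens.shape)          # (1568, 1536)
print(tokens.shape[0] ** 2)  # 2458624 token pairs for full attention
```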

TimeSformer's factored attention (Bertasius et al., 2021): instead of attending all 1568 tokens to each other, split attention into two steps. First, each token attends only to tokens at the same spatial position across all time steps (temporal attention, cost: 8 tokens). Then, each token attends to all tokens at the same time step (spatial attention, cost: 196 tokens). Total: O(8 + 196) per token instead of O(1568). That's a 7.7× reduction.
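The savings from factoring can be computed directly from the grid sizes. A small sketch of the counting argument above (it counts how many tokens each token attends to, and ignores constant factors such as the class token); the function name is made up.

```python
def attention_keys_per_token(T, H, W, t, p):
    """How many tokens each token attends to: full vs. factored attention."""
    n_time = T // t                     # temporal positions
    n_space = (H // p) * (W // p)       # spatial positions per time step
    full = n_time * n_space             # attend to everything
    factored = n_time + n_space         # temporal pass, then spatial pass
    return full, factored


full, factored = attention_keys_per_token(T=16, H=224, W=224, t=2, p=16)
print(full, factored, round(full / factored, 1))  # 1568 204 7.7
```

The ratio improves as the clip gets longer: doubling T doubles the full cost but adds only 8 more tokens to the factored cost.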
Video
[T, 3, H, W] = [16, 3, 224, 224]
↓ tubelet embedding
Tokens + CLS
[1569, D] — 1568 patch tokens + 1 class token
↓ L transformer blocks (factored attention)
CLS token
[D] → linear → [K] class scores

ViViT (Arnab et al., 2021) compared four model variants: full spatiotemporal attention plus three factorized designs. Its factorized encoder is a two-stage approach that offers a strong accuracy/compute trade-off: a spatial encoder processes each frame independently, then a temporal encoder processes the sequence of per-frame CLS tokens. This is late temporal fusion at the transformer level.

VideoMAE (Tong et al., 2022) took self-supervised pretraining to video: mask 90% of tubelet tokens, and train the transformer to reconstruct them. Why 90%? Because video has enormous temporal redundancy — neighboring frames are nearly identical. Masking 90% forces the model to learn actual motion and structure, not just copy from nearby frames. After pretraining, fine-tune for classification.
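One way to stop the model from copying a masked patch from the neighboring frame is to hide the same spatial positions in every time step, a scheme the VideoMAE paper calls tube masking. The sketch below illustrates that idea on the 8 × 196 token grid from earlier; the function name, the seed handling and the rounding of the mask count are assumptions.

```python
import random


def tube_mask(n_time, n_space, mask_ratio=0.9, seed=0):
    """Return mask[t][s] == True where a token is hidden.

    The same spatial positions are hidden at every time step, so a masked
    patch cannot be recovered by looking at the same location one step later.
    """
    rng = random.Random(seed)
    n_masked = round(mask_ratio * n_space)
    hidden = set(rng.sample(range(n_space), n_masked))
    frame_mask = [s in hidden for s in range(n_space)]
    return [list(frame_mask) for _ in range(n_time)]


mask = tube_mask(n_time=8, n_space=196)
visible = sum(not m for row in mask for m in row)
print(visible, "of", 8 * 196)  # 160 of 1568
```

Only the visible tokens are fed to the encoder, so pretraining at a 90% mask ratio also processes about a tenth of the tokens per clip.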

Factored vs Full Attention

Compare attention patterns. Full attention connects every token to every other. Factored attention separates space and time. Adjust grid size to see how cost scales.

Why does VideoMAE mask 90% of tokens (vs 75% for image MAE)?

Chapter 6: Temporal Action Detection

So far we've been classifying trimmed clips — short videos containing exactly one action. In the real world, videos are untrimmed. A 2-hour movie contains hundreds of actions at different times. Temporal action detection answers: when does each action happen?

The output isn't a single label — it's a list of (start time, end time, action class, confidence score) for every detected action instance. Think of it as object detection, but in 1D (time) instead of 2D (image).

Long Video
[T_long, 3, H, W], e.g., 5 minutes
↓ feature extraction (pretrained backbone)
Feature Sequence
[T', D] — one feature vector per snippet
↓ temporal model (e.g., ActionFormer)
Detections
List of (t_start, t_end, class, conf)

ActionFormer (Zhang et al., 2022) builds a feature pyramid over the temporal dimension, with levels at different scales (short actions at fine scale, long actions at coarse scale). At each level, local self-attention attends over nearby time steps. Each time step predicts: (1) the action class, and (2) the distance to the start and end of the action. This is analogous to how FCOS detects objects in images — anchor-free, per-point regression.
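Turning per-time-step predictions into segments is mostly bookkeeping. The sketch below decodes (class scores, distance to start, distance to end) into (t_start, t_end, class, conf) tuples and removes duplicates with plain 1D non-maximum suppression. The numbers are invented, the helper names are assumptions, and plain NMS is a simplification chosen for clarity, not necessarily the post-processing a given detector uses.

```python
def decode_segments(times, class_scores, dist_to_start, dist_to_end, threshold=0.5):
    """Each confident time step votes for one segment around itself."""
    detections = []
    for t, scores, d_s, d_e in zip(times, class_scores, dist_to_start, dist_to_end):
        cls = max(range(len(scores)), key=scores.__getitem__)
        if scores[cls] >= threshold:
            detections.append((t - d_s, t + d_e, cls, scores[cls]))
    return detections


def temporal_iou(a, b):
    """Intersection over union of two (start, end, ...) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_1d(detections, iou_threshold=0.5):
    """Keep the highest-scoring segment among same-class overlapping ones."""
    kept = []
    for det in sorted(detections, key=lambda d: d[3], reverse=True):
        if all(det[2] != k[2] or temporal_iou(det, k) < iou_threshold for k in kept):
            kept.append(det)
    return kept


# Invented predictions at four time steps (seconds), two classes.
times = [1.0, 2.0, 3.0, 10.0]
class_scores = [[0.1, 0.8], [0.05, 0.9], [0.2, 0.7], [0.3, 0.2]]
dist_to_start = [0.5, 1.5, 2.4, 1.0]
dist_to_end = [2.5, 1.5, 0.6, 1.0]

raw = decode_segments(times, class_scores, dist_to_start, dist_to_end)
final = nms_1d(raw)
print(len(raw), len(final))  # 3 1
```

Three neighboring time steps all describe the same action instance; NMS keeps the most confident one.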

The feature pyramid intuition: a "clapping" action lasts 2 seconds. A "cooking a meal" action lasts 5 minutes. You need different temporal receptive fields to detect them. The pyramid gives you levels at 1s, 2s, 4s, 8s, ... resolution. Short actions are found at fine levels, long actions at coarse levels.
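A temporal pyramid can be sketched by repeatedly halving the time axis. The version below averages adjacent pairs of scalar features, which is a stand-in for the learned downsampling a real model would use, and it assumes one feature per second at the finest level; the function name is made up.

```python
def build_pyramid(features, num_levels=4):
    """Build pyramid levels by averaging adjacent pairs along time.

    features: per-snippet values at the finest level (scalars here for brevity;
    a real model would hold a D-dimensional vector per time step).
    """
    levels = [list(features)]
    for _ in range(num_levels - 1):
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2.0
                       for i in range(0, len(prev) - 1, 2)])
    return levels


features = [float(i) for i in range(16)]       # 16 seconds of video
levels = build_pyramid(features)
seconds_per_step = [2 ** i for i in range(len(levels))]

print([len(level) for level in levels])  # [16, 8, 4, 2]
print(seconds_per_step)                  # [1, 2, 4, 8]
```

A 2-second clap spans two steps at the finest level, while a multi-minute action only becomes a compact pattern at the coarse levels.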
Temporal Detection Timeline

A long video with multiple actions at different times. The detector outputs colored segments with confidence scores. Click Detect to run.

| Method | Approach | mAP on ActivityNet |
|---|---|---|
| BMN (2019) | Proposal + classification | 50.1% |
| VSGN (2021) | Graph-based proposals | 52.4% |
| ActionFormer (2022) | Anchor-free pyramid + local attn | 54.7% |
| TriDet (2023) | Trident head on pyramid | 55.4% |

Why does temporal action detection use a feature pyramid?

Chapter 7: Skeleton-Based Recognition

All the methods so far process raw pixels. But actions are fundamentally about body movement. What if we skip the pixels entirely and work with skeleton keypoints — the (x, y) coordinates of body joints over time? This is the idea behind skeleton-based action recognition.

A pose estimator (like OpenPose or HRNet) extracts N joint positions per frame. For the COCO keypoint layout, N=17 joints (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles). Over T frames, the input is a tensor of shape [N, 3, T]: N joints, each with (x, y, confidence), across T time steps. This is dramatically smaller than raw video: 17 × 3 × 64 = 3,264 numbers vs 64 × 3 × 224 × 224 = 9.6 million.

A body is a graph, not a grid. Joints aren't arranged in a regular grid like pixels. The shoulder connects to the elbow, the elbow to the wrist — this is a graph structure. Standard CNNs assume grid inputs. We need Graph Convolutional Networks (GCNs) that respect the body's topology.

ST-GCN (Yan et al., 2018) defines two types of edges: spatial edges (bones connecting joints in a single frame — shoulder-to-elbow, hip-to-knee) and temporal edges (the same joint across consecutive frames — left wrist at time t to left wrist at time t+1). A graph convolution aggregates features from neighboring nodes:

f_out(v) = Σ_{u ∈ N(v)} W(u) · f_in(u)

Where N(v) is the set of neighbors of joint v in the spatiotemporal graph, and W(u) is a learnable weight that depends on the relative position of u to v (center, centripetal, centrifugal partitioning). Stack multiple ST-GCN layers, and the receptive field grows — a wrist learns about what the elbow and shoulder are doing, which in turn know about the torso.
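The growing receptive field is easy to watch on a toy graph. The sketch below is a deliberately simplified graph convolution: it uses only spatial edges on a four-joint chain, one shared weight matrix instead of the center/centripetal/centrifugal partitioning described above, and simple row normalization. All names are illustrative assumptions.

```python
import numpy as np


def normalized_adjacency(num_joints, bones):
    """Adjacency with self-loops, row-normalized so each joint averages its neighbors."""
    a = np.eye(num_joints)
    for u, v in bones:
        a[u, v] = a[v, u] = 1.0
    return a / a.sum(axis=1, keepdims=True)


def graph_conv(features, adj, weight):
    """features [N, C_in], adj [N, N], weight [C_in, C_out] -> [N, C_out]."""
    return adj @ features @ weight


# Toy chain: 0 = torso, 1 = shoulder, 2 = elbow, 3 = wrist.
bones = [(0, 1), (1, 2), (2, 3)]
adj = normalized_adjacency(4, bones)

x = np.zeros((4, 1))
x[3, 0] = 1.0                 # a signal that starts at the wrist only
w = np.ones((1, 1))           # identity-like weight to isolate the graph effect

h1 = graph_conv(x, adj, w)
h2 = graph_conv(h1, adj, w)
h3 = graph_conv(h2, adj, w)
reach = [int(np.count_nonzero(h)) for h in (x, h1, h2, h3)]
print(reach)  # [1, 2, 3, 4]: one more joint is reached per layer
```

After three layers the torso's feature depends on the wrist, even though the two are not directly connected.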

Interactive Skeleton

A skeleton performing an action over time. Orange = spatial edges (bones). Teal dotted = temporal edges (same joint across time). Click Play to animate. The GCN "reads" both edge types simultaneously.

Why skeletons matter: they're invariant to background, lighting, clothing, and camera angle. A "waving" skeleton looks the same whether you're in a park or a kitchen, wearing red or blue. This makes skeleton models excellent for cross-domain generalization. The tradeoff: you lose object and scene context (can't distinguish "eating an apple" from "eating a sandwich" by skeleton alone).

What do "temporal edges" in the ST-GCN graph connect?

Chapter 8: Training & Datasets

Training an action recognition model is expensive. A single video clip is 16–64 frames, each a full image. Backpropagating through a 3D CNN or video transformer on a batch of clips requires 4–16× more memory than image training. Here's how practitioners handle it.

Clip Sampling

You can't feed an entire video to the network. Instead, sample short clips during training. Common strategies:

| Strategy | Method | Used by |
|---|---|---|
| Uniform | Divide video into T segments, sample 1 frame per segment | TSN, TSM |
| Random | Pick a random start point, take T consecutive frames | C3D, I3D |
| Multi-clip | At test time, sample K clips from different positions, average scores | SlowFast, ViViT |

Multi-crop testing: at test time, SlowFast uses 10 clips × 3 spatial crops = 30 views per video. Each clip is sampled from a different temporal position, and each crop covers left/center/right of the frame. The 30 softmax vectors are averaged. This boosts accuracy by ~2% but costs 30× the inference compute.
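The three sampling strategies differ only in which frame indices they pick. A minimal sketch in plain Python; the function names, the 300-frame video and the evenly spaced test-time clips are assumptions for illustration.

```python
import random


def uniform_segment_sample(num_frames, num_segments, rng):
    """Split the video into equal segments and pick one random frame from each."""
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def random_clip_sample(num_frames, clip_len, rng):
    """Pick a random start point and take clip_len consecutive frames."""
    start = rng.randrange(num_frames - clip_len + 1)
    return list(range(start, start + clip_len))


def multi_clip_starts(num_frames, clip_len, num_clips):
    """Evenly spaced clip start indices for test-time multi-clip evaluation."""
    last = num_frames - clip_len
    if num_clips == 1:
        return [last // 2]
    return [round(i * last / (num_clips - 1)) for i in range(num_clips)]


rng = random.Random(0)
uniform = uniform_segment_sample(num_frames=300, num_segments=8, rng=rng)
clip = random_clip_sample(num_frames=300, clip_len=64, rng=rng)
starts = multi_clip_starts(num_frames=300, clip_len=64, num_clips=10)

views = len(starts) * 3      # 3 spatial crops (left, center, right) per clip
print(views)                 # 30
```

Uniform sampling covers the whole video with few frames, while a random clip sees dense motion in one short window.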

The Big Datasets

| Dataset | Classes | Clips | Tests |
|---|---|---|---|
| Kinetics-400 | 400 | ~306K | Appearance + motion |
| Something-Something v2 | 174 | ~221K | Temporal reasoning (egocentric) |
| AVA v2.2 | 80 | 430 videos | Spatiotemporal detection (who does what where) |
| EPIC-Kitchens-100 | 97 verbs, 300 nouns | 90K segments | Egocentric, fine-grained |
| ActivityNet | 200 | ~20K | Untrimmed temporal detection |

Kinetics is the ImageNet of video — large-scale, diverse, good for pretraining. Something-Something is the stress test — "pushing something from left to right" requires understanding motion direction, not just scene context. Models that cheat with scene bias fail here. AVA combines detection with recognition — for each person in each keyframe, predict a set of actions (one person can be "walking" and "talking on phone" simultaneously).

Computational Cost

Training a SlowFast R-101 on Kinetics-400 for 256 epochs takes ~128 GPU-days on V100s. A ViViT-L takes ~200 GPU-days. VideoMAE pretraining adds another ~100 GPU-days. This is why transfer learning from Kinetics is standard — most labs can't afford to train from scratch.

Clip Sampling Strategies

A video timeline with 32 frames. See how uniform, random, and dense sampling select different frames. Click Sample to resample.

Why is Something-Something v2 harder than Kinetics-400 for scene-biased models?

Chapter 9: Connections

Action recognition has evolved through a clear lineage, each generation addressing the limitations of the last:

| Era | Method | Key idea | Limitation |
|---|---|---|---|
| 2014 | Two-Stream CNNs | Separate appearance + motion | Requires precomputed optical flow |
| 2015 | C3D / 3D CNNs | Learn spatiotemporal features jointly | No ImageNet pretraining, small datasets |
| 2017 | I3D (inflate 2D→3D) | Transfer ImageNet features to video | Fixed temporal window, expensive |
| 2019 | SlowFast | Dual frame rate, asymmetric design | Still CNN-based, limited long-range |
| 2021 | TimeSformer / ViViT | Transformer attention over space+time | Quadratic cost, needs large data |
| 2022 | VideoMAE | Self-supervised pretraining for video | Pretraining cost, still clip-level |
| 2023+ | Video foundation models | Text-video pretraining, zero-shot | Massive compute, emerging field |

The big trend: the field is moving from hand-crafted temporal features (optical flow) to learned temporal features (3D conv, attention), from supervised pretraining (Kinetics) to self-supervised pretraining (VideoMAE), and from fixed classification (400 labels) to open-vocabulary recognition (describe the action in text).

Related Lessons

Upstream: Transformers (the attention mechanism behind video ViTs), Contrastive Learning & CLIP (text-video alignment for zero-shot recognition).

Downstream: Vision-Language Models (describing actions in natural language), World Models (predicting future actions and states from video).

Adjacent: Vision-Language-Action Models (from recognizing actions to performing them in robots).

Cheat sheet: when to use what.
Short trimmed clips + enough labeled data: SlowFast or Video Swin — best accuracy/cost tradeoff.
Long untrimmed videos: ActionFormer on top of a SlowFast backbone — temporal detection.
Privacy-sensitive / cross-domain: ST-GCN on skeleton keypoints — no appearance info needed.
Limited labeled data: VideoMAE self-supervised pretrain, then fine-tune — label-efficient.
Open-vocabulary / zero-shot: Text-video model (e.g., VideoCLIP) — no fixed class set needed.

"The purpose of computing is insight, not numbers." — Richard Hamming