How does a computer watch a video and say "that's playing baseball" — when all it sees is pixels changing over time?
You see a video of someone swinging a bat. Instantly you know: playing baseball. But how? The video is just a grid of pixels changing over time. There is no label embedded in the photons. Your brain is doing something extraordinary — recognizing a pattern that spans both space (the shape of a bat, a person's pose) and time (the swing motion, the follow-through).
Now give this task to a computer. A video is a tensor of shape [T, 3, H, W] — T frames, each with 3 color channels, at height H and width W. A 10-second clip at 30 fps is 300 frames × 3 × 224 × 224 = 45 million numbers. Somewhere in those 45 million numbers is the information "playing baseball." The computer must find it.
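To make the scale concrete, here is a small sketch of the arithmetic above. The helper name `video_num_elements` is made up for illustration; it simply multiplies out the [T, 3, H, W] shape.

```python
def video_num_elements(seconds: float, fps: int, channels: int,
                       height: int, width: int) -> int:
    """Number of scalar values in a video tensor of shape [T, C, H, W]."""
    num_frames = int(seconds * fps)  # T
    return num_frames * channels * height * width


n = video_num_elements(seconds=10, fps=30, channels=3, height=224, width=224)
print(f"{n:,} values")  # 45,158,400 values
```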
Two stick figures with the same pose in the middle frame. Click Play to see the motion — one is waving, the other is pointing. A single frame cannot tell them apart.
This is why action recognition is fundamentally harder than image classification. ImageNet asks "what object is this?" — a spatial question. Kinetics asks "what activity is happening?" — a spatiotemporal question. You need to understand both what things look like and how they move.
The simplest approach: take each frame, run it through an image classifier (a CNN like ResNet), and average the predictions. For every frame independently: input [3, 224, 224] → ResNet → softmax over K action classes.
This works surprisingly often. If you see a swimming pool, you can guess "swimming." If you see a tennis court with a racket, you can guess "playing tennis." These are scene-biased actions — the background gives it away.
But this fails catastrophically for temporal actions. Waving vs. pointing. Opening a door vs. closing a door. Picking something up vs. putting it down. The per-frame appearance is nearly identical — only the direction of motion over time tells them apart.
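One way to see why the per-frame baseline cannot separate "opening" from "closing": averaging is blind to frame order. The sketch below uses random numbers as a stand-in for ResNet logits (an assumption, no real classifier is involved) and shows that a clip and its time-reversed copy get the same averaged prediction.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def average_frame_predictions(frame_logits):
    """frame_logits: [T, K] per-frame class logits -> [K] averaged probabilities."""
    return softmax(frame_logits, axis=-1).mean(axis=0)


rng = np.random.default_rng(0)
frame_logits = rng.normal(size=(16, 5))  # stand-in for CNN outputs: T=16 frames, K=5 classes

forward = average_frame_predictions(frame_logits)
backward = average_frame_predictions(frame_logits[::-1])  # same clip played in reverse

print(np.allclose(forward, backward))  # True
```

Whatever the per-frame classifier is, the mean over frames discards ordering, so any pair of actions that differ only in direction of motion collapses to the same score.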
Watch a CNN classify individual frames. The prediction flickers — some frames look ambiguous. Click Run to see per-frame confidence scores for two actions.
| Dataset | Per-frame CNN accuracy | Temporal model accuracy | Gap (points) |
|---|---|---|---|
| UCF-101 (scene-biased) | ~73% | ~95% | 22 |
| Something-Something (temporal) | ~20% | ~65% | 45 |
| Kinetics-400 (mixed) | ~62% | ~79% | 17 |
In 2014, Simonyan & Zisserman had a beautiful insight: the human visual cortex processes appearance and motion in separate pathways (the ventral and dorsal streams). What if we give the CNN two separate inputs — one for what things look like, and one for how they move?
Optical flow is a field that captures pixel motion between frames. At each pixel, it stores a 2D vector (dx, dy) — how far and in which direction that pixel moved. Stack L consecutive flow frames and you get a motion "image" of shape [2L, 224, 224].
Why stacked flow? A single flow frame captures instantaneous motion, but actions have temporal extent. Stacking L=10 flow frames (so 20 channels — 10 horizontal, 10 vertical) gives the temporal CNN a window of motion to reason over. The input acts like a "motion image" that a standard 2D CNN can read.
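A minimal sketch of building the stacked-flow input, assuming the flow fields have already been computed and arrive as an array of shape [L, H, W, 2] with (dx, dy) in the last axis. That layout, the function name `stack_flow`, and the "all horizontal channels first, then all vertical" ordering are assumptions made for illustration.

```python
import numpy as np


def stack_flow(flow_fields):
    """flow_fields: [L, H, W, 2] with (dx, dy) last -> motion image [2L, H, W]."""
    dx = flow_fields[..., 0]  # [L, H, W] horizontal displacement
    dy = flow_fields[..., 1]  # [L, H, W] vertical displacement
    return np.concatenate([dx, dy], axis=0)


L, H, W = 10, 224, 224
flow_fields = np.zeros((L, H, W, 2), dtype=np.float32)
flow_fields[..., 0] = 1.0  # toy flow: every pixel moves 1 px to the right

stacked = stack_flow(flow_fields)
print(stacked.shape)  # (20, 224, 224)
```

The result has the same layout as an image with 20 channels, which is why an ordinary 2D CNN can consume it once its first conv layer accepts 2L input channels.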
The two streams are fused late, by averaging their class scores: final = 0.5 × spatial + 0.5 × temporal. This works because the streams capture complementary information: appearance ("there's a ball and a bat") and motion ("swinging motion"). On UCF-101, spatial alone gets ~73%, temporal alone gets ~83%, fused gets ~88%.
Adjust the fusion weight between spatial (appearance) and temporal (motion) streams. Watch how confidence changes for a temporal action like "waving."
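The weighted late fusion can be sketched in a few lines. The probabilities below are invented for illustration (an appearance stream that cannot decide, a motion stream that can); they are not measured outputs.

```python
import numpy as np


def late_fusion(spatial_probs, temporal_probs, w_spatial=0.5):
    """Weighted average of the two streams' class probabilities."""
    return w_spatial * spatial_probs + (1.0 - w_spatial) * temporal_probs


classes = ["waving", "pointing"]
spatial = np.array([0.5, 0.5])   # hypothetical: appearance is ambiguous
temporal = np.array([0.9, 0.1])  # hypothetical: motion favors "waving"

fused = late_fusion(spatial, temporal, w_spatial=0.5)
print(classes[int(fused.argmax())])  # waving
```

Setting `w_spatial` to 1.0 or 0.0 recovers either stream alone, which is a quick way to check how much each one contributes on a given class.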
The cost: optical flow must be precomputed for every video. Traditional methods (TV-L1) take ~0.06s per frame pair on a GPU. For Kinetics-400 with 300K videos averaging 250 frames each, that's ~1250 GPU-hours just for flow extraction — before any training. This bottleneck motivated the move to methods that learn motion features directly.
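The flow-extraction estimate is simple arithmetic, reproduced here so the inputs can be swapped out. It approximates the number of frame pairs per video by the number of frames.

```python
num_videos = 300_000       # Kinetics-400, approximate
frames_per_video = 250     # average, approximate
seconds_per_pair = 0.06    # TV-L1 on a GPU, per frame pair

total_seconds = num_videos * frames_per_video * seconds_per_pair
gpu_hours = total_seconds / 3600

print(round(gpu_hours))  # 1250
```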
Two-stream networks need precomputed optical flow. What if the network could learn to extract motion features by itself? The idea: extend 2D convolutions into 3D. A 2D conv kernel is [k, k] — it slides over height and width. A 3D conv kernel is [k, k, k] — it slides over time, height, and width simultaneously.
C3D (Tran et al., 2015) was the first major architecture: 8 layers of 3D convolutions with [3, 3, 3] kernels, processing 16-frame clips of shape [3, 16, 112, 112]. Each 3D conv layer captures local spatiotemporal patterns — an edge that moves, a texture that changes, a limb that rotates.
Kernel shapes compared: [1,3,3] (spatial only), [3,1,1] (temporal only), and [3,3,3] (joint). The joint kernel won consistently. This means spatiotemporal features aren't separable: motion and appearance interact, and the network needs to see them together at every layer.
I3D (Carreira & Zisserman, 2017) had a smarter idea: inflate a pretrained 2D network into 3D. Take an Inception-v1 trained on ImageNet. Every 2D conv kernel [k, k] becomes [k, k, k] by repeating the weights along the time axis and dividing by k. Every 2D pooling layer gets a temporal dimension. The result: a 3D network that starts with strong spatial features from ImageNet and only needs to learn the temporal part.
A 2D kernel slides over space only. A 3D kernel slides over space and time. Watch the orange kernel move through a video volume.
The inflation trick in detail: a 2D kernel W of shape [C_out, C_in, k, k] becomes W_3d of shape [C_out, C_in, k, k, k]. Each temporal slice gets W / k. This ensures the 3D kernel produces the same output as the 2D kernel when applied to a static (repeated) image — so the pretrained features transfer perfectly. Then fine-tuning on video teaches the temporal dimension.
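The claim that an inflated kernel reproduces the 2D output on a static clip can be checked numerically. This is a small NumPy sketch with a single output channel, no padding, and random weights standing in for pretrained ones; the helper names `conv2d`, `conv3d`, and `inflate` are made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d(image, kernel):
    """Valid cross-correlation. image: [C, H, W], kernel: [C, k, k] -> [H', W']."""
    k = kernel.shape[-1]
    windows = sliding_window_view(image, (k, k), axis=(1, 2))  # [C, H', W', k, k]
    return np.einsum("chwij,cij->hw", windows, kernel)


def conv3d(video, kernel):
    """Valid cross-correlation. video: [C, T, H, W], kernel: [C, k, k, k] -> [T', H', W']."""
    k = kernel.shape[-1]
    windows = sliding_window_view(video, (k, k, k), axis=(1, 2, 3))  # [C, T', H', W', k, k, k]
    return np.einsum("cthwlij,clij->thw", windows, kernel)


def inflate(kernel_2d):
    """[C, k, k] -> [C, k, k, k]: repeat along a new time axis, divide by k."""
    k = kernel_2d.shape[-1]
    return np.repeat(kernel_2d[:, None, :, :], k, axis=1) / k


rng = np.random.default_rng(0)
image = rng.normal(size=(3, 8, 8))
kernel_2d = rng.normal(size=(3, 3, 3))               # stand-in for a pretrained kernel
video = np.repeat(image[:, None, :, :], 5, axis=1)   # static clip: same frame 5 times

out_2d = conv2d(image, kernel_2d)                    # [6, 6]
out_3d = conv3d(video, inflate(kernel_2d))           # [3, 6, 6]
print(np.allclose(out_3d, out_2d[None]))             # True
```

Each of the k temporal slices contributes 1/k of the 2D response, and on a static clip all slices see the same frame, so the contributions sum back to the original 2D output.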
Here's a biological insight: among the retinal ganglion cells of the primate visual system, ~80% respond slowly (sustained, color-sensitive, high spatial detail) and ~20% respond rapidly (transient, motion-sensitive, lower spatial detail). Feichtenhofer et al. (2019) turned this into an architecture with two pathways operating at different temporal resolutions.
The Slow pathway processes 4 frames per second — few frames but high channel capacity (e.g., 64 channels). It captures what things look like in fine spatial detail. The Fast pathway processes 32 frames per second — many frames but lightweight (e.g., 8 channels, which is β=1/8 of the Slow path). It captures how things move with fine temporal resolution.
Lateral connections: at each ResNet stage, the Fast pathway's feature map (e.g., [8, 32, 56, 56]) is transformed to match the Slow pathway's temporal dimension via a 3D conv with kernel [5, 1, 1] and stride [8, 1, 1] in time. This produces [8, 4, 56, 56], which gets concatenated channel-wise with the Slow features [64, 4, 56, 56] to give [72, 4, 56, 56].
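The shape bookkeeping for the lateral connection follows from the standard convolution output-length formula. The sketch below only tracks shapes. The temporal padding of 2 is an assumption (the text gives only kernel and stride), chosen because it is what makes 32 frames map to 4.

```python
def conv_out_len(length, kernel, stride, padding):
    """Output length of a 1D convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1


fast_shape = (8, 32, 56, 56)   # [C, T, H, W] on the Fast pathway
slow_shape = (64, 4, 56, 56)   # [C, T, H, W] on the Slow pathway

# Time-strided conv: kernel [5, 1, 1], stride [8, 1, 1], assumed padding [2, 0, 0].
t_out = conv_out_len(fast_shape[1], kernel=5, stride=8, padding=2)
lateral_shape = (fast_shape[0], t_out, fast_shape[2], fast_shape[3])

# Channel-wise concatenation requires matching T, H, W.
assert lateral_shape[1:] == slow_shape[1:]
fused_shape = (slow_shape[0] + lateral_shape[0],) + slow_shape[1:]

print(lateral_shape)  # (8, 4, 56, 56)
print(fused_shape)    # (72, 4, 56, 56)
```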
Drag the time slider to scrub through a video. The Slow pathway samples every 8th frame (4 fps). The Fast pathway samples every frame (32 fps). Notice how the Slow path sees "keyframes" while the Fast path sees smooth motion.
| Component | Slow | Fast |
|---|---|---|
| Frame rate | 4 fps (τ=16 stride) | 32 fps (τ/α=2 stride) |
| Channels | 64 | 8 (β=1/8) |
| Temporal frames | T/τ = 4 | T/(τ/α) = 32 |
| Compute share | ~80% | ~20% |
| Captures | Spatial detail, semantics | Motion, temporal patterns |
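The two sampling rates in the table amount to two strides over the same raw clip. A small sketch, assuming a 64-frame clip with τ=16 and α=8 (the function name `pathway_indices` is hypothetical):

```python
def pathway_indices(num_frames, tau, alpha):
    """Frame indices seen by each pathway for a clip of num_frames raw frames."""
    slow = list(range(0, num_frames, tau))           # stride tau
    fast = list(range(0, num_frames, tau // alpha))  # stride tau / alpha
    return slow, fast


slow_idx, fast_idx = pathway_indices(num_frames=64, tau=16, alpha=8)
print(slow_idx)       # [0, 16, 32, 48]
print(len(fast_idx))  # 32
```

With these strides every Slow frame is also a Fast frame, so the two pathways stay aligned in time and the Fast pathway simply fills in the motion between Slow keyframes.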
Results: SlowFast R-101 achieves 79.8% on Kinetics-400, outperforming I3D (74.7%) and single-pathway R-101 (76.5%). In this comparison the dual-pathway design beats simply making one pathway wider: the asymmetric temporal sampling captures information that a single frame rate misses.
CNNs capture local patterns with their fixed-size kernels. But some actions require long-range reasoning — the setup at frame 10 relates to the payoff at frame 90. Transformers, with their global self-attention, are a natural fit. The challenge: video has too many tokens for full attention.
Tokenization: just like ViT splits an image into patches, video transformers split a video into tubelets — 3D patches spanning time, height, and width. A video [T, 3, H, W] with tubelet size [t, p, p] produces (T/t) × (H/p) × (W/p) tokens, each a flattened vector of size t × p × p × 3 projected to dimension D.
For T=16, H=W=224, t=2, p=16: that's 8 × 14 × 14 = 1568 tokens. Full self-attention compares every token with every other, so each head computes 1568² ≈ 2.46M attention scores per layer. Manageable, but the cost grows quadratically with the token count, so longer videos get expensive fast.
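Tubelet tokenization is just a reshape. The sketch below cuts a [T, 3, H, W] array into non-overlapping tubelets and flattens each one; the learned projection to dimension D is left out, and the function name `tubelet_tokens` is made up for this example.

```python
import numpy as np


def tubelet_tokens(video, t, p):
    """video: [T, C, H, W] -> tokens [N, t*p*p*C], N = (T/t) * (H/p) * (W/p)."""
    T, C, H, W = video.shape
    x = video.reshape(T // t, t, C, H // p, p, W // p, p)
    x = x.transpose(0, 3, 5, 1, 4, 6, 2)  # [T/t, H/p, W/p, t, p, p, C]
    return x.reshape(-1, t * p * p * C)


video = np.zeros((16, 3, 224, 224), dtype=np.float32)
tokens = tubelet_tokens(video, t=2, p=16)

print(tokens.shape)       # (1568, 1536)
print(tokens.shape[0] ** 2)  # 2458624 pairwise attention scores per head
```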
ViViT (Arnab et al., 2021) explored four factorization strategies and found that a two-stage approach works best: a spatial encoder processes each frame independently, then a temporal encoder processes the sequence of per-frame CLS tokens. This is late temporal fusion at the transformer level.
VideoMAE (Tong et al., 2022) took self-supervised pretraining to video: mask 90% of tubelet tokens, and train the transformer to reconstruct them. Why 90%? Because video has enormous temporal redundancy — neighboring frames are nearly identical. Masking 90% forces the model to learn actual motion and structure, not just copy from nearby frames. After pretraining, fine-tune for classification.
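A sketch of what 90% masking leaves behind, using the 8 × 14 × 14 token grid from the tokenization example. It uses tube masking, where the same spatial positions are hidden in every temporal slice so the model cannot copy a patch from a neighboring frame; treat the details here (helper name, rounding) as illustrative assumptions rather than the reference implementation.

```python
import numpy as np


def tube_mask(num_time, num_space, mask_ratio, rng):
    """Boolean mask [num_time, num_space]; True = masked. Same pattern in every slice."""
    num_masked = int(mask_ratio * num_space)
    masked_positions = rng.permutation(num_space)[:num_masked]
    spatial_mask = np.zeros(num_space, dtype=bool)
    spatial_mask[masked_positions] = True
    return np.tile(spatial_mask, (num_time, 1))


rng = np.random.default_rng(0)
mask = tube_mask(num_time=8, num_space=14 * 14, mask_ratio=0.9, rng=rng)

print(int(mask.sum()), int((~mask).sum()))  # 1408 160
```

Only the 160 visible tokens go through the encoder, which is also why such a high masking ratio makes pretraining cheaper per step.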
Compare attention patterns. Full attention connects every token to every other. Factored attention separates space and time. Adjust grid size to see how cost scales.
So far we've been classifying trimmed clips — short videos containing exactly one action. In the real world, videos are untrimmed. A 2-hour movie contains hundreds of actions at different times. Temporal action detection answers: when does each action happen?
The output isn't a single label — it's a list of (start time, end time, action class, confidence score) for every detected action instance. Think of it as object detection, but in 1D (time) instead of 2D (image).
ActionFormer (Zhang et al., 2022) builds a feature pyramid over the temporal dimension, with levels at different scales (short actions at fine scale, long actions at coarse scale). At each level, local self-attention attends over nearby time steps. Each time step predicts: (1) the action class, and (2) the distance to the start and end of the action. This is analogous to how FCOS detects objects in images — anchor-free, per-point regression.
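The per-point regression can be sketched as a decode step plus 1D non-maximum suppression. The predictions below are invented, and plain greedy NMS is used as a simple stand-in for whatever post-processing a real detector applies.

```python
def temporal_iou(a, b):
    """IoU of two segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def decode(points):
    """(t, d_start, d_end, label, score) -> (start, end, label, score)."""
    return [(t - ds, t + de, label, score) for t, ds, de, label, score in points]


def nms_1d(segments, iou_threshold=0.5):
    """Greedy per-class NMS over time: keep the best-scoring segment, drop overlaps."""
    kept = []
    for seg in sorted(segments, key=lambda s: s[3], reverse=True):
        if all(seg[2] != k[2] or temporal_iou(seg, k) < iou_threshold for k in kept):
            kept.append(seg)
    return kept


# Hypothetical per-time-step predictions (times in seconds).
points = [
    (12.0, 2.0, 3.0, "swing", 0.9),  # decodes to (10.0, 15.0)
    (13.0, 2.5, 2.5, "swing", 0.7),  # decodes to (10.5, 15.5), overlaps the first
    (40.0, 5.0, 5.0, "run", 0.8),    # decodes to (35.0, 45.0)
]
detections = nms_1d(decode(points))
for start, end, label, score in detections:
    print(f"{label}: {start:.1f}s to {end:.1f}s (score {score})")
```

Neighboring time steps inside the same action tend to predict nearly the same segment, which is why a suppression step is needed to turn dense per-point outputs into one detection per instance.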
A long video with multiple actions at different times. The detector outputs colored segments with confidence scores. Click Detect to run.
| Method | Approach | mAP@0.5 on ActivityNet-1.3 |
|---|---|---|
| BMN (2019) | Proposal + classification | 50.1% |
| VSGN (2021) | Graph-based proposals | 52.4% |
| ActionFormer (2022) | Anchor-free pyramid + local attn | 54.7% |
| TriDet (2023) | Trident head on pyramid | 55.4% |
All the methods so far process raw pixels. But actions are fundamentally about body movement. What if we skip the pixels entirely and work with skeleton keypoints — the (x, y) coordinates of body joints over time? This is the idea behind skeleton-based action recognition.
A pose estimator (like OpenPose or HRNet) extracts N joint positions per frame. For a standard body model, N=17 joints (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles). Over T frames, the input is a tensor of shape [N, 3, T]: N joints, each with (x, y, confidence), across T time steps. This is dramatically smaller than raw video: 17 × 3 × 64 = 3,264 numbers vs 64 × 3 × 224 × 224 = 9.6 million.
ST-GCN (Yan et al., 2018) defines two types of edges: spatial edges (bones connecting joints in a single frame, such as shoulder-to-elbow or hip-to-knee) and temporal edges (the same joint across consecutive frames, such as left wrist at time t to left wrist at time t+1). A graph convolution aggregates features from neighboring nodes:
$$h_v^{\text{out}} = \sum_{u \in N(v)} \frac{1}{Z_v(u)} \, W(u) \, h_u^{\text{in}}$$
Where h_u^in is the input feature of joint u, N(v) is the set of neighbors of joint v in the spatiotemporal graph, Z_v(u) is a normalizing term (the number of neighbors in the same partition as u), and W(u) is a learnable weight that depends on the relative position of u to v (center, centripetal, centrifugal partitioning). Stack multiple ST-GCN layers, and the receptive field grows: a wrist learns about what the elbow and shoulder are doing, which in turn know about the torso.
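A heavily simplified sketch of neighbor aggregation on a skeleton graph. It uses spatial edges only, a single shared weight matrix instead of the three-way partitioning, and a toy 3-joint arm, so it illustrates the receptive-field growth rather than reproducing ST-GCN itself.

```python
import numpy as np


def normalized_adjacency(num_joints, bones):
    """Adjacency with self-loops, row-normalized so each joint averages its neighbors."""
    A = np.eye(num_joints)
    for i, j in bones:
        A[i, j] = A[j, i] = 1.0
    return A / A.sum(axis=1, keepdims=True)


def graph_conv(X, A_norm, W):
    """X: [N, C_in], A_norm: [N, N], W: [C_in, C_out] -> [N, C_out], with ReLU."""
    return np.maximum(A_norm @ X @ W, 0.0)


# Toy arm: joint 0 = shoulder, 1 = elbow, 2 = wrist.
bones = [(0, 1), (1, 2)]
A_norm = normalized_adjacency(3, bones)

X = np.array([[1.0, 0.0],   # only the shoulder carries a signal
              [0.0, 0.0],
              [0.0, 0.0]])
W = np.eye(2)               # identity weights, to isolate the graph effect

h1 = graph_conv(X, A_norm, W)
h2 = graph_conv(h1, A_norm, W)
print(h1[2, 0], h2[2, 0] > 0)  # 0.0 True
```

After one layer the wrist has heard nothing from the shoulder, since they are two hops apart. After two layers the shoulder's signal has reached it through the elbow.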
A skeleton performing an action over time. Orange = spatial edges (bones). Teal dotted = temporal edges (same joint across time). Click Play to animate. The GCN "reads" both edge types simultaneously.
Why skeletons matter: they're invariant to background, lighting, clothing, and camera angle. A "waving" skeleton looks the same whether you're in a park or a kitchen, wearing red or blue. This makes skeleton models excellent for cross-domain generalization. The tradeoff: you lose object and scene context (can't distinguish "eating an apple" from "eating a sandwich" by skeleton alone).
Training an action recognition model is expensive. A single video clip is 16–64 frames, each a full image. Backpropagating through a 3D CNN or video transformer on a batch of clips requires 4–16× more memory than image training. Here's how practitioners handle it.
You can't feed an entire video to the network. Instead, sample short clips during training. Common strategies:
| Strategy | Method | Used by |
|---|---|---|
| Uniform | Divide video into T segments, sample 1 frame per segment | TSN, TSM |
| Random | Pick a random start point, take T consecutive frames | C3D, I3D |
| Multi-clip | At test time, sample K clips from different positions, average scores | SlowFast, ViViT |
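The three strategies in the table can be sketched with the standard library alone. The function names are made up for illustration, and real codebases differ in details such as temporal stride and jitter.

```python
import random


def uniform_segment_sample(num_frames, num_segments, rng):
    """Split the video into equal segments and draw one frame index from each."""
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def random_clip_sample(num_frames, clip_len, rng):
    """Pick a random start and take clip_len consecutive frame indices."""
    start = rng.randrange(0, num_frames - clip_len + 1)
    return list(range(start, start + clip_len))


def multi_clip_starts(num_frames, clip_len, num_clips):
    """Evenly spaced clip start positions for test-time averaging."""
    if num_clips == 1:
        return [(num_frames - clip_len) // 2]
    step = (num_frames - clip_len) / (num_clips - 1)
    return [round(i * step) for i in range(num_clips)]


rng = random.Random(0)
uniform_idx = uniform_segment_sample(num_frames=300, num_segments=8, rng=rng)
clip_idx = random_clip_sample(num_frames=300, clip_len=16, rng=rng)
starts = multi_clip_starts(num_frames=300, clip_len=16, num_clips=3)

print(len(uniform_idx), len(clip_idx), starts)  # 8 16 [0, 142, 284]
```

Uniform sampling covers the whole video with few frames, while a random consecutive clip sees smooth motion over a short window. That difference is the main trade-off between the first two rows of the table.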
Standard benchmarks differ in size and in what they test:
| Dataset | Classes | Size | Tests |
|---|---|---|---|
| Kinetics-400 | 400 | ~306K | Appearance + motion |
| Something-Something v2 | 174 | ~221K | Temporal reasoning (egocentric) |
| AVA v2.2 | 80 | 430 videos | Spatiotemporal detection (who does what where) |
| EPIC-Kitchens-100 | 97 verbs, 300 nouns | 90K segments | Egocentric, fine-grained |
| ActivityNet | 200 | ~20K | Untrimmed temporal detection |
Kinetics is the ImageNet of video — large-scale, diverse, good for pretraining. Something-Something is the stress test — "pushing something from left to right" requires understanding motion direction, not just scene context. Models that cheat with scene bias fail here. AVA combines detection with recognition — for each person in each keyframe, predict a set of actions (one person can be "walking" and "talking on phone" simultaneously).
Training a SlowFast R-101 on Kinetics-400 for 256 epochs takes ~128 GPU-days on V100s. A ViViT-L takes ~200 GPU-days. VideoMAE pretraining adds another ~100 GPU-days. This is why transfer learning from Kinetics is standard — most labs can't afford to train from scratch.
A video timeline with 32 frames. See how uniform, random, and dense sampling select different frames. Click Sample to resample.
Action recognition has evolved through a clear lineage, each generation addressing the limitations of the last:
| Era | Method | Key idea | Limitation |
|---|---|---|---|
| 2014 | Two-Stream CNNs | Separate appearance + motion | Requires precomputed optical flow |
| 2015 | C3D / 3D CNNs | Learn spatiotemporal features jointly | No ImageNet pretraining, small datasets |
| 2017 | I3D (inflate 2D→3D) | Transfer ImageNet features to video | Fixed temporal window, expensive |
| 2019 | SlowFast | Dual frame rate, asymmetric design | Still CNN-based, limited long-range |
| 2021 | TimeSformer / ViViT | Transformer attention over space+time | Quadratic cost, needs large data |
| 2022 | VideoMAE | Self-supervised pretraining for video | Pretraining cost, still clip-level |
| 2023+ | Video foundation models | Text-video pretraining, zero-shot | Massive compute, emerging field |
Upstream: Transformers (the attention mechanism behind video ViTs), Contrastive Learning & CLIP (text-video alignment for zero-shot recognition).
Downstream: Vision-Language Models (describing actions in natural language), World Models (predicting future actions and states from video).
Adjacent: Vision-Language-Action Models (from recognizing actions to performing them in robots).
"The purpose of computing is insight, not numbers." — Richard Hamming