Find when actions happen in untrimmed video using local self-attention on a multiscale temporal feature pyramid. No proposals, no anchors — classify every moment and regress its boundaries in a single shot.
You have a 30-minute untrimmed video of a soccer match. Somewhere in there, a player does a "bicycle kick" from 12:03 to 12:05, a "header" from 18:30 to 18:32, and a "tackle" from 22:10 to 22:14. The goal of Temporal Action Localization (TAL) is to find all action instances: their start time, end time, and category label.
This is much harder than action classification (which tells you if a 3-second clip contains a kick — yes or no). TAL must answer: when does the kick start, when does it end, and what is happening at every other moment (background or another action)?
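Prior approaches fall into two camps: two-stage methods, which first generate candidate action proposals and then classify and refine them, and single-stage methods, which classify pre-defined anchor windows directly.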
ActionFormer takes the simplest possible approach: classify every single moment in the video as either background or one of C action categories, and regress the distance from that moment to the nearest action boundary. No proposals. No anchors. Just a Transformer that looks at every moment and asks: "Is this an action? If so, how far are its boundaries?"
An important design decision: ActionFormer does not process raw video frames. It operates on pre-extracted features from a frozen video backbone (I3D, SlowFast). Each input "time step" is a feature vector representing a 1-second video clip, not a single frame. This decouples the temporal reasoning from the visual representation, just as FCOS decouples detection from feature extraction.
A timeline with embedded action instances (colored). The model must detect each action's start, end, and label. Most of the video is background (gray).
ActionFormer's core insight: treat every moment as an action candidate and let the model decide. Instead of generating proposals (two-stage) or using pre-defined anchor windows (single-stage), just classify every time step and regress its boundaries directly.
This is the temporal equivalent of what FCOS did for object detection: instead of anchor boxes, every spatial location predicts a class and distances to the box edges. ActionFormer does the same thing in 1D: every temporal location predicts a class and distances to the action's onset and offset.
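For every time step $t$ in the video, ActionFormer outputs a probability $p(a_t)$ for each of the C action categories, the distance $d^s_t$ from $t$ back to the action's onset, and the distance $d^e_t$ from $t$ forward to its offset.
From these, decoding an action is trivial: start = $t - d^s_t$, end = $t + d^e_t$, label = $\arg\max p(a_t)$.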
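The decoding rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `decode_moment` and the example numbers (a time step at 724 s inside the 12:03 to 12:05 bicycle kick, with class index 1 standing for that action) are assumptions made for the example.

```python
from typing import List, Tuple

def decode_moment(t: float, d_start: float, d_end: float,
                  class_probs: List[float]) -> Tuple[float, float, int, float]:
    """Turn one time step's outputs into (start, end, label, score)."""
    # The label is the category with the highest predicted probability.
    label = max(range(len(class_probs)), key=lambda c: class_probs[c])
    return t - d_start, t + d_end, label, class_probs[label]

# Hypothetical outputs for the time step at 724 s (12:04).
start, end, label, score = decode_moment(724.0, 1.0, 1.0, [0.05, 0.90, 0.10])
print(start, end, label, score)
```

With these inputs the decoded segment spans 723 s to 725 s, which matches the 12:03 to 12:05 kick in the running example.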
ActionFormer follows a clean encoder-decoder design. The encoder is a multiscale Transformer that builds a temporal feature pyramid. The decoder is a lightweight convolutional network with shared heads.
The input features $X = \{x_1, \ldots, x_T\}$ (each $x_t \in \mathbb{R}^{D_{in}}$, e.g., $D_{in} = 2048$ for I3D) are projected to D = 512 dimensions using a shallow 1D convolutional network with ReLU activation.
Adding convolutions before the Transformer was found to help incorporate local context and stabilize training.
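Each block applies local multi-headed self-attention (MSA) followed by an MLP, with LayerNorm (LN) before each and a residual connection after. Writing $Z^{\ell-1}$ for the block's input (the projected features for the first block): $\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}(\mathrm{LN}(Z^{\ell-1})) + Z^{\ell-1}$, then $\hat{Z}^{\ell} = \bar{\alpha}^{\ell}\,\mathrm{MLP}(\mathrm{LN}(\bar{Z}^{\ell})) + \bar{Z}^{\ell}$, and finally $Z^{\ell} = \downarrow(\hat{Z}^{\ell})$,
where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are per-channel learnable scaling factors, and $\downarrow$ is optional 2x downsampling via strided depthwise 1D convolution. The first 2 blocks operate at full resolution T; the remaining 5 blocks downsample by 2x each, creating pyramid levels at T, T/2, T/4, T/8, T/16, T/32.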
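Two separate but architecturally identical convolutional heads are attached to every pyramid level: a classification head that predicts the per-category probabilities, and a regression head that predicts the distances to the action's onset and offset.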
Both heads share weights across all pyramid levels. A coarser level naturally has larger regression targets (longer actions live at coarser resolutions), so the regression range is normalized by the feature stride of each level. This normalization is crucial: without it, the regression head would need to output values spanning from 2 time steps (short action at level 1) to 500+ time steps (long action at level 7) — an impossible dynamic range for a shared network. With stride normalization, all targets are in the same manageable range.
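To make the stride normalization concrete, here is a small sketch. It assumes that the feature at index `i` of a level with stride `s` sits at time `i * s` (measured in full-resolution steps) and ignores any half-stride offset; the function names are hypothetical.

```python
def normalized_targets(index, stride, start, end):
    """Regression targets, in units of the level's stride, for one feature."""
    t = index * stride  # assumed position of this feature in full-resolution steps
    return (t - start) / stride, (end - t) / stride

def denormalize(index, stride, d_start, d_end):
    """Map stride-relative predictions back to full-resolution start and end."""
    t = index * stride
    return t - d_start * stride, t + d_end * stride

# A 2-step action seen at stride 1 and a 500-step action seen at stride 32.
print(normalized_targets(50, 1, 49, 51))
print(normalized_targets(10, 32, 100, 600))
```

The short action yields targets of 1 and 1, and the long action yields targets below 10, so both fall in a comparable numeric range for the shared head.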
Pre-extracted features flow through projection, Transformer blocks with downsampling, and shared decoder heads at each pyramid level.
The feature pyramid is ActionFormer's secret weapon. It elegantly solves the problem of variable-duration actions by distributing different temporal scales across different levels.
A "long jump" lasts 3 seconds. A "cooking" activity lasts 3 minutes. If you only look at the finest temporal resolution, detecting a 3-minute activity requires a receptive field of 180 time steps (at 1 fps). That's impractical with local attention. But at 32x downsampled resolution, those 180 steps become just 6 — easily captured by a window of 19.
With L = 7 Transformer blocks and 2x downsampling on the last 5:
| Level | Resolution | Stride | Local Window | Effective Range | Regression Range |
|---|---|---|---|---|---|
| Z1 | T | 1 | 19 | 19 steps | [0, 4) |
| Z2 | T | 1 | 19 | 19 steps | [4, 8) |
| Z3 | T/2 | 2 | 19 | 38 steps | [8, 16) |
| Z4 | T/4 | 4 | 19 | 76 steps | [16, 32) |
| Z5 | T/8 | 8 | 19 | 152 steps | [32, 64) |
| Z6 | T/16 | 16 | 19 | 304 steps | [64, 128) |
| Z7 | T/32 | 32 | 19 | 608 steps | [128, ∞) |
Each level specializes in a different duration range. Short actions (a few seconds) are detected at levels Z1-Z2. Long actions (several minutes) are detected at Z6-Z7. The regression range per level is normalized by the stride, roughly doubling with each level.
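The stride and effective-range columns of the table follow mechanically from the layout described in the text (7 blocks, the first 2 at full resolution, window 19). The sketch below recomputes them; the function names are illustrative and not taken from any released code.

```python
import math

def pyramid_levels(num_blocks=7, num_full_res=2, window=19):
    """List (name, stride, effective range in input steps) for each block output."""
    levels, stride = [], 1
    for block in range(1, num_blocks + 1):
        if block > num_full_res:
            stride *= 2  # each later block halves the temporal resolution
        levels.append((f"Z{block}", stride, window * stride))
    return levels

def steps_at_stride(duration_steps, stride):
    """Number of feature steps an action of the given duration spans at a stride."""
    return math.ceil(duration_steps / stride)

for name, stride, span in pyramid_levels():
    print(f"{name}: stride {stride:2d}, window covers {span} input steps")
print(steps_at_stride(180, 32))  # the 3-minute activity at the coarsest level
```

The last line reproduces the earlier observation that a 180-step activity shrinks to about 6 steps at 32x downsampling, well inside a 19-step window.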
The pyramid has 7 levels at decreasing temporal resolutions. Short actions are detected at fine levels (top), long actions at coarse levels (bottom). The orange boxes show the effective temporal range at each level.
Standard (global) self-attention computes similarity between every pair of time steps. For a video with T = 2048 time steps, that's $T^2 \approx 4.2$ million pairs. This is $O(T^2 D)$ in both time and memory, which is prohibitively expensive for long videos.
ActionFormer's solution: local self-attention. Each time step only attends to its W nearest neighbors (window size W = 19). Every one of the T queries then scores W keys instead of T, which reduces complexity from $O(T^2 D)$ to $O(W T D)$. Since W is a small constant (19 << T), this is effectively $O(T D)$.
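For each time step t, self-attention is computed only within the window [t − ⌊W/2⌋, t + ⌊W/2⌋]. The queries, keys, and values are computed as usual, as learned linear projections of the block input Z: $Q = Z W_Q$, $K = Z W_K$, $V = Z W_V$. Only the softmax over keys is restricted to the window.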
The key observation: temporal context beyond a certain range is less helpful for action localization. Whether something happened 5 minutes ago rarely matters for detecting the current action. But what happened 10 seconds ago matters a lot. Local attention captures exactly this inductive bias.
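A plain-Python sketch of windowed attention shows both the mechanism and the saving in attended pairs. It is a single-head toy version with no learned projections, written for clarity rather than speed; real implementations are vectorized, and the function names here are assumptions.

```python
import math

def local_self_attention(q, k, v, window):
    """Single-head attention where step t only sees steps within window // 2 of t.

    q, k, v are lists of T vectors (lists of floats); window is an odd integer W.
    """
    num_steps, half = len(q), window // 2
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for t in range(num_steps):
        lo, hi = max(0, t - half), min(num_steps, t + half + 1)
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s])) for s in range(lo, hi)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[s][d] for w, s in zip(weights, range(lo, hi))) / total
            for d in range(len(v[0]))
        ])
    return outputs

def attended_pairs(num_steps, window):
    """Count query-key pairs actually scored, with windows clipped at the ends."""
    half = window // 2
    return sum(min(num_steps, t + half + 1) - max(0, t - half) for t in range(num_steps))

print(attended_pairs(2048, 19), "local pairs vs", 2048 * 2048, "global pairs")
```

For T = 2048 and W = 19 this scores 38,822 pairs instead of 4,194,304, roughly a 100x reduction.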
The ablation confirms this: replacing local attention (W=19) with global attention actually decreases average mAP by 0.9% on THUMOS14 (from 66.8% to 65.9%). Global attention dilutes the model's focus with irrelevant distant context and increases computation. Local attention is both cheaper and better.
Surprisingly, ActionFormer works better without positional encoding. The ablation shows that adding sinusoidal or learned positional encodings slightly hurts performance. The reason: the projection convolutions and strided depthwise convolutions in the Transformer blocks already leak positional information (convolutions are inherently position-aware through their structure). Adding explicit positional encoding is redundant and slightly harmful.
This also means ActionFormer can process any length video at inference time — there's no positional encoding that was trained for a specific sequence length.
Left: global attention ($O(T^2)$ pairs). Right: local attention with window W = 5 ($O(W \cdot T)$ pairs). The comparison shows which positions the selected time step (orange) attends to in each case.
The decoder is deliberately simple — the heavy lifting is done by the Transformer encoder. Two lightweight 1D convolutional heads are applied to every time step at every pyramid level.
The classification head has three layers of 1D convolutions (kernel size 3, 512 channels), with LayerNorm on the first two layers and ReLU activation. The final layer outputs C channels (one per action category) followed by sigmoid. This is a multi-label formulation: each category is an independent binary decision. A time step can theoretically be labeled with multiple actions simultaneously (though this is rare in practice).
The regression head has the same architecture as the classification head. Its final layer outputs 2 channels ($d^s$, $d^e$) followed by ReLU to ensure non-negative distances. The distances are normalized by the stride of the current pyramid level, so the network always predicts in "level-relative" units regardless of the actual temporal scale.
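The two output activations are simple enough to write out. The sketch below covers only the final activations, not the convolutional layers, and the function name is hypothetical.

```python
import math

def head_activations(class_logits, raw_distances):
    """Apply the final activations of the two heads for one time step."""
    # Sigmoid per category: independent binary decisions, not a softmax.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in class_logits]
    # ReLU keeps the two boundary distances non-negative.
    dists = [max(0.0, d) for d in raw_distances]
    return probs, dists

probs, dists = head_activations([3.0, 3.0, -4.0], [-0.2, 2.5])
print(probs, dists)
```

Because each category goes through its own sigmoid, two categories can both score close to 1 at the same time step, which is what makes the formulation multi-label.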
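The total loss per video has two terms: a classification loss computed at every time step of every pyramid level, and a regression loss computed at the positive time steps only.
The classification term is a focal loss, which down-weights the abundant, easily classified background moments. The regression term is a DIoU loss on the predicted segment.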
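Standard L1 or L2 regression on $(d^s, d^e)$ has a problem: when the predicted segment and the ground truth segment don't overlap at all, L1/L2 still gives a gradient, but it doesn't reflect how close they are in terms of temporal overlap. DIoU (Distance IoU) loss combines two signals: the temporal IoU between the predicted and ground-truth segments, and the distance between their centers, normalized by the length of the smallest span that encloses both.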
The regression loss is only applied to positive samples (time steps within an action). Background moments have no meaningful (ds, de) target.
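The two loss terms can be sketched as follows. The focal-loss defaults (`alpha=0.25`, `gamma=2.0`) are the usual RetinaNet values and are assumed here, as are the regression weight of 1 and the normalization by the number of positive time steps; none of these are confirmed by the text above.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one category: p is the sigmoid output, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

def diou_loss_1d(pred, gt):
    """1D DIoU loss between two (start, end) segments of positive length."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    iou = inter / union
    center_gap = (pred[0] + pred[1]) / 2 - (gt[0] + gt[1]) / 2
    enclosing = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return 1.0 - iou + (center_gap / enclosing) ** 2

def total_loss(samples, reg_weight=1.0):
    """samples: (class_probs, class_targets, pred_segment, gt_segment or None)."""
    cls = sum(focal_loss(p, y) for probs, ys, _, _ in samples for p, y in zip(probs, ys))
    # Only positive time steps (those with a ground-truth segment) are regressed.
    positives = [(pred, gt) for _, _, pred, gt in samples if gt is not None]
    reg = sum(diou_loss_1d(pred, gt) for pred, gt in positives)
    return (cls + reg_weight * reg) / max(1, len(positives))

# Two non-overlapping predictions: the farther one is penalized more.
print(diou_loss_1d((0, 2), (4, 6)), diou_loss_1d((0, 2), (10, 12)))
```

Both predictions in the last line have zero IoU with their targets, yet the center-distance term still separates them, which is the property a pure IoU loss lacks.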
ActionFormer is trained end-to-end with Adam optimizer on the combined classification + regression loss. Several training tricks are critical for performance.
Not every time step within an action is labeled as positive. Only time steps near the center of the action (within α = 1.5 strides of the center) are considered positive for training. This has two benefits:
Center sampling adds +1.4% average mAP on THUMOS14. It doesn't affect inference — at test time, every time step still makes a prediction.
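A small sketch of center sampling at one pyramid level follows. It assumes that feature index `i` sits at time `i * stride` and that the sampling interval is clipped to the action's own extent; both choices, and the function name, are assumptions for illustration.

```python
def center_sampled_positives(num_steps, stride, start, end, alpha=1.5):
    """Indices at one level that count as positives for the action [start, end]."""
    center = (start + end) / 2
    radius = alpha * stride
    # Stay within alpha strides of the center, and never leave the action itself.
    lo = max(start, center - radius)
    hi = min(end, center + radius)
    return [i for i in range(num_steps) if lo <= i * stride <= hi]

# A 20-step action (40 to 60) seen at stride 1 and at stride 4.
print(center_sampled_positives(100, 1, 40, 60))
print(center_sampled_positives(25, 4, 40, 60))
```

At stride 1 only the three steps around the center (49, 50, 51) are positives, even though 21 steps lie inside the action.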
| Setting | THUMOS14 | ActivityNet 1.3 | EPIC-Kitchens 100 |
|---|---|---|---|
| Features | I3D (2-stream) | I3D / R(2+1)D+TSP | SlowFast |
| Feature dim | 2048 | 2048 | 2304 |
| Max seq length | 2304 | 192 (downsampled) | 2304 |
| Attention window | 19 | 11 | 9 |
| Epochs | 30 | 15 | 30 |
| Optimizer | Adam | Adam | Adam |
| Learning rate | 1e-4 | 1e-4 | 1e-4 |
| Warmup | 5 epochs | 5 epochs | 5 epochs |
Videos vary wildly in duration: THUMOS14 videos can be 2-6 minutes, ActivityNet videos can be 5-30+ minutes. During training, sequences are padded or cropped to a fixed maximum length, with proper attention masking to prevent padded positions from influencing the output. This is equivalent to training with sliding windows.
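The pad-or-crop step with its validity mask might look like the sketch below. Cropping a random contiguous window and padding with zeros are assumptions made for the example, and `pad_or_crop` is a hypothetical helper name.

```python
import random

def pad_or_crop(features, max_len, pad_value=0.0, rng=random):
    """Return (sequence of exactly max_len steps, mask with True for real steps)."""
    num_steps = len(features)
    if num_steps >= max_len:
        # Too long: keep one random contiguous window, as in sliding-window training.
        offset = rng.randint(0, num_steps - max_len)
        return features[offset:offset + max_len], [True] * max_len
    # Too short: append padding vectors and mark them invalid in the mask.
    dim = len(features[0])
    padding = [[pad_value] * dim for _ in range(max_len - num_steps)]
    mask = [True] * num_steps + [False] * (max_len - num_steps)
    return features + padding, mask

short = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
padded, mask = pad_or_crop(short, 5)
print(len(padded), mask)
```

The mask is what attention uses to ignore the padded positions, so they contribute nothing to the real time steps.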
An important finding: varying the maximum input sequence length during training has little impact on performance. The model generalizes to different lengths at inference because (a) there's no positional encoding, and (b) local self-attention doesn't depend on absolute position — only relative context within the window matters.
At inference, the full video sequence is fed through ActionFormer in a single forward pass. Since there's no positional encoding, the model handles any video length.
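The pre-extracted features $X = \{x_1, \ldots, x_T\}$ go through the encoder, producing the pyramid $Z = \{Z^1, \ldots, Z^L\}$. The shared heads produce $(p(a_t), d^s_t, d^e_t)$ for every time step t at every level l.
Each time step t at level l produces a candidate action with start $t - d^s_t$, end $t + d^e_t$, label $\arg\max p(a_t)$, and a confidence score equal to the probability of that label.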
With L = 7 levels and a video of T = 2048 steps, this produces roughly T + T/2 + T/4 + ... = ~4000 candidates. Most will be background (low p(at)) and are filtered by a confidence threshold.
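The candidate count and the thresholding step can be sketched as follows. The code assumes that distances are predicted in stride units and that feature index `i` sits at time `i * stride`; the threshold value of 0.1 and the function names are illustrative only.

```python
def decode_level(class_probs, distances, stride, score_threshold=0.1):
    """Decode one pyramid level into (start, end, label, score) candidates."""
    candidates = []
    for i, (probs, (d_start, d_end)) in enumerate(zip(class_probs, distances)):
        label = max(range(len(probs)), key=lambda c: probs[c])
        score = probs[label]
        if score < score_threshold:
            continue  # low confidence: treated as background
        t = i * stride  # assumed position in full-resolution steps
        candidates.append((t - d_start * stride, t + d_end * stride, label, score))
    return candidates

def count_candidates(num_steps, strides):
    """Total number of time steps, and hence raw candidates, across the strides."""
    return sum(num_steps // s for s in strides)

# Summing the series T + T/2 + T/4 + ... down to T/32 for T = 2048.
print(count_candidates(2048, (1, 2, 4, 8, 16, 32)))
```

For T = 2048 the series sums to 4,032 raw candidates, in line with the rough figure of 4,000 quoted above.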
Multiple candidates may overlap in time (especially from adjacent time steps and adjacent pyramid levels). Soft-NMS suppresses overlapping detections by decaying their confidence scores rather than hard-deleting them. This is gentler than standard NMS and preserves detections of overlapping actions (e.g., simultaneous "running" and "dribbling").
Each pyramid level independently produces detections. A short "kick" action detected at level Z1 might also produce a (weaker) detection at level Z3. These cross-level duplicates are handled naturally by Soft-NMS: the higher-confidence detection survives, and the duplicate's score gets decayed. This is why a single round of Soft-NMS suffices — it handles both within-level and cross-level duplicates simultaneously.
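A compact sketch of 1D Soft-NMS with Gaussian score decay is shown below. It assumes all segments belong to one category, and the values of `sigma` and `min_score` are illustrative defaults rather than the settings used in the experiments.

```python
import math

def tiou(a, b):
    """Temporal IoU of two segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(segments, sigma=0.5, min_score=0.001):
    """segments: list of (start, end, score). Returns kept segments, best first."""
    remaining = [list(s) for s in segments]
    kept = []
    while remaining:
        best = max(remaining, key=lambda s: s[2])
        remaining.remove(best)
        kept.append(tuple(best))
        for s in remaining:
            # Decay the score smoothly with overlap instead of deleting the segment.
            s[2] *= math.exp(-tiou(best, s) ** 2 / sigma)
        remaining = [s for s in remaining if s[2] >= min_score]
    return kept

detections = [(10, 20, 0.9), (11, 21, 0.8), (50, 60, 0.7)]
for seg in soft_nms(detections):
    print(seg)
```

In this example the near-duplicate at 11 to 21 is kept but drops from 0.8 to roughly 0.21, while the distant segment at 50 to 60 keeps its score of 0.7 and moves up to second place.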
The regression targets at each level are clipped to a predefined range. Level Z1 (stride 1) only predicts actions with ds + de in [0, 4) time steps. Level Z7 (stride 32) predicts actions with ds + de in [128, ∞). This prevents a fine-resolution level from trying to predict a 5-minute action (which would require huge regression values) and vice versa.
At inference, the dominant cost is the Transformer encoder's self-attention. With local attention (window W = 19) and L = 7 blocks, the cost per block is O(W · Tl · D) where Tl is the sequence length at level l. Since Tl halves at each level (after the first two), the total cost is roughly O(W · T · D · L) — linear in video length. A 30-minute video processed in a single forward pass takes well under a second on a modern GPU, making ActionFormer practical for real-world applications.
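The linear bound can be tightened with a short calculation. Assuming, as described above, two blocks at full length T followed by five blocks whose lengths halve from T/2 down to T/32, the lengths form a geometric series:

```latex
\sum_{l=1}^{7} T_l
  = T + T + \frac{T}{2} + \frac{T}{4} + \frac{T}{8} + \frac{T}{16} + \frac{T}{32}
  = \left(2 + \frac{31}{32}\right) T
  < 3T
\quad\Longrightarrow\quad
\text{total attention cost} = O\!\left(W \cdot D \cdot \sum_{l} T_l\right) = O(W \cdot T \cdot D).
```

Under that assumption the whole encoder costs less than three full-resolution blocks, so the factor L in the bound above is pessimistic.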
Raw predictions from multiple pyramid levels are decoded into candidate actions, then refined by Soft-NMS into final detections.
ActionFormer establishes new state of the art on all three major TAL benchmarks, surpassing both two-stage and single-stage methods by large margins.
| Method | Type | mAP@0.5 ↑ | mAP@0.7 ↑ | Avg mAP ↑ |
|---|---|---|---|---|
| BMN | Two-stage | 38.8% | 20.5% | 38.5% |
| G-TAD | Two-stage | 40.3% | 23.4% | 39.3% |
| MUSES | Two-stage | 56.9% | 31.0% | — |
| AFSD | Single-stage | 55.5% | 31.1% | 52.0% |
| TadTR | Single-stage | 49.2% | 26.3% | 46.6% |
| ActionFormer | Single-stage | 71.0% | 43.9% | 66.8% |
ActionFormer achieves 71.0% mAP at tIoU=0.5, which is +15.5 absolute percentage points over the best prior single-stage method (AFSD at 55.5%) and +14.1 over the best two-stage method (MUSES at 56.9%).
| Method | Verb Avg mAP ↑ | Noun Avg mAP ↑ |
|---|---|---|
| BMN | 8.4% | 6.5% |
| G-TAD | 9.4% | 8.4% |
| ActionFormer | 23.5% | 21.9% |
On the challenging egocentric EPIC-Kitchens 100 dataset, ActionFormer outperforms the stronger baseline, G-TAD, by +14.1 average mAP on verbs and +13.5 on nouns.
| Change | Avg mAP | Δ |
|---|---|---|
| Full ActionFormer | 66.8% | — |
| Replace Transformer with 1D ConvNet | 52.9% | −13.9% |
| Remove LayerNorm in heads | 62.7% | −4.1% |
| Remove center sampling | 65.4% | −1.4% |
| Add positional encoding | 66.6% | −0.2% |
| Global attention (no local window) | 65.9% | −0.9% |
Average mAP on THUMOS14 [0.3:0.1:0.7]. ActionFormer (rightmost) vastly outperforms all prior methods.
ActionFormer bridges ideas from object detection, NLP sequence modeling, and video understanding into a clean, unified design.
ActionFormer is essentially FCOS adapted to 1D. FCOS classifies every spatial location and regresses distances to the bounding box edges. ActionFormer classifies every temporal location and regresses distances to action boundaries. The feature pyramid, center sampling, and anchor-free design all come directly from the FCOS lineage (FPN → RetinaNet → FCOS → ActionFormer).
The mapping is almost one-to-one: FPN's 2D feature maps become 1D temporal sequences. FCOS's 2D center sampling becomes 1D center sampling. RetinaNet's focal loss is adopted without changes. Even the regression normalization by stride follows FCOS exactly. This direct inheritance is a strength: decades of object detection research transfer cleanly to the temporal domain.
TadTR (a concurrent work) uses a DETR-style set prediction approach for TAL: learned object queries, Hungarian matching, no feature pyramid. ActionFormer takes the opposite approach: no learned queries, no bipartite matching, just dense per-moment prediction with NMS. ActionFormer significantly outperforms TadTR (66.8% vs 46.6% on THUMOS14), suggesting that the dense prediction paradigm is more natural for TAL than set prediction.
ActionFormer's local self-attention within a hierarchical pyramid directly parallels Swin Transformer's window attention with shifted windows in 2D. Both use local attention for efficiency and rely on the hierarchical structure for global context. ActionFormer skips the "shifted window" trick (unnecessary in 1D with the pyramid providing cross-level communication).
| Aspect | ActionFormer |
|---|---|
| Input | Pre-extracted video features [T, Din] |
| Output | (class, start, end) for each detected action |
| Encoder | 7-block multiscale Transformer, local attn W=19 |
| Decoder | Shared 1D conv heads (3 layers each) |
| Pyramid | 6 levels, 2x downsampling, strides 1–32 |
| Positional encoding | None (intentionally omitted) |
| Loss | Focal loss (cls) + DIoU loss (reg) |
| Post-processing | Soft-NMS |
| Key result | 71.0% mAP@0.5 on THUMOS14 (+14.1 points over the best prior method) |
| Training | 30 epochs, single GPU, few hours |