ActionFormer — Veanors

Chapter 0: The Problem

You have a 30-minute untrimmed video of a soccer match. Somewhere in there, a player does a "bicycle kick" from 12:03 to 12:05, a "header" from 18:30 to 18:32, and a "tackle" from 22:10 to 22:14. The goal of Temporal Action Localization (TAL) is to find all action instances: their start time, end time, and category label.

This is much harder than action classification (which tells you if a 3-second clip contains a kick — yes or no). TAL must answer: when does the kick start, when does it end, and what is happening at every other moment (background or another action)?

The fundamental challenge: Actions vary wildly in duration. A "long jump" lasts 3 seconds. A "cooking activity" lasts 3 minutes. A model needs to capture temporal context at many scales simultaneously. And in a 30-minute video with only 5 action instances, over 99% of moments are background — extreme class imbalance.

Prior approaches fall into two camps:

Two-stage methods (BMN, G-TAD): First generate hundreds of "action proposals" (candidate temporal segments), then classify each one. Accurate but complex — the proposal generation itself is a multi-step pipeline.
Single-stage methods (A2Net, AFSD): Use anchor windows or sliding windows, classify each anchor directly. Simpler but limited by anchor design.

ActionFormer takes the simplest possible approach: classify every single moment in the video as either background or one of C action categories, and regress the distance from that moment to the nearest action boundary. No proposals. No anchors. Just a Transformer that looks at every moment and asks: "Is this an action? If so, how far are its boundaries?"

An important design decision: ActionFormer does not process raw video frames. It operates on pre-extracted features from a frozen video backbone (I3D, SlowFast). Each input "time step" is a feature vector representing a 1-second video clip, not a single frame. This decouples the temporal reasoning from the visual representation, just as FCOS decouples detection from feature extraction.

Full data flow at a glance: Untrimmed video → Pre-extracted clip features X = {x₁, ..., x_T} (e.g., I3D at 1 fps) → 1D convolution projection: [T, D_in] → [T, D] → L Transformer blocks with local self-attention + 2x downsampling → feature pyramid Z at resolutions T, T/2, T/4, T/8, T/16, T/32 → Shared classification head: every moment → C action probabilities (sigmoid, focal loss) → Shared regression head: every moment → (d_s, d_e) distances to action onset/offset (DIoU loss) → Soft-NMS → Final detections.

Temporal Action Localization

A timeline with embedded action instances (colored). The model must detect each action's start, end, and label. Most of the video is background (gray).

What makes temporal action localization harder than action classification?

TAL must find the precise start and end times of variable-duration actions within a long untrimmed video, not just classify a pre-trimmed clip; on top of that, over 99% of moments are background
TAL requires higher resolution video
TAL needs more GPU memory for feature extraction

Chapter 1: The Key Insight

ActionFormer's core insight: treat every moment as an action candidate and let the model decide. Instead of generating proposals (two-stage) or using pre-defined anchor windows (single-stage), just classify every time step and regress its boundaries directly.

This is the temporal equivalent of what FCOS did for object detection: instead of anchor boxes, every spatial location predicts a class and distances to the box edges. ActionFormer does the same thing in 1D: every temporal location predicts a class and distances to the action's onset and offset.

The Representation

For every time step t in the video, ActionFormer outputs:

p(a_t): C probability values, one per action category (via sigmoid — multi-label, not softmax). If all are below threshold, t is background.
d_s^t: Distance from t to the action's onset (start). Always positive.
d_e^t: Distance from t to the action's offset (end). Always positive.

From these, decoding an action is trivial: start = t − d_s, end = t + d_e, label = argmax p(a_t).
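The decoding rule above can be sketched in a few lines. This is a minimal illustration, not the reference implementation; the function name `decode_moment` and the example numbers are hypothetical.

```python
from typing import List, Tuple


def decode_moment(t: float, d_s: float, d_e: float,
                  probs: List[float]) -> Tuple[float, float, int, float]:
    """Turn one time step's outputs into (start, end, label, score)."""
    # argmax over the C per-class sigmoid probabilities
    label = max(range(len(probs)), key=lambda c: probs[c])
    return t - d_s, t + d_e, label, probs[label]


# Hypothetical outputs at t = 723 s (12:03) for C = 3 classes.
start, end, label, score = decode_moment(723.0, 0.5, 1.5, [0.05, 0.91, 0.10])
print(start, end, label, score)
```

In a full pipeline the score would then be compared against a confidence threshold to decide whether the moment is background.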

Why this works so well: The key is the Transformer encoder. It captures long-range temporal context, so each moment's prediction is informed by what's happening hundreds of time steps away. A convolution with a kernel of size 3 only sees 3 time steps. A local self-attention window of size 19 sees 19 steps, and at the 5th pyramid level (16x downsampled), those 19 steps span 19 × 16 = 304 original time steps. The multiscale pyramid means short actions are detected at fine levels and long actions at coarse levels.

What happens when inputs degrade: ActionFormer operates on pre-extracted features, not raw video. If you replace I3D features with weaker TSN features, mAP drops from 66.8% to ~52%. The feature backbone quality is the single biggest factor. If the video is very long (10,000+ time steps), the fixed-length cropping during training means the model has only seen windows — but at inference, the full sequence is processed (no positional encoding), so it generalizes. With very few actions per video (ActivityNet: 1.5 avg), the extreme class imbalance hurts more than on THUMOS14 (15+ actions per video).

Input

Pre-extracted features X = {x₁, ..., x_T} from I3D, SlowFast, or R(2+1)D. Not raw frames.

↓

Encoder

Multiscale Transformer: project + L blocks with local self-attention + 2x downsampling → feature pyramid Z.

↓

Decoder

Shared 1D conv heads on every pyramid level: classify every moment + regress boundaries.

↓

Post-process

Decode (t, d_s, d_e) → (start, end, label). Soft-NMS to remove duplicates.

How does ActionFormer represent its output, and how are action boundaries decoded?

Every time step outputs class probabilities + distances to onset/offset; boundaries are decoded as start = t - d_s, end = t + d_e
The model outputs a set of proposal segments ranked by confidence
Each anchor window is classified and refined using bounding box regression

Chapter 2: The Architecture

ActionFormer follows a clean encoder-decoder design. The encoder is a multiscale Transformer that builds a temporal feature pyramid. The decoder is a lightweight convolutional network with shared heads.

Projection Layer

The input features X = {x₁, ..., x_T} (each x_t ∈ R^{D_in}, e.g., D_in = 2048 for I3D) are projected to D = 512 dimensions using a shallow 1D convolutional network with ReLU activation:

Z⁰ = [E(x₁), E(x₂), ..., E(x_T)]^T ∈ R^{T×D}

Adding convolutions before the Transformer was found to help incorporate local context and stabilize training.

Transformer Encoder (L = 7 blocks)

Each block applies local multi-headed self-attention (MSA) followed by an MLP, with LayerNorm before each and residual connections after:

Z̄^l = α^l · MSA(LN(Z^{l-1})) + Z^{l-1}

Ẑ^l = ᾱ^l · MLP(LN(Z̄^l)) + Z̄^l

Z^l = ↓(Ẑ^l)

where α^l and ᾱ^l are per-channel learnable scaling factors, and ↓ is optional 2x downsampling via strided depthwise 1D convolution. The first 2 blocks operate at full resolution T; the remaining 5 blocks downsample by 2x each, creating pyramid levels at T, T/2, T/4, T/8, T/16, T/32.

Frozen vs. Trained: Video feature backbone (I3D, SlowFast): frozen — features are pre-extracted offline. Projection convolutions: trained. All 7 Transformer blocks: trained. Classification head: trained, shared across pyramid levels. Regression head: trained, shared across pyramid levels. No positional encoding (ablation showed it hurts). Total model is lightweight — the heavy lifting was done by the pre-extracted features.

Decoder

Two separate but architecturally identical convolutional heads are attached to every pyramid level:

Classification head: 3 layers of 1D conv (kernel=3, LayerNorm on first 2, ReLU) → sigmoid → C action probabilities per time step.
Regression head: Same architecture → ReLU at output (distances are positive) → (d_s, d_e) per time step.

Both heads share weights across all pyramid levels. A coarser level naturally has larger regression targets (longer actions live at coarser resolutions), so the regression range is normalized by the feature stride of each level. This normalization is crucial: without it, the regression head would need to output values spanning from 2 time steps (short action at the finest level) to 500+ time steps (long action at the coarsest level), an impractical dynamic range for a shared network. With stride normalization, all targets fall in the same manageable range.
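To make the dynamic-range argument concrete, here is a small sketch of stride normalization. The helper names and the two example actions are hypothetical; the only assumption taken from the text is that targets are divided by the level's stride.

```python
def normalize_targets(d_s: float, d_e: float, stride: int):
    """Express onset/offset distances in units of the level's stride."""
    return d_s / stride, d_e / stride


def denormalize_outputs(n_s: float, n_e: float, stride: int):
    """Map the head's level-relative outputs back to input time steps."""
    return n_s * stride, n_e * stride


# Hypothetical actions: a 2-step action at stride 1, a 500-step action at stride 32.
short = normalize_targets(1.0, 1.0, stride=1)
long_ = normalize_targets(250.0, 250.0, stride=32)
print(short, long_)  # both are single-digit values after normalization
```

The raw targets differ by a factor of 250, while the normalized ones differ by less than a factor of 8, which is what lets one shared head serve every level.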

ActionFormer Pipeline

Pre-extracted features flow through projection, Transformer blocks with downsampling, and shared decoder heads at each pyramid level.

Why does ActionFormer share the classification and regression head weights across all pyramid levels?

Sharing weights reduces parameters and enforces that the same detection logic works at all temporal scales; the pyramid already handles scale differences through its multi-resolution structure
Because each pyramid level has the same number of time steps
To allow gradient flow between levels during backpropagation

Chapter 3: The Temporal Feature Pyramid

The feature pyramid is ActionFormer's secret weapon. It elegantly solves the problem of variable-duration actions by distributing different temporal scales across different levels.

Why a Pyramid?

A "long jump" lasts 3 seconds. A "cooking" activity lasts 3 minutes. If you only look at the finest temporal resolution, detecting a 3-minute activity requires a receptive field of 180 time steps (at 1 fps). That's impractical with local attention. But at 32x downsampled resolution, those 180 steps become just 6 — easily captured by a window of 19.

The Structure

With L = 7 Transformer blocks and 2x downsampling on the last 5, the pyramid has 6 levels (the two full-resolution blocks feed a single stride-1 level):

Level	Resolution	Stride	Local Window	Effective Range	Regression Range
Z¹	T	1	19	19 steps	[0, 4)
Z²	T/2	2	19	38 steps	[4, 8)
Z³	T/4	4	19	76 steps	[8, 16)
Z⁴	T/8	8	19	152 steps	[16, 32)
Z⁵	T/16	16	19	304 steps	[32, 64)
Z⁶	T/32	32	19	608 steps	[64, ∞)

Each level specializes in a different duration range. Short actions (a few seconds) are detected at levels Z¹-Z². Long actions (several minutes) are detected at Z⁵-Z⁶. The regression range is tied to the stride and doubles with each level after the first.
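The stride, sequence length, and effective range of each level follow from simple arithmetic. The sketch below assumes one pyramid level per distinct stride (1 through 32) and a sequence length divisible by 32; `pyramid_levels` is a hypothetical helper, not part of any released code.

```python
def pyramid_levels(T: int, num_downsample: int = 5, window: int = 19):
    """List (stride, length, effective_range) for each pyramid level."""
    levels = []
    for k in range(num_downsample + 1):
        stride = 2 ** k
        length = T // stride              # assumes T is divisible by the stride
        levels.append((stride, length, window * stride))
    return levels


for stride, length, reach in pyramid_levels(2304):
    print(f"stride {stride:2d}: {length:4d} steps, window spans {reach} input steps")
```

With T = 2304 (the THUMOS14 maximum sequence length in the training table), the coarsest level has 72 steps and its 19-step window spans 608 input steps.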

Why 2x downsampling, not 4x or 8x? 2x downsampling is the sweet spot. It provides fine-grained scale coverage (every scale from a few seconds to 10+ minutes) without too many or too few levels. The design directly mirrors FPN and FCOS from object detection, adapted to 1D. Ablation confirms: 2x is optimal on THUMOS14 (66.8% avg mAP vs. 65.1% with 4x).

Downsampling implementation: A strided depthwise 1D convolution with stride=2. Depthwise means each channel is convolved independently with its own kernel; mixing across channels is left to the surrounding layers. This is more expressive than simple average pooling (which was also tested and slightly worse), while being efficient. The convolution learns what information to preserve when halving the resolution.

Temporal Feature Pyramid

The pyramid has 6 levels at decreasing temporal resolutions. Short actions are detected at fine levels (top), long actions at coarse levels (bottom). The orange boxes show the effective temporal range at each level.

How does the temporal feature pyramid handle actions of vastly different durations?

By using different model weights for short and long actions
By running the model multiple times at different temporal resolutions
Each pyramid level naturally specializes in a different duration range because 2x downsampling at each level doubles the effective temporal receptive field, with regression ranges scaled accordingly

Chapter 4: Local Self-Attention

Standard (global) self-attention computes similarity between every pair of time steps. For a video with T = 2048 time steps, that's T² = 4 million pairs. This is O(T²D) in both time and memory — prohibitively expensive for long videos.

ActionFormer's solution: local self-attention. Each time step only attends to its W nearest neighbors (window size W = 19). This reduces complexity from O(T²D) to O(W·T·D), and since W is a small constant (19 << T), this is effectively O(TD).
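A quick count of query-key pairs shows the size of the saving. This is only an illustrative sketch; `attention_pairs` is a hypothetical helper, and windows are assumed to be clipped at the sequence ends.

```python
from typing import Optional


def attention_pairs(T: int, window: Optional[int] = None) -> int:
    """Count query-key pairs; window=None means global attention."""
    if window is None:
        return T * T
    half = window // 2
    # Each step t attends to [t - half, t + half], clipped to the sequence.
    return sum(min(T - 1, t + half) - max(0, t - half) + 1 for t in range(T))


T = 2048
global_pairs = attention_pairs(T)
local_pairs = attention_pairs(T, window=19)
print(global_pairs, local_pairs, round(global_pairs / local_pairs))
```

For T = 2048 the local variant touches roughly a hundredth of the pairs, and the ratio keeps growing linearly with T.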

The Mechanism

For each time step t, self-attention is computed only within the window [t − ⌊W/2⌋, t + ⌊W/2⌋] (9 steps on either side for W = 19). The queries, keys, and values are computed as usual:

Q = Z^l W_Q, K = Z^l W_K, V = Z^l W_V

S_t = softmax(Q_t · K_{[t−⌊W/2⌋ : t+⌊W/2⌋]}^T / √D_q) · V_{[t−⌊W/2⌋ : t+⌊W/2⌋]}
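The windowed attention formula can be written out directly. The following is a deliberately slow, single-head, pure-Python sketch meant for reading, under the assumption that windows are simply clipped at the sequence ends; a real implementation would be batched, multi-headed, and vectorized.

```python
import math
from typing import List

Vector = List[float]


def local_attention(q: List[Vector], k: List[Vector], v: List[Vector],
                    window: int) -> List[Vector]:
    """Single-head attention where step t only sees [t - W//2, t + W//2]."""
    T, d = len(q), len(q[0])
    half = window // 2
    out = []
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        # Scaled dot-product scores against keys inside the window only.
        scores = [sum(a * b for a, b in zip(q[t], k[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j][c] for w, j in zip(weights, range(lo, hi)))
                    for c in range(len(v[0]))])
    return out


# With all-zero queries the softmax is uniform, so each output is a window mean.
values = [[0.0], [1.0], [2.0], [3.0], [4.0]]
result = local_attention([[0.0]] * 5, [[1.0]] * 5, values, window=3)
print(result)
```

Setting `window` to at least twice the sequence length recovers ordinary global attention, which makes the two modes easy to compare.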

The key observation: temporal context beyond a certain range is less helpful for action localization. Whether something happened 5 minutes ago rarely matters for detecting the current action. But what happened 10 seconds ago matters a lot. Local attention captures exactly this inductive bias.

The ablation confirms this: replacing local attention (W=19) with global attention actually decreases average mAP by 0.9% on THUMOS14 (from 66.8% to 65.9%). Global attention dilutes the model's focus with irrelevant distant context and increases computation. Local attention is both cheaper and better.

Local attention + pyramid = global context: Local attention alone would limit the model to a 19-step temporal range. But on the pyramid, the same 19-step window at level Z⁵ (stride 16) covers 19 × 16 = 304 original time steps. At Z⁶ (stride 32), it covers 608 steps. So the model has both fine-grained local context (fine levels) and broad global context (coarse levels), without ever computing global attention.

No Positional Encoding

Surprisingly, ActionFormer works better without positional encoding. The ablation shows that adding sinusoidal or learned positional encodings slightly hurts performance. The reason: the projection convolutions and strided depthwise convolutions in the Transformer blocks already leak positional information (convolutions are inherently position-aware through their structure). Adding explicit positional encoding is redundant and slightly harmful.

This also means ActionFormer can process any length video at inference time — there's no positional encoding that was trained for a specific sequence length.

Engineering decision: learnable scaling. Each Transformer block has learnable per-channel scaling factors α^l and ᾱ^l (initialized to small values). These are multiplied with the MSA and MLP outputs before the residual connection: Z̄^l = α^l · MSA(...) + Z^{l-1}. This stabilizes early training by starting with near-identity residual connections and gradually increasing the Transformer's contribution. Without these factors, training is unstable for deep pyramids.

Multi-head attention details: ActionFormer uses 8 attention heads with D = 512 dimensional features, so each head operates on 64 dimensions. The local window is applied identically across all heads — no head attends globally. The MLP in each Transformer block follows the standard design: Linear(D, 4D) → GELU → Linear(4D, D), expanding to 2048 intermediate dimensions.

Local vs. Global Self-Attention

Left: global attention (O(T²) pairs). Right: local attention with window W=5 (O(W·T) pairs). The selected time step is shown in orange, together with the positions it attends to.

Why does ActionFormer NOT use positional encoding?

The projection convolutions and strided depthwise convolutions already encode positional information implicitly, making explicit positional encoding redundant and slightly harmful; omitting it also allows variable-length inference
Because the model only processes short videos where position doesn't matter
To reduce the number of parameters in the model

Chapter 5: The Decoder Heads

The decoder is deliberately simple — the heavy lifting is done by the Transformer encoder. Two lightweight 1D convolutional heads are applied to every time step at every pyramid level.

Classification Head

Three layers of 1D convolutions (kernel size 3, 512 channels), with LayerNorm on the first two layers and ReLU activation. The final layer outputs C channels (one per action category) followed by sigmoid. This is a multi-label formulation: each category is an independent binary decision. A time step can theoretically be labeled with multiple actions simultaneously (though this is rare in practice).

Regression Head

Same architecture as classification. Output: 2 channels (d_s, d_e) followed by ReLU to ensure positive distances. The distances are normalized by the stride of the current pyramid level, so the network always predicts in "level-relative" units regardless of the actual temporal scale.

Why LayerNorm in the heads? Adding LayerNorm to the first two conv layers of the heads boosted average mAP by 4.1 points on THUMOS14 (62.7% → 66.8% in the ablation table). Without it, the features from different pyramid levels have different magnitudes (deeper levels have larger activations due to more self-attention layers). LayerNorm normalizes these, allowing the shared heads to work consistently across all levels.

The Training Loss

The total loss per video has two terms:

L = ∑_t (L_cls + λ_reg · 1_{c_t} · L_reg) / T₊

Where:

L_cls = Focal Loss for C-way binary classification. Focal loss down-weights easy negatives (background moments that the model is already confident about), focusing training on hard cases near action boundaries. Parameters: γ = 2.0, α = 0.25 (the weight on positives, so negatives are weighted by 1 − α = 0.75).
L_reg = DIoU Loss for distance regression, only applied at positive time steps (1_{c_t} = 1 if t is within an action). DIoU combines intersection-over-union with a distance penalty, giving gradients even when predicted and target segments don't overlap.
T₊ = total number of positive samples across the video. Normalizing by T₊ prevents batches with many actions from dominating.
λ_reg = 1.0 by default (equal weight for classification and regression).

Focal loss is essential: A 30-minute video at 1 fps has 1800 time steps. If there are 10 action instances averaging 5 seconds each, that's 50 positive steps and 1750 negative steps (97% background). Without focal loss, the massive negative class dominates the gradient, and the model learns to predict "background" everywhere. Focal loss with γ=2 reduces the loss contribution of well-classified negatives by a factor of ~100 (a negative scored at p = 0.1 is scaled by 0.1² = 0.01), letting the model focus on the tricky moments near action boundaries.
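The down-weighting effect is easy to check numerically. The sketch below implements the focal modulating term for a single (time step, class) pair and leaves out the α class-balancing weight for brevity; the function names are illustrative only.

```python
import math


def bce(p: float, y: int) -> float:
    """Binary cross-entropy for predicted probability p and label y."""
    return -math.log(p if y == 1 else 1.0 - p)


def focal_unweighted(p: float, y: int, gamma: float = 2.0) -> float:
    """Focal loss without the alpha weight: (1 - p_t)^gamma * BCE."""
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true label
    return (1.0 - p_t) ** gamma * bce(p, y)


# An easy negative (p = 0.1) versus a hard negative (p = 0.6).
for p in (0.1, 0.6):
    ratio = focal_unweighted(p, 0) / bce(p, 0)
    print(f"p={p}: focal keeps {ratio:.2f} of the cross-entropy loss")
```

The easy negative keeps about 1% of its cross-entropy loss while the hard one keeps 36%, so the gradient budget shifts toward uncertain moments.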

Why DIoU Loss for Regression?

Standard L1 or L2 regression on (d_s, d_e) has a problem: when the predicted segment and the ground truth segment don't overlap at all, L1/L2 still gives a gradient, but it doesn't reflect how close they are in terms of temporal overlap. DIoU (Distance IoU) loss combines two signals:

IoU component: Measures the temporal overlap between the predicted and ground truth segments. Ranges from 0 (no overlap) to 1 (perfect match).
Distance penalty: Penalizes the normalized distance between the centers of the two segments. This gives useful gradients even when IoU = 0 (no overlap), guiding the prediction toward the target.

The regression loss is only applied to positive samples (time steps within an action). Background moments have no meaningful (d_s, d_e) target.
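A 1D version of the DIoU loss is short enough to write out. This sketch works on generic (start, end) segments and assumes the usual definition, 1 − IoU plus the squared center distance divided by the squared length of the smallest enclosing span; the `eps` term and function name are assumptions for illustration.

```python
def diou_loss_1d(pred, target, eps: float = 1e-8) -> float:
    """1 - IoU + squared center distance / squared enclosing length."""
    (ps, pe), (ts, te) = pred, target
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    iou = inter / (union + eps)
    enclose = max(pe, te) - min(ps, ts)          # smallest span covering both
    center_gap = 0.5 * (ps + pe) - 0.5 * (ts + te)
    return 1.0 - iou + (center_gap ** 2) / (enclose ** 2 + eps)


target = (20.0, 30.0)
far = diou_loss_1d((0.0, 10.0), target)    # no overlap, far away
near = diou_loss_1d((8.0, 18.0), target)   # no overlap, but closer
print(far, near)
```

Both predictions have IoU = 0, yet the closer one gets a lower loss, which is the extra signal the distance penalty provides.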

Tensor shapes through the decoder: At each pyramid level l, the features Z^l ∈ R^{T^l×D} are fed through the classification head: Conv1d(D, D, 3) + LN + ReLU → Conv1d(D, D, 3) + LN + ReLU → Conv1d(D, C, 3) + sigmoid → [T^l, C]. Regression head: same but → Conv1d(D, 2, 3) + ReLU → [T^l, 2]. Total across all levels: ∑ T^l predictions. Both heads share weights across levels.

Why is focal loss essential for ActionFormer's training?

Because over 97% of time steps are background, and focal loss down-weights the massive number of easy negatives to focus learning on hard cases near action boundaries
Because focal loss produces smoother gradients than cross-entropy
Because focal loss supports multi-label classification

Chapter 6: Training Details

ActionFormer's encoder and decoder are trained jointly with the Adam optimizer on the combined classification + regression loss, while the video feature backbone stays frozen. Several training tricks are critical for performance.

Center Sampling

Not every time step within an action is labeled as positive. Only time steps near the center of the action (within α = 1.5 strides of the center) are considered positive for training. This has two benefits:

It focuses the model on predicting from the most informative position (the center, where both boundaries are equally far away).
It reduces confusion at action boundaries, where the model might be uncertain about which action a time step belongs to.

Center sampling adds +1.4% average mAP on THUMOS14. It doesn't affect inference — at test time, every time step still makes a prediction.
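Center sampling amounts to a simple interval test per pyramid level. The sketch below assumes that step i at a level sits at time i × stride and that positives must also lie inside the action; both conventions, and the helper name, are assumptions made for illustration.

```python
def positive_steps(start: float, end: float, stride: int,
                   num_steps: int, radius: float = 1.5):
    """Indices at one pyramid level that count as positives for an action."""
    center = 0.5 * (start + end)
    lo = max(start, center - radius * stride)
    hi = min(end, center + radius * stride)
    # Step i at this level is assumed to sit at time i * stride.
    return [i for i in range(num_steps) if lo <= i * stride <= hi]


# Hypothetical action spanning [100, 140] seen at a stride-4 level with 64 steps.
sampled = positive_steps(100, 140, stride=4, num_steps=64)
everything = positive_steps(100, 140, stride=4, num_steps=64, radius=1e9)
print(sampled, len(everything))
```

Only the three steps around the center are kept as positives, compared with eleven when every step inside the action is used.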

Training Configuration

Setting	THUMOS14	ActivityNet 1.3	EPIC-Kitchens 100
Features	I3D (2-stream)	I3D / R(2+1)D+TSP	SlowFast
Feature dim	2048	2048	2304
Max seq length	2304	192 (downsampled)	2304
Attention window	19	11	9
Epochs	30	15	30
Optimizer	Adam	Adam	Adam
Learning rate	1e-4	1e-4	1e-4
Warmup	5 epochs	5 epochs	5 epochs

Training is fast: THUMOS14 has only 200 training videos. At 30 epochs with pre-extracted features, training takes a few hours on a single GPU. No multi-day training runs, no 8-GPU clusters. The expensive part is feature extraction (I3D, SlowFast), which is done offline once. This is a major practical advantage: you can iterate on the model quickly.

Warmup is critical: Without the 5-epoch learning rate warmup, training often diverges — the Transformer's self-attention is sensitive to initial weight scales. The learnable scaling factors α help, but warmup provides additional stability. This was also observed in ViT and DeiT for vision Transformers.

Variable-Length Handling

Videos vary wildly in duration: THUMOS14 videos can be 2-6 minutes, ActivityNet videos can be 5-30+ minutes. During training, sequences are padded or cropped to a fixed maximum length, with proper attention masking to prevent padded positions from influencing the output. This is equivalent to training with sliding windows.
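The padding-or-cropping step can be sketched as follows. The random crop offset and zero padding are assumptions for illustration; the essential part is the boolean mask that marks which positions hold real features.

```python
import random
from typing import List, Tuple


def pad_or_crop(feats: List[list], max_len: int, rng: random.Random
                ) -> Tuple[List[list], List[bool]]:
    """Return (features, mask) of length max_len; mask is False on padding."""
    T = len(feats)
    if T >= max_len:
        offset = rng.randint(0, T - max_len)   # random crop window (assumed)
        return feats[offset:offset + max_len], [True] * max_len
    dim = len(feats[0])
    padding = [[0.0] * dim for _ in range(max_len - T)]
    return feats + padding, [True] * T + [False] * (max_len - T)


rng = random.Random(0)
short_feats, short_mask = pad_or_crop([[1.0, 2.0]] * 3, max_len=5, rng=rng)
long_feats, long_mask = pad_or_crop([[float(i)] for i in range(10)], max_len=4, rng=rng)
print(short_mask, len(long_feats))
```

The mask would then be passed to the attention layers so that padded positions are excluded from every softmax.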

An important finding: varying the maximum input sequence length during training has little impact on performance. The model generalizes to different lengths at inference because (a) there's no positional encoding, and (b) local self-attention doesn't depend on absolute position — only relative context within the window matters.

Dataset characteristics drive design choices: THUMOS14 has ~15 actions per video in short videos → many positive samples, balanced training, window size 19 works well. ActivityNet has ~1.5 actions per video in extremely long videos → extreme imbalance, features downsampled to fixed length 192, window size 11 (less context needed since features are already compressed). EPIC-Kitchens has ~128 actions per video in long egocentric sequences → many overlapping fine-grained actions, window size 9 (egocentric actions are local). The attention window size is the main per-dataset hyperparameter.

What is center sampling and why does it help?

Only time steps near the center of an action are labeled positive during training, which focuses the model on the most informative position and reduces boundary confusion, adding +1.4% mAP
It samples training batches from the center of the dataset
It centers the feature representations using batch normalization

Chapter 7: Inference & Decoding

At inference, the full video sequence is fed through ActionFormer in a single forward pass. Since there's no positional encoding, the model handles any video length.

Step 1: Forward Pass

The pre-extracted features X = {x₁, ..., x_T} go through the encoder, producing the pyramid Z = {Z¹, ..., Z^L}. The shared heads produce (p(a_t), d_s^t, d_e^t) for every time step t at every level l.

Step 2: Decode Candidates

Each time step t at level l produces a candidate action:

a_t = argmax p(a_t), s_t = t − d_s^t, e_t = t + d_e^t

With 6 pyramid levels and a video of T = 2048 steps, this produces roughly T + T/2 + T/4 + ... + T/32 = ~4000 candidates. Most will be background (low p(a_t)) and are filtered by a confidence threshold.
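The candidate count is a short geometric sum. The sketch below assumes one set of predictions per distinct stride from 1 to 32; the threshold value and helper names are hypothetical.

```python
def count_candidates(T: int, strides=(1, 2, 4, 8, 16, 32)) -> int:
    """Total number of per-moment predictions across all pyramid levels."""
    return sum(T // s for s in strides)


def filter_by_score(candidates, threshold: float):
    """Keep (start, end, label, score) tuples whose score clears the threshold."""
    return [c for c in candidates if c[3] >= threshold]


total = count_candidates(2048)
kept = filter_by_score([(5.0, 9.0, 2, 0.83), (40.0, 41.0, 0, 0.01)], threshold=0.05)
print(total, kept)
```

The total stays below 2T for any number of halving levels, so the decoding cost is linear in video length.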

Step 3: Soft-NMS

Multiple candidates may overlap in time (especially from adjacent time steps and adjacent pyramid levels). Soft-NMS suppresses overlapping detections by decaying their confidence scores rather than hard-deleting them. This is gentler than standard NMS and preserves detections of overlapping actions (e.g., simultaneous "running" and "dribbling").

Why Soft-NMS, not standard NMS? Standard NMS with IoU threshold 0.5 would delete every detection that overlaps a higher-scoring one by more than 0.5 IoU, keeping only the highest-confidence one. But in TAL, actions can genuinely overlap in time (you can "talk" while "walking"). Soft-NMS decays overlapping scores by a Gaussian function: score → score · exp(−IoU² / σ). This preserves truly overlapping actions while still suppressing duplicate detections of the same action.
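Gaussian Soft-NMS over 1D segments fits in a few lines. This is a minimal sketch of the decay rule quoted above, applied to detections of a single class; the values of `sigma` and `min_score` are assumptions, not settings taken from the paper.

```python
import math
from typing import List, Tuple

Detection = Tuple[float, float, float]  # (start, end, score)


def tiou(a: Detection, b: Detection) -> float:
    """Temporal IoU of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(dets: List[Detection], sigma: float = 0.5,
             min_score: float = 0.001) -> List[Detection]:
    """Repeatedly keep the best detection and decay the scores of the rest."""
    pool = list(dets)
    kept = []
    while pool:
        best = max(pool, key=lambda d: d[2])
        pool.remove(best)
        kept.append(best)
        # Decay, rather than delete, everything that overlaps the winner.
        pool = [(s, e, sc * math.exp(-tiou(best, (s, e, sc)) ** 2 / sigma))
                for s, e, sc in pool]
        pool = [d for d in pool if d[2] >= min_score]
    return kept


dets = [(10.0, 20.0, 0.9), (11.0, 21.0, 0.8), (40.0, 50.0, 0.7)]
final = soft_nms(dets)
print(final)
```

The near-duplicate at (11, 21) survives but drops from 0.8 to about 0.21, while the disjoint detection at (40, 50) keeps its score.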

Multi-Level Decoding

Each pyramid level independently produces detections. A short "kick" action detected at level Z¹ might also produce a (weaker) detection at level Z³. These cross-level duplicates are handled naturally by Soft-NMS: the higher-confidence detection survives, and the duplicate's score gets decayed. This is why a single round of Soft-NMS suffices — it handles both within-level and cross-level duplicates simultaneously.

The regression targets at each level are restricted to a predefined range. The finest level Z¹ (stride 1) only predicts actions with d_s + d_e in [0, 4) time steps. The coarsest level (stride 32) takes the largest range, which is open-ended. This prevents a fine-resolution level from trying to predict a 5-minute action (which would require huge regression values) and vice versa.

Score fusion (optional): On ActivityNet, ActionFormer can optionally fuse its per-moment action scores with external video-level classification scores (from a separate model). This helps because ActivityNet videos typically contain only 1-2 actions, making video-level context useful. On THUMOS14 (15+ actions per video), this helps less. With score fusion, ActionFormer reaches 36.6% average mAP on ActivityNet, outperforming all methods using the same features.

Computational Cost

At inference, the dominant cost is the Transformer encoder's self-attention. With local attention (window W = 19) and L = 7 blocks, the cost per block is O(W · T^l · D) where T^l is the sequence length at level l. Since T^l halves at each downsampling level, the per-block costs form a geometric series, so the total is O(W · T · D) up to a small constant factor, which is linear in video length. A 30-minute video processed in a single forward pass takes well under a second on a modern GPU, making ActionFormer practical for real-world applications.

Inference Pipeline

Watch how raw predictions from multiple pyramid levels are decoded into candidate actions, then refined by Soft-NMS into final detections.

Why does ActionFormer use Soft-NMS instead of standard NMS?

Because actions can genuinely overlap in time (e.g., talking while walking), and Soft-NMS preserves overlapping detections by decaying scores instead of deleting them
Because Soft-NMS is faster to compute
Because standard NMS requires a fixed IoU threshold that is hard to tune

Chapter 8: Results

ActionFormer establishes new state of the art on all three major TAL benchmarks, surpassing both two-stage and single-stage methods by large margins.

THUMOS14 (20 action categories, 413 videos)

Method	Type	mAP@0.5 ↑	mAP@0.7 ↑	Avg mAP ↑
BMN	Two-stage	38.8%	20.5%	38.5%
G-TAD	Two-stage	40.3%	23.4%	39.3%
MUSES	Two-stage	56.9%	31.0%	—
AFSD	Single-stage	55.5%	31.1%	52.0%
TadTR	Single-stage	49.2%	26.3%	46.6%
ActionFormer	Single-stage	71.0%	43.9%	66.8%

ActionFormer achieves 71.0% mAP at tIoU=0.5: +15.5 absolute percentage points over the best prior single-stage method in the table (AFSD at 55.5%) and +14.1 over the best two-stage method (MUSES at 56.9%).

EPIC-Kitchens 100 (Egocentric, 100 hours)

Method	Verb Avg mAP ↑	Noun Avg mAP ↑
BMN	8.4%	6.5%
G-TAD	9.4%	8.4%
ActionFormer	23.5%	21.9%

On the challenging egocentric dataset, ActionFormer outperforms the stronger baseline (G-TAD) by +14.1 average mAP on verbs and +13.5 on nouns.

Key Ablations (THUMOS14)

Change	Avg mAP	Δ
Full ActionFormer	66.8%	—
Replace Transformer with 1D ConvNet	52.9%	−13.9%
Remove LayerNorm in heads	62.7%	−4.1%
Remove center sampling	65.4%	−1.4%
Add positional encoding	66.6%	−0.2%
Global attention (no local window)	65.9%	−0.9%

The Transformer is the main course: Replacing the Transformer encoder with a 1D convolutional network (matched in layers and parameters) drops performance by 13.9% — from 66.8% to 52.9%. This single change accounts for the vast majority of ActionFormer's advantage. The convolutional baseline can only capture local temporal patterns; the Transformer's self-attention captures the broader context that distinguishes actions from background.

Why is the gap largest on EPIC-Kitchens? On EPIC-Kitchens (egocentric cooking), ActionFormer outperforms baselines by 13.5 average mAP points or more. This is much larger than the gap on ActivityNet (~1-2 points). Possible reasons: (1) EPIC-Kitchens has ~128 actions per video, so there are many interleaved, overlapping actions that benefit from the Transformer's ability to model temporal relationships between them. (2) Egocentric video has fast camera motion, making boundary detection harder, and the broader context from self-attention helps disambiguate. (3) ActivityNet has very few actions per video (1.5 avg), limiting the Transformer's advantage in modeling inter-action dependencies.

Results Comparison — THUMOS14

Average mAP on THUMOS14 [0.3:0.1:0.7]. ActionFormer (rightmost) vastly outperforms all prior methods.

According to the ablation, what single change causes the largest performance drop?

Replacing the Transformer encoder with a 1D ConvNet: a drop of 13.9% average mAP, confirming that self-attention's temporal context modeling is the core contribution
Removing center sampling
Adding positional encoding

Chapter 9: Connections

ActionFormer bridges ideas from object detection, NLP sequence modeling, and video understanding into a clean, unified design.

Relation to FCOS (Object Detection)

ActionFormer is essentially FCOS adapted to 1D. FCOS classifies every spatial location and regresses distances to the bounding box edges. ActionFormer classifies every temporal location and regresses distances to action boundaries. The feature pyramid, center sampling, and anchor-free design all come directly from the FCOS lineage (FPN → RetinaNet → FCOS → ActionFormer).

The mapping is almost one-to-one: FPN's 2D feature maps become 1D temporal sequences. FCOS's 2D center sampling becomes 1D center sampling. RetinaNet's focal loss is adopted without changes. Even the regression normalization by stride follows FCOS exactly. This direct inheritance is a strength — decades of object detection research transfers cleanly to the temporal domain.

Relation to DETR / TadTR

TadTR (a concurrent work) uses a DETR-style set prediction approach for TAL: learned object queries, Hungarian matching, no feature pyramid. ActionFormer takes the opposite approach: no learned queries, no bipartite matching, just dense per-moment prediction with NMS. ActionFormer significantly outperforms TadTR (66.8% vs 46.6% on THUMOS14), suggesting that the dense prediction paradigm is more natural for TAL than set prediction.

Relation to Swin Transformer

ActionFormer's local self-attention within a hierarchical pyramid directly parallels Swin Transformer's window attention with shifted windows in 2D. Both use local attention for efficiency and rely on the hierarchical structure for global context. ActionFormer skips the "shifted window" trick (unnecessary in 1D with the pyramid providing cross-level communication).

Influence on Downstream Work

TemporalMaxer (2023): Showed that replacing ActionFormer's local attention with simple max-pooling still achieves competitive results, questioning how much of the gain comes from attention vs. the pyramid. This sparked an important debate: is the Transformer truly necessary, or is the multiscale architecture the real innovation?
ActionFormer+ / TriDet: Extensions with improved feature aggregation and boundary refinement that build directly on ActionFormer's codebase.
TAL community standard: ActionFormer's codebase became the de facto baseline for TAL research, with most subsequent papers comparing against it. The clean, modular design made it easy to swap components in and out for ablation studies.
Egocentric video understanding: ActionFormer's strong results on EPIC-Kitchens 100 (the largest egocentric action dataset) opened a new direction: applying TAL models to first-person video from AR headsets and wearable cameras.

Pre-extracted features are both a strength and a limitation. By operating on pre-extracted features, ActionFormer avoids the massive cost of end-to-end video model training. But it also means the model can't learn task-specific low-level representations. The quality of the feature backbone (I3D, SlowFast, ViViT) upper-bounds ActionFormer's performance. Future work increasingly moves toward end-to-end training to close this gap.

Cheat Sheet

Aspect	ActionFormer
Input	Pre-extracted video features [T, D_in]
Output	(class, start, end) for each detected action
Encoder	7-block multiscale Transformer, local attn W=19
Decoder	Shared 1D conv heads (3 layers each)
Pyramid	6 levels, 2x downsampling, strides 1–32
Positional encoding	None (intentionally omitted)
Loss	Focal loss (cls) + DIoU loss (reg)
Post-processing	Soft-NMS
Key result	71.0% mAP@0.5 on THUMOS14 (+14.1 points over the best prior result)
Training	30 epochs, single GPU, few hours

The broader lesson: Minimalist design wins. ActionFormer has no proposals, no anchors, no positional encoding, no complex loss function, no multi-stage pipeline. Just a Transformer + pyramid + focal loss + NMS. When the core architecture is right (local self-attention on a multiscale pyramid), the simplest decoder suffices. The 14-point mAP improvement came not from engineering tricks but from choosing the right inductive biases for temporal reasoning.

What is the key architectural parallel between ActionFormer and FCOS?

Both are anchor-free models that classify every location (spatial for FCOS, temporal for ActionFormer) and regress distances to boundaries, using a feature pyramid for multi-scale detection
Both use learned object queries like DETR
Both require ground truth anchor boxes for training

ActionFormer: Localizing Moments of Actions
