Chen-Lin Zhang, Jianxin Wu, Yin Li — Nanjing University / UW-Madison, 2022

ActionFormer: Localizing Moments of Actions

Find when actions happen in untrimmed video using local self-attention on a multiscale temporal feature pyramid. No proposals, no anchors — classify every moment and regress its boundaries in a single shot.

Prerequisites: Self-attention / Transformers + Feature pyramids (FPN) + Object detection basics

Chapter 0: The Problem

You have a 30-minute untrimmed video of a soccer match. Somewhere in there, a player does a "bicycle kick" from 12:03 to 12:05, a "header" from 18:30 to 18:32, and a "tackle" from 22:10 to 22:14. The goal of Temporal Action Localization (TAL) is to find all action instances: their start time, end time, and category label.

This is much harder than action classification (which tells you if a 3-second clip contains a kick — yes or no). TAL must answer: when does the kick start, when does it end, and what is happening at every other moment (background or another action)?

The fundamental challenge: Actions vary wildly in duration. A "long jump" lasts 3 seconds. A "cooking activity" lasts 3 minutes. A model needs to capture temporal context at many scales simultaneously. And in a 30-minute video with only 5 action instances, over 99% of moments are background — extreme class imbalance.

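Prior approaches fall into two camps: two-stage methods that first generate candidate proposals and then classify them, and single-stage methods that rely on pre-defined anchor windows.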

ActionFormer takes the simplest possible approach: classify every single moment in the video as either background or one of C action categories, and regress the distance from that moment to the nearest action boundary. No proposals. No anchors. Just a Transformer that looks at every moment and asks: "Is this an action? If so, how far are its boundaries?"

An important design decision: ActionFormer does not process raw video frames. It operates on pre-extracted features from a frozen video backbone (I3D, SlowFast). Each input "time step" is a feature vector representing a 1-second video clip, not a single frame. This decouples the temporal reasoning from the visual representation, just as FCOS decouples detection from feature extraction.

Full data flow at a glance: Untrimmed video → Pre-extracted clip features X = {x_1, ..., x_T} (e.g., I3D at 1 fps) → 1D convolution projection: [T, D_in] → [T, D] → L Transformer blocks with local self-attention + 2x downsampling → feature pyramid Z = {Z_1, ..., Z_L} at resolutions T, T/2, T/4, ..., T/2^(L-1) → Shared classification head: every moment → C action probabilities (sigmoid, focal loss) → Shared regression head: every moment → (d_s, d_e) distances to action onset/offset (DIoU loss) → Soft-NMS → Final detections.
Temporal Action Localization

A timeline with embedded action instances (colored). The model must detect each action's start, end, and label. Most of the video is background (gray).

What makes temporal action localization harder than action classification?

Chapter 1: The Key Insight

ActionFormer's core insight: treat every moment as an action candidate and let the model decide. Instead of generating proposals (two-stage) or using pre-defined anchor windows (single-stage), just classify every time step and regress its boundaries directly.

This is the temporal equivalent of what FCOS did for object detection: instead of anchor boxes, every spatial location predicts a class and distances to the box edges. ActionFormer does the same thing in 1D: every temporal location predicts a class and distances to the action's onset and offset.

The Representation

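For every time step t in the video, ActionFormer outputs a probability vector p(a_t) over the C action categories, the distance d_s from t to the action's onset, and the distance d_e from t to its offset.

From these, decoding an action is trivial: start = t − d_s, end = t + d_e, label = argmax p(a_t).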
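
The decoding rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `decode_moment` and the example numbers (a moment at second 724 with one-second offsets, echoing the bicycle kick from 12:03 to 12:05) are assumptions made for the example.

```python
def decode_moment(t, d_s, d_e, class_probs):
    """Turn one moment's outputs into (start, end, label, score).

    t           -- time step index of the moment
    d_s, d_e    -- predicted distances to the action's onset and offset
    class_probs -- per-category probabilities p(a_t), one entry per class
    """
    # label = argmax over the class probabilities
    label = max(range(len(class_probs)), key=lambda c: class_probs[c])
    return t - d_s, t + d_e, label, class_probs[label]


# Hypothetical moment at t = 724 s (12:04) inside a kick spanning 12:03-12:05.
start, end, label, score = decode_moment(724, 1.0, 1.0, [0.05, 0.91, 0.10])
print(start, end, label, score)
```

Every moment produces such a tuple; low-scoring ones are later discarded and duplicates are merged by Soft-NMS.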

Why this works so well: The key is the Transformer encoder. It captures long-range temporal context, so each moment's prediction is informed by what's happening hundreds of time steps away. A convolution with kernel size 3 only sees 3 time steps. A local self-attention window of size 19 sees 19 steps, and at a pyramid level with stride 16 (16x downsampled) those 19 steps span 19 × 16 = 304 original time steps. The multiscale pyramid means short actions are detected at fine levels and long actions at coarse levels.
What happens when inputs degrade: ActionFormer operates on pre-extracted features, not raw video. If you replace I3D features with weaker TSN features, mAP drops from 66.8% to ~52%. The feature backbone quality is the single biggest factor. If the video is very long (10,000+ time steps), the fixed-length cropping during training means the model has only seen windows — but at inference, the full sequence is processed (no positional encoding), so it generalizes. With very few actions per video (ActivityNet: 1.5 avg), the extreme class imbalance hurts more than on THUMOS14 (15+ actions per video).
Input: Pre-extracted features X = {x_1, ..., x_T} from I3D, SlowFast, or R(2+1)D. Not raw frames.
Encoder: Multiscale Transformer (projection + L blocks with local self-attention + 2x downsampling) → feature pyramid Z.
Decoder: Shared 1D conv heads on every pyramid level: classify every moment + regress boundaries.
Post-process: Decode (t, d_s, d_e) → (start, end, label). Soft-NMS to remove duplicates.
How does ActionFormer represent its output, and how are action boundaries decoded?

Chapter 2: The Architecture

ActionFormer follows a clean encoder-decoder design. The encoder is a multiscale Transformer that builds a temporal feature pyramid. The decoder is a lightweight convolutional network with shared heads.

Projection Layer

The input features X = {x_1, ..., x_T} (each x_t ∈ R^{D_in}, e.g., D_in = 2048 for I3D) are projected to D = 512 dimensions using a shallow 1D convolutional network E with ReLU activation:

$$Z_0 = [E(x_1), E(x_2), \ldots, E(x_T)]^\top \in \mathbb{R}^{T \times D}$$

Adding convolutions before the Transformer was found to help incorporate local context and stabilize training.

Transformer Encoder (L = 7 blocks)

Each block applies local multi-headed self-attention (MSA) followed by an MLP, with LayerNorm before each and residual connections after:

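$$\bar{Z}_l = \alpha_l \cdot \mathrm{MSA}(\mathrm{LN}(Z_{l-1})) + Z_{l-1}$$
$$\hat{Z}_l = \bar{\alpha}_l \cdot \mathrm{MLP}(\mathrm{LN}(\bar{Z}_l)) + \bar{Z}_l$$
$$Z_l = \downarrow(\hat{Z}_l)$$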

where α_l and ᾱ_l are per-channel learnable scaling factors, and ↓ is optional 2x downsampling via strided depthwise 1D convolution. The first 2 blocks operate at full resolution T; the remaining 5 blocks downsample by 2x each, creating pyramid levels at T, T/2, T/4, T/8, T/16, T/32.
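
The block structure (pre-LayerNorm, scaled residual branches, optional downsampling) can be sketched with NumPy as follows. This is a sketch under stated assumptions: `msa` and `mlp` are passed in as plain callables, the stand-ins used in the demo are placeholders rather than real attention or a real MLP, LayerNorm is shown without learnable affine parameters, and downsampling is mimicked by taking every second step.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each time step's feature vector (no learnable affine here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def transformer_block(Z, msa, mlp, alpha, alpha_bar, downsample=None):
    """One block: scaled attention branch, scaled MLP branch, optional downsampling."""
    Z_bar = alpha * msa(layer_norm(Z)) + Z              # attention branch + residual
    Z_hat = alpha_bar * mlp(layer_norm(Z_bar)) + Z_bar  # MLP branch + residual
    return downsample(Z_hat) if downsample is not None else Z_hat


# Hypothetical stand-ins, only to exercise the data flow.
rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 4))                  # [T, D] with T = 16, D = 4
mix = rng.normal(size=(4, 4))
msa = lambda x: x @ mix                       # placeholder, NOT real attention
mlp = lambda x: np.tanh(x @ mix)              # placeholder, NOT the real MLP

zeros = np.zeros(4)
same = transformer_block(Z, msa, mlp, zeros, zeros)
halved = transformer_block(Z, msa, mlp, zeros + 0.1, zeros + 0.1,
                           downsample=lambda z: z[::2])
print(same.shape, halved.shape)
```

With both scaling factors at zero the block reduces to the identity, which is why small initial values give near-identity residual connections early in training.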

Frozen vs. Trained: Video feature backbone (I3D, SlowFast): frozen — features are pre-extracted offline. Projection convolutions: trained. All 7 Transformer blocks: trained. Classification head: trained, shared across pyramid levels. Regression head: trained, shared across pyramid levels. No positional encoding (ablation showed it hurts). Total model is lightweight — the heavy lifting was done by the pre-extracted features.

Decoder

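Two separate but architecturally identical convolutional heads are attached to every pyramid level: a classification head that scores each moment for the C action categories, and a regression head that predicts the distances (d_s, d_e) to the action's onset and offset.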

Both heads share weights across all pyramid levels. A coarser level naturally has larger regression targets (longer actions live at coarser resolutions), so the regression range is normalized by the feature stride of each level. This normalization is crucial: without it, the regression head would need to output values spanning from 2 time steps (short action at level 1) to 500+ time steps (long action at level 7) — an impossible dynamic range for a shared network. With stride normalization, all targets are in the same manageable range.
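
A small worked example of stride normalization, as a sketch: the helper names and the sample distances are assumptions chosen for illustration, and the only operation shown is dividing (or multiplying back) by the level's stride.

```python
def normalize_offsets(d_s, d_e, stride):
    """Express onset/offset distances in units of the level's stride."""
    return d_s / stride, d_e / stride


def denormalize_offsets(n_s, n_e, stride):
    """Map level-relative predictions back to original time steps."""
    return n_s * stride, n_e * stride


# A short action seen at a stride-1 level and a long one at a stride-32 level.
short = normalize_offsets(1.0, 1.0, stride=1)
long_ = normalize_offsets(256.0, 256.0, stride=32)
print(short, long_)
```

Raw distances that differ by a factor of 256 end up within a factor of 8 of each other after normalization, which is the kind of range a single shared head can reasonably cover.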

ActionFormer Pipeline

Pre-extracted features flow through projection, Transformer blocks with downsampling, and shared decoder heads at each pyramid level.

Why does ActionFormer share the classification and regression head weights across all pyramid levels?

Chapter 3: The Temporal Feature Pyramid

The feature pyramid is ActionFormer's secret weapon. It elegantly solves the problem of variable-duration actions by distributing different temporal scales across different levels.

Why a Pyramid?

A "long jump" lasts 3 seconds. A "cooking" activity lasts 3 minutes. If you only look at the finest temporal resolution, detecting a 3-minute activity requires a receptive field of 180 time steps (at 1 fps). That's impractical with local attention. But at 32x downsampled resolution, those 180 steps become just 6 — easily captured by a window of 19.

The Structure

With L = 7 Transformer blocks and 2x downsampling on the last 5:

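| Level | Resolution | Stride | Local Window | Effective Range | Regression Range |
|---|---|---|---|---|---|
| Z1 | T | 1 | 19 | 19 steps | [0, 4) |
| Z2 | T | 1 | 19 | 19 steps | [4, 8) |
| Z3 | T/2 | 2 | 19 | 38 steps | [8, 16) |
| Z4 | T/4 | 4 | 19 | 76 steps | [16, 32) |
| Z5 | T/8 | 8 | 19 | 152 steps | [32, 64) |
| Z6 | T/16 | 16 | 19 | 304 steps | [64, 128) |
| Z7 | T/32 | 32 | 19 | 608 steps | [128, ∞) |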

Each level specializes in a different duration range. Short actions (a few seconds) are detected at levels Z1-Z2. Long actions (several minutes) are detected at Z6-Z7. The regression range per level is normalized by the stride, roughly doubling with each level.
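
The stride and effective-range columns follow mechanically from the block layout (2 full-resolution blocks, then 5 blocks that each halve the resolution). A short sketch that reproduces them; the function name and dictionary keys are illustrative choices, not taken from any released code.

```python
def pyramid_levels(num_blocks=7, num_full_res=2, window=19):
    """List stride and effective temporal range for each block's output."""
    levels = []
    stride = 1
    for i in range(num_blocks):
        if i >= num_full_res:
            stride *= 2  # each later block halves the temporal resolution
        levels.append({
            "level": i + 1,
            "stride": stride,
            "effective_range": window * stride,  # window measured in original steps
        })
    return levels


for lvl in pyramid_levels():
    print(lvl)

# A 180-step action shrinks to about 6 steps at stride 32 (ceiling division).
steps_at_coarsest = -(-180 // 32)
```

At stride 32 the 180-step "cooking" example fits comfortably inside a single 19-step window.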

Why 2x downsampling, not 4x or 8x? 2x downsampling is the sweet spot. It provides fine-grained scale coverage (every scale from a few seconds to 10+ minutes) without too many or too few levels. The design directly mirrors FPN and FCOS from object detection, adapted to 1D. Ablation confirms: 2x is optimal on THUMOS14 (66.8% avg mAP vs. 65.1% with 4x).
Downsampling implementation: A strided depthwise 1D convolution with stride=2. Depthwise means each channel is convolved independently with its own filter; mixing across channels is left to the surrounding layers. This is more expressive than simple average pooling (which was also tested and slightly worse), while being efficient. The convolution learns what information to preserve when halving the resolution.
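
A minimal NumPy sketch of a stride-2 depthwise 1D convolution. The kernel size of 3 and the zero padding are assumptions made for the example (the text does not state them), and the loop form is chosen for readability rather than speed.

```python
import numpy as np


def depthwise_conv1d_stride2(Z, kernels):
    """Z: [T, D] features; kernels: [K, D], one length-K filter per channel."""
    T, D = Z.shape
    K = kernels.shape[0]
    pad = K // 2
    Zp = np.pad(Z, ((pad, pad), (0, 0)))        # zero-pad along time only
    out_len = (T + 2 * pad - K) // 2 + 1
    out = np.empty((out_len, D))
    for i in range(out_len):
        # Each channel uses its own filter; channels never mix here.
        out[i] = (Zp[2 * i:2 * i + K] * kernels).sum(axis=0)
    return out


Z = np.arange(32, dtype=float).reshape(8, 4)    # T = 8, D = 4
center_tap = np.zeros((3, 4))
center_tap[1] = 1.0                             # filter that copies the center step
picked = depthwise_conv1d_stride2(Z, center_tap)
averaged = depthwise_conv1d_stride2(Z, np.full((3, 4), 1.0 / 3.0))
print(picked.shape, averaged.shape)
```

With a center-tap filter the layer degenerates to plain subsampling, and with a uniform filter it behaves like average pooling; a learned filter can sit anywhere between or beyond those two.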
Temporal Feature Pyramid

The pyramid has 7 levels at decreasing temporal resolutions. Short actions are detected at fine levels (top), long actions at coarse levels (bottom). The orange boxes show the effective temporal range at each level.

How does the temporal feature pyramid handle actions of vastly different durations?

Chapter 4: Local Self-Attention

Standard (global) self-attention computes similarity between every pair of time steps. For a video with T = 2048 time steps, that's T^2 ≈ 4.2 million pairs. This is O(T^2 D) in both time and memory, which is prohibitively expensive for long videos.

ActionFormer's solution: local self-attention. Each time step only attends to its W nearest neighbors (window size W = 19). Each of the T queries now scores W keys instead of T, which reduces complexity from O(T^2 D) to O(W·T·D). Since W is a small constant (19 << T), this is effectively O(TD).
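
As a quick arithmetic check of the savings, counting query-key pairs for the T = 2048, W = 19 example (the local count is an upper bound, since positions near the sequence edges see fewer neighbors):

```python
T, W = 2048, 19

global_pairs = T * T   # every query scores every key
local_pairs = W * T    # every query scores at most W keys
ratio = global_pairs / local_pairs

print(global_pairs, local_pairs, round(ratio, 1))
```

The ratio is simply T / W, so the advantage of local attention grows linearly with video length.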

The Mechanism

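For each time step t, self-attention is computed only within the window [t − ⌊W/2⌋, t + ⌊W/2⌋]. The queries, keys, and values are computed as usual:

$$Q = Z_l W_Q, \quad K = Z_l W_K, \quad V = Z_l W_V$$
$$S_t = \mathrm{softmax}\!\left(\frac{Q_t \, K_{[t-\lfloor W/2 \rfloor \,:\, t+\lfloor W/2 \rfloor]}^\top}{\sqrt{D_q}}\right) V_{[t-\lfloor W/2 \rfloor \,:\, t+\lfloor W/2 \rfloor]}$$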
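
The windowed attention above can be sketched in NumPy. This is a single-head, loop-based illustration under assumptions: the window is truncated at the sequence edges, the weight matrices are random, and no attempt is made at the efficiency a real implementation would need.

```python
import numpy as np


def local_self_attention(Z, W_q, W_k, W_v, window=19):
    """Single-head local self-attention over Z of shape [T, D]."""
    T = Z.shape[0]
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    half = window // 2
    scale = 1.0 / np.sqrt(Q.shape[1])
    out = np.zeros_like(V)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)  # clip window at the edges
        scores = (K[lo:hi] @ Q[t]) * scale
        weights = np.exp(scores - scores.max())          # numerically stable softmax
        weights /= weights.sum()
        out[t] = weights @ V[lo:hi]
    return out


rng = np.random.default_rng(0)
Z = rng.normal(size=(64, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = local_self_attention(Z, W_q, W_k, W_v)

# Perturb one distant time step: outputs outside its window do not change.
Z_far = Z.copy()
Z_far[40] += 10.0
out_far = local_self_attention(Z_far, W_q, W_k, W_v)
print(np.allclose(out[:31], out_far[:31]))
```

The perturbation at step 40 only reaches outputs within 9 steps of it, which is the locality property the window enforces.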

The key observation: temporal context beyond a certain range is less helpful for action localization. Whether something happened 5 minutes ago rarely matters for detecting the current action. But what happened 10 seconds ago matters a lot. Local attention captures exactly this inductive bias.

The ablation confirms this: replacing local attention (W=19) with global attention actually decreases average mAP by 0.9% on THUMOS14 (from 66.8% to 65.9%). Global attention dilutes the model's focus with irrelevant distant context and increases computation. Local attention is both cheaper and better.

Local attention + pyramid = global context: Local attention alone would limit the model to a 19-step temporal range. But on the pyramid, the same 19-step window at level Z6 (stride 16) covers 19 × 16 = 304 original time steps. At Z7 (stride 32), it covers 608 steps. So the model has both fine-grained local context (lower levels) and broad global context (upper levels), without ever computing global attention.

No Positional Encoding

Surprisingly, ActionFormer works better without positional encoding. The ablation shows that adding sinusoidal or learned positional encodings slightly hurts performance. The reason: the projection convolutions and strided depthwise convolutions in the Transformer blocks already leak positional information (convolutions are inherently position-aware through their structure). Adding explicit positional encoding is redundant and slightly harmful.

This also means ActionFormer can process any length video at inference time — there's no positional encoding that was trained for a specific sequence length.

Engineering decision (learnable scaling): Each Transformer block has learnable per-channel scaling factors α_l and ᾱ_l (initialized to small values). These are multiplied with the MSA and MLP outputs before the residual connection: Z̄_l = α_l · MSA(...) + Z_{l-1}. This stabilizes early training by starting with near-identity residual connections and gradually increasing the Transformer's contribution. Without these factors, training is unstable for deep pyramids.
Multi-head attention details: ActionFormer uses 8 attention heads with D = 512 dimensional features, so each head operates on 64 dimensions. The local window is applied identically across all heads — no head attends globally. The MLP in each Transformer block follows the standard design: Linear(D, 4D) → GELU → Linear(4D, D), expanding to 2048 intermediate dimensions.
Local vs. Global Self-Attention

Left: global attention (O(T^2) pairs). Right: local attention with window W=5 (O(W·T) pairs). Toggle between them to see the difference in attended positions for the selected time step (orange).

Why does ActionFormer NOT use positional encoding?

Chapter 5: The Decoder Heads

The decoder is deliberately simple — the heavy lifting is done by the Transformer encoder. Two lightweight 1D convolutional heads are applied to every time step at every pyramid level.

Classification Head

Three layers of 1D convolutions (kernel size 3, 512 channels), with LayerNorm on the first two layers and ReLU activation. The final layer outputs C channels (one per action category) followed by sigmoid. This is a multi-label formulation: each category is an independent binary decision. A time step can theoretically be labeled with multiple actions simultaneously (though this is rare in practice).

Regression Head

Same architecture as classification. Output: 2 channels (ds, de) followed by ReLU to ensure positive distances. The distances are normalized by the stride of the current pyramid level, so the network always predicts in "level-relative" units regardless of the actual temporal scale.

Why LayerNorm in the heads? Adding LayerNorm to the first two conv layers of the heads boosted average mAP by 4.1 points on THUMOS14 (62.7% without it vs. 66.8% with it, per the ablation in Chapter 8). Without it, the features from different pyramid levels have different magnitudes (deeper levels have larger activations due to more self-attention layers). LayerNorm normalizes these, allowing the shared heads to work consistently across all levels.

The Training Loss

The total loss per video has two terms:

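$$\mathcal{L} = \frac{1}{T_+} \sum_t \left( \mathcal{L}_{cls} + \lambda_{reg} \, \mathbb{1}_{c_t} \, \mathcal{L}_{reg} \right)$$

Where L_cls is the focal loss for classification, L_reg is the DIoU loss for boundary regression, 1_{c_t} is an indicator that equals 1 when time step t is a positive sample (inside an action) and 0 otherwise, λ_reg weights the regression term, and T_+ is the number of positive samples.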

Focal loss is essential: A 30-minute video at 1 fps has 1800 time steps. If there are 10 action instances averaging 5 seconds each, that's 50 positive steps and 1750 negative steps (97% background). Without focal loss, the massive negative class dominates the gradient, and the model learns to predict "background" everywhere. Focal loss with γ=2 reduces the loss contribution of well-classified negatives by a factor of ~100, letting the model focus on the tricky moments near action boundaries.
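
The "factor of ~100" can be checked with a few lines. This sketch implements the binary focal loss modulating term (1 − p_t)^γ for a single prediction; the class-balancing weight that focal loss often carries is omitted, and the example probability of 0.1 for an easy negative is an assumption chosen to make the arithmetic visible.

```python
import math


def focal_loss(p, target, gamma=2.0):
    """Binary focal loss for one prediction p in (0, 1) and target in {0, 1}."""
    p_t = p if target == 1 else 1.0 - p   # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(p, target):
    """Plain binary cross-entropy is focal loss with gamma = 0."""
    return focal_loss(p, target, gamma=0.0)


# Easy negative: background moment scored 0.1 for some action class.
easy_ratio = cross_entropy(0.1, 0) / focal_loss(0.1, 0)
# Hard negative: background moment scored 0.6.
hard_ratio = cross_entropy(0.6, 0) / focal_loss(0.6, 0)
print(round(easy_ratio, 2), round(hard_ratio, 2))
```

The easy negative is down-weighted about 100 times, the hard one less than 3 times, so the remaining gradient concentrates on ambiguous moments.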

Why DIoU Loss for Regression?

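Standard L1 or L2 regression on (d_s, d_e) has a problem: when the predicted segment and the ground truth segment don't overlap at all, L1/L2 still gives a gradient, but it doesn't reflect how close they are in terms of temporal overlap. DIoU (Distance IoU) loss combines two signals: an overlap term (1 − IoU between the predicted and ground-truth segments) and a penalty on the distance between the two segment centers, normalized by the length of the smallest segment that encloses both.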

The regression loss is only applied to positive samples (time steps within an action). Background moments have no meaningful (ds, de) target.
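
A one-dimensional DIoU loss can be sketched as follows. This is the generic segment-to-segment form (1 − IoU plus the squared, normalized center distance); the function name is illustrative, and the way the paper's implementation parameterizes segments via (d_s, d_e) is not reproduced here.

```python
def diou_loss_1d(pred, gt):
    """DIoU loss between two temporal segments given as (start, end)."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    iou = inter / union if union > 0 else 0.0
    enclose = max(pe, ge) - min(ps, gs)            # smallest span covering both
    center_dist = (ps + pe) / 2.0 - (gs + ge) / 2.0
    penalty = (center_dist / enclose) ** 2 if enclose > 0 else 0.0
    return 1.0 - iou + penalty


perfect = diou_loss_1d((10.0, 20.0), (10.0, 20.0))
near_miss = diou_loss_1d((0.0, 10.0), (20.0, 30.0))
far_miss = diou_loss_1d((0.0, 10.0), (40.0, 50.0))
print(perfect, round(near_miss, 3), round(far_miss, 3))
```

Both misses have zero IoU, yet the farther one is penalized more, so the loss still distinguishes "almost overlapping" from "nowhere near".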

Tensor shapes through the decoder: At each pyramid level l, the features Z_l ∈ R^{T_l × D} are fed through the classification head: Conv1d(D, D, 3) + LN + ReLU → Conv1d(D, D, 3) + LN + ReLU → Conv1d(D, C, 3) + sigmoid → [T_l, C]. Regression head: same but → Conv1d(D, 2, 3) + ReLU → [T_l, 2]. Total across all levels: ∑ T_l predictions. Both heads share weights across levels.
Why is focal loss essential for ActionFormer's training?

Chapter 6: Training Details

ActionFormer is trained end-to-end with Adam optimizer on the combined classification + regression loss. Several training tricks are critical for performance.

Center Sampling

Not every time step within an action is labeled as positive. Only time steps near the center of the action (within α = 1.5 strides of the center) are considered positive for training. This has two benefits:

Center sampling adds +1.4% average mAP on THUMOS14. It doesn't affect inference — at test time, every time step still makes a prediction.
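
A sketch of how center sampling selects positives at one pyramid level. Assumptions made for illustration: the step with index i at a level of stride s sits at time i · s, the sampling region is clipped to the action's extent, and the function name is hypothetical.

```python
def center_sample_positives(start, end, stride, num_steps, radius=1.5):
    """Indices at this level that count as positives for the action [start, end]."""
    center = (start + end) / 2.0
    lo = max(start, center - radius * stride)   # never leave the action itself
    hi = min(end, center + radius * stride)
    return [i for i in range(num_steps) if lo <= i * stride <= hi]


# Action from step 100 to 200, seen at a stride-8 level of a 256-step video.
sampled = center_sample_positives(100, 200, stride=8, num_steps=32)
inside = [i for i in range(32) if 100 <= i * 8 <= 200]
print(sampled, len(inside))
```

Only 3 of the 13 time steps that fall inside the action are trained as positives at this level; the rest are not used as positive examples.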

Training Configuration

| Setting | THUMOS14 | ActivityNet 1.3 | EPIC-Kitchens 100 |
|---|---|---|---|
| Features | I3D (2-stream) | I3D / R(2+1)D+TSP | SlowFast |
| Feature dim | 2048 | 2048 | 2304 |
| Max seq length | 2304 | 192 (downsampled) | 2304 |
| Attention window | 19 | 11 | 9 |
| Epochs | 30 | 15 | 30 |
| Optimizer | Adam | Adam | Adam |
| Learning rate | 1e-4 | 1e-4 | 1e-4 |
| Warmup | 5 epochs | 5 epochs | 5 epochs |

Training is fast: THUMOS14 has only 200 training videos. At 30 epochs with pre-extracted features, training takes a few hours on a single GPU. No multi-day training runs, no 8-GPU clusters. The expensive part is feature extraction (I3D, SlowFast), which is done offline once. This is a major practical advantage: you can iterate on the model quickly.
Warmup is critical: Without the 5-epoch learning rate warmup, training often diverges — the Transformer's self-attention is sensitive to initial weight scales. The learnable scaling factors α help, but warmup provides additional stability. This was also observed in ViT and DeiT for vision Transformers.
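
A learning rate warmup of this kind might look like the sketch below. The linear ramp and the constant rate after warmup are assumptions for illustration only; the text specifies just the base rate (1e-4) and the 5-epoch warmup length, not the schedule's shape.

```python
def warmup_lr(epoch, base_lr=1e-4, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch with an assumed linear warmup."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr  # held constant afterwards purely for illustration


schedule = [warmup_lr(e) for e in range(8)]
print(schedule)
```

The first epoch runs at one fifth of the base rate, which limits how far the attention weights can move before the scaling factors and normalization statistics settle.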

Variable-Length Handling

Videos vary wildly in duration: THUMOS14 videos can be 2-6 minutes, ActivityNet videos can be 5-30+ minutes. During training, sequences are padded or cropped to a fixed maximum length, with proper attention masking to prevent padded positions from influencing the output. This is equivalent to training with sliding windows.
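
The pad-or-crop step with its validity mask can be sketched in plain Python. This is an illustration under assumptions: features are a list of per-step vectors, crops are taken at a uniformly random offset, padding uses zeros, and the function name is hypothetical.

```python
import random


def pad_or_crop(features, max_len, pad_value=0.0, rng=random):
    """Return (window, mask): exactly max_len steps plus a validity mask."""
    T = len(features)
    if T >= max_len:
        start = rng.randint(0, T - max_len)      # random crop offset (inclusive)
        window = features[start:start + max_len]
        mask = [True] * max_len
    else:
        dim = len(features[0])
        padding = [[pad_value] * dim for _ in range(max_len - T)]
        window = features + padding
        mask = [True] * T + [False] * (max_len - T)  # False marks padded steps
    return window, mask


short_video = [[float(i)] for i in range(5)]
long_video = [[float(i)] for i in range(10)]
padded, padded_mask = pad_or_crop(short_video, max_len=8)
cropped, cropped_mask = pad_or_crop(long_video, max_len=4)
print(padded_mask, cropped)
```

The mask is what the attention layers would consult so that padded positions never contribute to any real time step's output.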

An important finding: varying the maximum input sequence length during training has little impact on performance. The model generalizes to different lengths at inference because (a) there's no positional encoding, and (b) local self-attention doesn't depend on absolute position — only relative context within the window matters.

Dataset characteristics drive design choices: THUMOS14 has ~15 actions per video in short videos → many positive samples, balanced training, window size 19 works well. ActivityNet has ~1.5 actions per video in extremely long videos → extreme imbalance, features downsampled to fixed length 192, window size 11 (less context needed since features are already compressed). EPIC-Kitchens has ~128 actions per video in long egocentric sequences → many overlapping fine-grained actions, window size 9 (egocentric actions are local). The attention window size is the main per-dataset hyperparameter.
What is center sampling and why does it help?

Chapter 7: Inference & Decoding

At inference, the full video sequence is fed through ActionFormer in a single forward pass. Since there's no positional encoding, the model handles any video length.

Step 1: Forward Pass

The pre-extracted features X = {x_1, ..., x_T} go through the encoder, producing the pyramid Z = {Z_1, ..., Z_L}. The shared heads produce (p(a_t), d_t^s, d_t^e) for every time step t at every level l.

Step 2: Decode Candidates

Each time step t at level l produces a candidate action:

$$a_t = \arg\max \, p(a_t), \quad s_t = t - d_t^s, \quad e_t = t + d_t^e$$

For a video of T = 2048 steps, summing the level lengths T + T/2 + T/4 + ... + T/32 gives roughly 4000 candidates. Most will be background (low p(a_t)) and are filtered by a confidence threshold.
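
The candidate count and the per-level decoding can be sketched together. Assumptions for illustration: offsets are predicted in stride-normalized units and multiplied back by the stride, index i at stride s sits at time i · s, and the confidence threshold of 0.1 is an arbitrary example value.

```python
def decode_level(probs, offsets, stride, threshold=0.1):
    """Decode one level's outputs into (start, end, label, score) candidates."""
    candidates = []
    for i, (p, (d_s, d_e)) in enumerate(zip(probs, offsets)):
        label = max(range(len(p)), key=p.__getitem__)
        if p[label] < threshold:
            continue                      # treated as background
        t = i * stride                    # position in original time steps
        candidates.append((t - d_s * stride, t + d_e * stride, label, p[label]))
    return candidates


# Three moments at a stride-4 level; only the middle one is confident.
probs = [[0.02, 0.01], [0.10, 0.85], [0.03, 0.04]]
offsets = [(0.5, 0.5), (1.0, 2.0), (0.5, 0.5)]
found = decode_level(probs, offsets, stride=4)

# Moments summed over strides 1 through 32 for a 2048-step video.
total_moments = sum(2048 // s for s in (1, 2, 4, 8, 16, 32))
print(found, total_moments)
```

The geometric sum stays below 2T no matter how many coarser levels are added, so the candidate pool grows only linearly with video length.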

Step 3: Soft-NMS

Multiple candidates may overlap in time (especially from adjacent time steps and adjacent pyramid levels). Soft-NMS suppresses overlapping detections by decaying their confidence scores rather than hard-deleting them. This is gentler than standard NMS and preserves detections of overlapping actions (e.g., simultaneous "running" and "dribbling").

Why Soft-NMS, not standard NMS? Standard NMS with IoU threshold 0.5 would delete all overlapping detections, keeping only the highest-confidence one. But in TAL, actions can genuinely overlap in time (you can "talk" while "walking"). Soft-NMS decays overlapping scores by a Gaussian function: score → score · exp(−IoU^2 / σ). This preserves truly overlapping actions while still suppressing duplicate detections of the same action.
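
A minimal Gaussian Soft-NMS over temporal segments, as a sketch. Assumptions: detections are (start, end, score) tuples handled class-agnostically, σ = 0.5 and the pruning threshold are example values, and the function names are illustrative rather than taken from any released code.

```python
import math


def temporal_iou(a, b):
    """IoU of two segments; only the first two entries (start, end) are used."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(dets, sigma=0.5, min_score=0.001):
    """Return detections in pick order with Gaussian-decayed scores."""
    remaining = [list(d) for d in dets]
    kept = []
    while remaining:
        best = max(remaining, key=lambda d: d[2])
        remaining.remove(best)
        kept.append(tuple(best))
        for d in remaining:
            # score -> score * exp(-IoU^2 / sigma)
            d[2] *= math.exp(-temporal_iou(best, d) ** 2 / sigma)
        remaining = [d for d in remaining if d[2] >= min_score]
    return kept


dets = [(10, 20, 0.9), (11, 21, 0.8), (50, 60, 0.7)]
kept = soft_nms(dets)
print(kept)
```

The near-duplicate of the top detection keeps a reduced score instead of vanishing, while the non-overlapping detection is untouched and moves ahead of it in the ranking.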

Multi-Level Decoding

Each pyramid level independently produces detections. A short "kick" action detected at level Z1 might also produce a (weaker) detection at level Z3. These cross-level duplicates are handled naturally by Soft-NMS: the higher-confidence detection survives, and the duplicate's score gets decayed. This is why a single round of Soft-NMS suffices — it handles both within-level and cross-level duplicates simultaneously.

The regression targets at each level are clipped to a predefined range. Level Z1 (stride 1) only predicts actions with ds + de in [0, 4) time steps. Level Z7 (stride 32) predicts actions with ds + de in [128, ∞). This prevents a fine-resolution level from trying to predict a 5-minute action (which would require huge regression values) and vice versa.
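
The level assignment can be written as a simple range lookup. This sketch copies the ranges from the table in Chapter 3 and follows the text in using the action's total extent (d_s + d_e) as the lookup key; the constant and function names are illustrative.

```python
# Regression ranges per level, as listed in the Chapter 3 table.
REG_RANGES = [(0, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128),
              (128, float("inf"))]


def assign_level(extent):
    """Return the 1-indexed level whose range contains extent = d_s + d_e."""
    for level, (lo, hi) in enumerate(REG_RANGES, start=1):
        if lo <= extent < hi:
            return level
    raise ValueError("extent must be non-negative")


print(assign_level(3), assign_level(180))
```

A 3-step "long jump" lands on the finest level and a 180-step "cooking" activity on the coarsest, so no level is asked to regress far outside its own scale.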

Score fusion (optional): On ActivityNet, ActionFormer can optionally fuse its per-moment action scores with external video-level classification scores (from a separate model). This helps because ActivityNet videos typically contain only 1-2 actions, making video-level context useful. On THUMOS14 (15+ actions per video), this helps less. With score fusion, ActionFormer reaches 36.6% average mAP on ActivityNet, outperforming all methods using the same features.

Computational Cost

At inference, the dominant cost is the Transformer encoder's self-attention. With local attention (window W = 19) and L = 7 blocks, the cost per block is O(W · Tl · D) where Tl is the sequence length at level l. Since Tl halves at each level (after the first two), the total cost is roughly O(W · T · D · L) — linear in video length. A 30-minute video processed in a single forward pass takes well under a second on a modern GPU, making ActionFormer practical for real-world applications.

Inference Pipeline

Watch how raw predictions from multiple pyramid levels are decoded into candidate actions, then refined by Soft-NMS into final detections.

Why does ActionFormer use Soft-NMS instead of standard NMS?

Chapter 8: Results

ActionFormer establishes new state of the art on all three major TAL benchmarks, surpassing both two-stage and single-stage methods by large margins.

THUMOS14 (20 action categories, 413 videos)

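| Method | Type | mAP@0.5 ↑ | mAP@0.7 ↑ | Avg mAP ↑ |
|---|---|---|---|---|
| BMN | Two-stage | 38.8% | 20.5% | 38.5% |
| G-TAD | Two-stage | 40.3% | 23.4% | 39.3% |
| MUSES | Two-stage | 56.9% | 31.0% | |
| AFSD | Single-stage | 55.5% | 31.1% | 52.0% |
| TadTR | Single-stage | 49.2% | 26.3% | 46.6% |
| ActionFormer | Single-stage | 71.0% | 43.9% | 66.8% |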

ActionFormer achieves 71.0% mAP at tIoU=0.5: +15.5 absolute percentage points over the best prior single-stage method (AFSD at 55.5%) and +14.1 over the best two-stage method (MUSES at 56.9%).

EPIC-Kitchens 100 (Egocentric, 100 hours)

| Method | Verb Avg mAP ↑ | Noun Avg mAP ↑ |
|---|---|---|
| BMN | 8.4% | 6.5% |
| G-TAD | 9.4% | 8.4% |
| ActionFormer | 23.5% | 21.9% |

On the challenging egocentric dataset, ActionFormer outperforms the stronger baseline (G-TAD) by +14.1 average mAP on verbs and +13.5 on nouns.

Key Ablations (THUMOS14)

| Change | Avg mAP | Δ |
|---|---|---|
| Full ActionFormer | 66.8% | |
| Replace Transformer with 1D ConvNet | 52.9% | −13.9% |
| Remove LayerNorm in heads | 62.7% | −4.1% |
| Remove center sampling | 65.4% | −1.4% |
| Add positional encoding | 66.6% | −0.2% |
| Global attention (no local window) | 65.9% | −0.9% |

The Transformer is the main course: Replacing the Transformer encoder with a 1D convolutional network (matched in layers and parameters) drops performance by 13.9% — from 66.8% to 52.9%. This single change accounts for the vast majority of ActionFormer's advantage. The convolutional baseline can only capture local temporal patterns; the Transformer's self-attention captures the broader context that distinguishes actions from background.
Why is the gap largest on EPIC-Kitchens? On EPIC-Kitchens (egocentric cooking), ActionFormer outperforms baselines by over 13.5% average mAP. This is much larger than the gap on ActivityNet (~1-2%). Possible reasons: (1) EPIC-Kitchens has ~128 actions per video, so there are many interleaved, overlapping actions that benefit from the Transformer's ability to model temporal relationships between them. (2) Egocentric video has fast camera motion, making boundary detection harder — the broader context from self-attention helps disambiguate. (3) ActivityNet has very few actions per video (1.5 avg), limiting the Transformer's advantage in modeling inter-action dependencies.
Results Comparison — THUMOS14

Average mAP on THUMOS14 [0.3:0.1:0.7]. ActionFormer (rightmost) vastly outperforms all prior methods.

According to the ablation, what single change causes the largest performance drop?

Chapter 9: Connections

ActionFormer bridges ideas from object detection, NLP sequence modeling, and video understanding into a clean, unified design.

Relation to FCOS (Object Detection)

ActionFormer is essentially FCOS adapted to 1D. FCOS classifies every spatial location and regresses distances to the bounding box edges. ActionFormer classifies every temporal location and regresses distances to action boundaries. The feature pyramid, center sampling, and anchor-free design all come directly from the FCOS lineage (FPN → RetinaNet → FCOS → ActionFormer).

The mapping is almost one-to-one: FPN's 2D feature maps become 1D temporal sequences. FCOS's 2D center sampling becomes 1D center sampling. RetinaNet's focal loss is adopted without changes. Even the regression normalization by stride follows FCOS exactly. This direct inheritance is a strength — decades of object detection research transfers cleanly to the temporal domain.

Relation to DETR / TadTR

TadTR (a concurrent work) uses a DETR-style set prediction approach for TAL: learned object queries, Hungarian matching, no feature pyramid. ActionFormer takes the opposite approach: no learned queries, no bipartite matching, just dense per-moment prediction with NMS. ActionFormer significantly outperforms TadTR (66.8% vs 46.6% on THUMOS14), suggesting that the dense prediction paradigm is more natural for TAL than set prediction.

Relation to Swin Transformer

ActionFormer's local self-attention within a hierarchical pyramid directly parallels Swin Transformer's window attention with shifted windows in 2D. Both use local attention for efficiency and rely on the hierarchical structure for global context. ActionFormer skips the "shifted window" trick (unnecessary in 1D with the pyramid providing cross-level communication).

Influence on Downstream Work

Pre-extracted features are both a strength and a limitation. By operating on pre-extracted features, ActionFormer avoids the massive cost of end-to-end video model training. But it also means the model can't learn task-specific low-level representations. The quality of the feature backbone (I3D, SlowFast, ViViT) upper-bounds ActionFormer's performance. Future work increasingly moves toward end-to-end training to close this gap.

Cheat Sheet

| Aspect | ActionFormer |
|---|---|
| Input | Pre-extracted video features [T, D_in] |
| Output | (class, start, end) for each detected action |
| Encoder | 7-block multiscale Transformer, local attn W=19 |
| Decoder | Shared 1D conv heads (3 layers each) |
| Pyramid | 6 levels, 2x downsampling, strides 1–32 |
| Positional encoding | None (intentionally omitted) |
| Loss | Focal loss (cls) + DIoU loss (reg) |
| Post-processing | Soft-NMS |
| Key result | 71.0% mAP@0.5 on THUMOS14 (+14.1%) |
| Training | 30 epochs, single GPU, few hours |

The broader lesson: Minimalist design wins. ActionFormer has no proposals, no anchors, no positional encoding, no complex loss function, no multi-stage pipeline. Just a Transformer + pyramid + focal loss + NMS. When the core architecture is right (local self-attention on a multiscale pyramid), the simplest decoder suffices. The 14-point mAP improvement came not from engineering tricks but from choosing the right inductive biases for temporal reasoning.
What is the key architectural parallel between ActionFormer and FCOS?