Plain ViT + two deconv layers = state-of-the-art pose estimation. No fancy modules. No hierarchical features. No domain-specific tricks. Just a transformer and a lightweight decoder.
You want to detect human body keypoints — wrists, elbows, shoulders, knees, ankles — from a cropped image of a person. This is human pose estimation, and it powers everything from sports analytics to sign language recognition to AR avatar tracking.
By 2022, the state-of-the-art methods all share a common pattern: they use increasingly complex architectures designed specifically for this task.
Every method adds more domain-specific complexity. Multi-resolution branches. Special keypoint tokens. CNN-transformer hybrids. Cross-attention modules. Each new paper is more elaborate than the last.
ViTPose demonstrates that none of this complexity is necessary. The results are striking: a plain ViT-H with just two deconvolution layers achieves 79.1 AP on COCO val, beating HRFormer-B (75.6 AP) while running about 50% faster (241 fps vs 158 fps). The simplest model wins.
ViTPose's thesis is radical in its simplicity: a plain, non-hierarchical vision transformer with MAE pre-training has such strong feature representations that a trivial decoder is all you need for state-of-the-art pose estimation.
Let's unpack why this works by considering what happens inside a ViT during self-attention. Every token (patch) attends to every other token. After 12+ layers of this global communication, each token "knows about" the entire image. Compare this to a CNN, where the receptive field grows slowly — a ResNet-50 needs deep stacking just to let distant pixels influence each other.
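The authors test two decoders: a classic decoder (two deconvolution blocks followed by a 1×1 prediction convolution) and a simple decoder (4× bilinear upsampling followed by a single 3×3 convolution).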
Here is what happens when you swap from classic to simple decoder:
| Backbone | Classic Decoder AP | Simple Decoder AP | Δ |
|---|---|---|---|
| ResNet-50 | 71.8 | 53.1 | −18.7 |
| ResNet-152 | 73.5 | 55.3 | −18.2 |
| ViT-B | 75.8 | 75.5 | −0.3 |
| ViT-L | 78.3 | 78.2 | −0.1 |
| ViT-H | 79.1 | 78.9 | −0.2 |
This finding explains why all those prior methods needed complex decoders: they were compensating for weak backbone features. With a strong enough backbone, complexity becomes overhead.
ViTPose demonstrates four surprisingly strong properties of plain ViTs for pose estimation: simplicity, scalability, flexibility, and transferability.
Let's trace a person image through the entire ViTPose pipeline. The simplicity is striking.
ViTPose follows the top-down paradigm: first detect people with a person detector, then estimate keypoints for each cropped person instance. The input is a single cropped person image $X \in \mathbb{R}^{H \times W \times 3}$, typically 256×192 pixels.
The image is split into non-overlapping patches of size d×d (default d=16). Each patch is linearly projected to a C-dimensional token:

$$F_0 = \mathrm{PatchEmbed}(X) \in \mathbb{R}^{(H/d) \times (W/d) \times C}$$
For 256×192 input with d=16: (256/16) × (192/16) = 16 × 12 = 192 tokens, each of dimension C (768 for ViT-B, 1024 for ViT-L, 1280 for ViT-H).
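The token arithmetic above is easy to check for any input size. Here is a minimal sketch; the helper name `token_grid` is made up for illustration and is not part of any ViTPose codebase.

```python
def token_grid(height, width, patch=16):
    """Return (rows, cols, num_tokens) for non-overlapping patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("input size must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


# The default ViTPose input from the text: 256x192 with d=16.
rows, cols, tokens = token_grid(256, 192)
print(rows, cols, tokens)
```

The same helper reproduces the token counts for the other input sizes discussed in this article, such as 384×288.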
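Each of N transformer layers applies the standard two-step update:

$$F'_{i+1} = F_i + \mathrm{MHSA}(\mathrm{LN}(F_i)), \qquad F_{i+1} = F'_{i+1} + \mathrm{FFN}(\mathrm{LN}(F'_{i+1}))$$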
That's it. No cross-attention. No multi-resolution branches. No feature pyramid. Just self-attention and feed-forward, repeated N times. The spatial resolution stays constant throughout — every transformer layer operates on the same 16×12 grid.
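To see that the token grid really does keep its size from layer to layer, here is a toy pre-norm layer in pure Python. It is only a sketch under loud assumptions: single-head attention with identity Q/K/V projections, a weight-free ReLU standing in for the FFN, and 4-dimensional tokens. A real ViT uses learned projections, multiple heads, and a two-layer MLP.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head attention: every token attends to every token."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)  # one weight per token in the whole grid
        out.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                    for j in range(dim)])
    return out


def toy_ffn(vec):
    """Stand-in for the feed-forward network (assumption: ReLU only)."""
    return [max(0.0, v) for v in vec]


def transformer_layer(tokens):
    """Pre-norm update: attention residual, then feed-forward residual."""
    attn = self_attention([layer_norm(t) for t in tokens])
    mid = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attn)]
    return [[a + b for a, b in zip(m, toy_ffn(layer_norm(m)))] for m in mid]


grid_h, grid_w, dim = 16, 12, 4
tokens = [[math.sin(i * (j + 1)) for j in range(dim)]
          for i in range(grid_h * grid_w)]
out = tokens
for _ in range(2):  # two layers are enough to show the shape is unchanged
    out = transformer_layer(out)
print(len(out), len(out[0]))
```

The output has the same 192 tokens of the same width as the input, which is the point: nothing in the backbone downsamples or upsamples.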
The features $F_{out}$ need to be upsampled and converted to K=17 keypoint heatmaps. The classic decoder:

$$K = \mathrm{Conv}(\mathrm{Deconv}(\mathrm{Deconv}(F_{out})))$$
Each heatmap is a 64×48 spatial probability map for one keypoint. The predicted keypoint location is the argmax of each heatmap (with sub-pixel refinement via UDP post-processing).
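The argmax step can be sketched in a few lines. This is a simplified illustration: it maps heatmap cells back to input pixels by multiplying by the stride of 4 and deliberately omits the sub-pixel UDP refinement mentioned above. The function name `decode_keypoint` is hypothetical.

```python
def decode_keypoint(heatmap, stride=4):
    """Return (x, y, score) in input-image pixels for one heatmap.

    heatmap is a list of rows; stride=4 because the heatmap is 1/4 of the
    input resolution. No sub-pixel refinement is applied (simplification).
    """
    score, row, col = max(
        (value, r, c)
        for r, cells in enumerate(heatmap)
        for c, value in enumerate(cells)
    )
    return col * stride, row * stride, score


# A 64x48 heatmap with a single peak at row 20, column 30.
heatmap = [[0.0] * 48 for _ in range(64)]
heatmap[20][30] = 0.9
x, y, score = decode_keypoint(heatmap)
print(x, y, score)
```

The peak score doubles as a confidence value for the keypoint.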
Ground-truth heatmaps are generated by placing a 2D Gaussian at each annotated keypoint location. The loss is MSE between predicted and ground-truth heatmaps:

$$\mathcal{L} = \mathrm{MSE}(K_{pred}, K_{gt})$$
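Here is a small pure-Python sketch of the target construction and the loss for a single keypoint. The Gaussian width `sigma=2.0` is an assumption chosen for illustration; the text above does not state the value ViTPose uses.

```python
import math


def gaussian_heatmap(cx, cy, height=64, width=48, sigma=2.0):
    """Unnormalized 2D Gaussian centred on (cx, cy), peak value 1.0."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def mse(pred, target):
    """Mean squared error over all heatmap cells."""
    count = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / count


gt = gaussian_heatmap(cx=30, cy=20)
blank = [[0.0] * 48 for _ in range(64)]
print(mse(gt, gt), mse(blank, gt))
```

A perfect prediction has zero loss, while an empty heatmap is penalized only around the keypoint, since the target is near zero everywhere else.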
Backbone is initialized with MAE pre-trained weights. Training uses AdamW with learning rate 5e-4, layer-wise learning rate decay, and stochastic drop path. 210 epochs total, with LR decay at epochs 170 and 200.
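The schedule can be written down directly. In this sketch the base learning rate and the milestones come from the text, while the drop factor (`GAMMA = 0.1`) and the layer-wise decay rate (`decay=0.75`) are assumptions, since the text does not give them.

```python
BASE_LR = 5e-4
MILESTONES = (170, 200)
GAMMA = 0.1  # assumption: each milestone divides the learning rate by 10


def lr_at_epoch(epoch):
    """Step schedule: multiply by GAMMA once for every milestone passed."""
    drops = sum(epoch >= milestone for milestone in MILESTONES)
    return BASE_LR * GAMMA ** drops


def layer_lr(base_lr, layer_idx, num_layers, decay=0.75):
    """Layer-wise decay: earlier layers get smaller learning rates.

    layer_idx runs from 0 (patch embedding) to num_layers (last block).
    decay=0.75 is an illustrative value, not taken from the text.
    """
    return base_lr * decay ** (num_layers - layer_idx)


for epoch in (0, 169, 170, 200, 209):
    print(epoch, lr_at_epoch(epoch))
```

Layer-wise decay keeps the early, more generic layers close to their MAE-pre-trained weights while letting the later layers adapt faster.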
The decoder comparison is ViTPose's most important experiment. It reveals what kind of features ViTs learn versus CNNs. Let's look at both decoders in detail and understand why they tell such different stories for different backbones.
    import torch.nn as nn
    import torch.nn.functional as F


    # Classic decoder: 2 deconv blocks + prediction
    class ClassicDecoder(nn.Module):
        def __init__(self, in_channels, num_keypoints=17):
            super().__init__()  # must run before submodules are assigned
            self.deconv1 = nn.Sequential(
                nn.ConvTranspose2d(in_channels, 256, 4, stride=2, padding=1),
                nn.BatchNorm2d(256),
                nn.ReLU(),
            )
            self.deconv2 = nn.Sequential(
                nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1),
                nn.BatchNorm2d(256),
                nn.ReLU(),
            )
            self.predict = nn.Conv2d(256, num_keypoints, 1)

        def forward(self, x):       # x: [B, C, H/16, W/16]
            x = self.deconv1(x)     # [B, 256, H/8, W/8]
            x = self.deconv2(x)     # [B, 256, H/4, W/4]
            return self.predict(x)  # [B, 17, H/4, W/4]


    # Simple decoder: bilinear upsample + conv
    class SimpleDecoder(nn.Module):
        def __init__(self, in_channels, num_keypoints=17):
            super().__init__()
            self.predict = nn.Conv2d(in_channels, num_keypoints, 3, padding=1)

        def forward(self, x):  # x: [B, C, H/16, W/16]
            x = F.relu(x)      # ReLU first, matching K = Conv(Bilinear(ReLU(Fout)))
            x = F.interpolate(x, scale_factor=4, mode="bilinear",
                              align_corners=False)  # [B, C, H/4, W/4]
            return self.predict(x)                  # [B, 17, H/4, W/4]
The simple decoder has essentially one learnable layer — a 3×3 convolution that maps C channels to 17 keypoint heatmaps. The bilinear upsample is parameter-free. This is as close to "linear probing" as a decoder can get.
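To put numbers on how light the simple decoder is, you can count the learnable parameters implied by the layer shapes in the two decoder listings above. This sketch assumes PyTorch's default of a bias term on every convolution and a scale plus shift per BatchNorm channel; the helper names are made up for illustration.

```python
def conv_params(c_in, c_out, kernel, bias=True):
    """Weights (+ optional bias) of a square conv or transposed conv."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def batchnorm_params(channels):
    """Learnable scale and shift per channel."""
    return 2 * channels


def classic_decoder_params(c_in, num_keypoints=17, mid=256):
    return (
        conv_params(c_in, mid, 4) + batchnorm_params(mid)   # deconv block 1
        + conv_params(mid, mid, 4) + batchnorm_params(mid)  # deconv block 2
        + conv_params(mid, num_keypoints, 1)                # 1x1 prediction
    )


def simple_decoder_params(c_in, num_keypoints=17):
    return conv_params(c_in, num_keypoints, 3)              # 3x3 prediction


for name, dim in (("ViT-B", 768), ("ViT-L", 1024), ("ViT-H", 1280)):
    print(name, classic_decoder_params(dim), simple_decoder_params(dim))
```

For ViT-B (C=768) this works out to roughly 4.2M parameters for the classic decoder versus about 0.12M for the simple one, a difference of more than 30×.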
ResNet-152 drops from 73.5 to 55.3 AP (−18.2). Why?
CNN features at the final layer are high-level and abstract but spatially coarse. They encode "there's a person here" but not "the left wrist is at pixel (134, 87)." The classic decoder's two deconv blocks with learnable parameters perform non-trivial spatial refinement — they learn to "unpack" the abstract features back into precise spatial locations. Without this learned upsampling, you just get blurry bilinear interpolation of features that were never designed to be spatially precise.
ViT features are fundamentally different. Through N layers of global self-attention, every token encodes information about every other token's position. The features at each spatial location already contain precise, globally-informed positional information. Bilinear upsampling of these features preserves this precision because the information is encoded in the channel dimensions, not just the spatial layout.
This is also why the AP50 metric (AP at an object keypoint similarity, or OKS, threshold of 0.50, which is a loose localization tolerance) shows almost no difference between decoders for ViTs (90.7 vs 90.6 for ViT-B). The keypoints are already in roughly the right place; the decoder only matters for fine-grained precision.
Because ViTPose's architecture is so simple, scaling it is trivial: just use a bigger ViT. No need to redesign multi-resolution branches, adjust cross-attention modules, or rebalance feature pyramid weights. You literally swap the backbone and retrain.
| Model | Backbone | Layers | Dim | Params | Speed (fps) | AP |
|---|---|---|---|---|---|---|
| ViTPose-B | ViT-B | 12 | 768 | 86M | 944 | 75.8 |
| ViTPose-L | ViT-L | 24 | 1024 | 307M | 411 | 78.3 |
| ViTPose-H | ViT-H | 32 | 1280 | 632M | 241 | 79.1 |
| ViTPose-G | ViTAE-G | — | — | ~1B | — | 80.9* |
* ViTPose-G uses larger input (576×432), multi-dataset training, and a better detector. Shown for completeness.
Three observations from this table:
1. Consistent gains across scale. B→L gives +2.5 AP and L→H gives a further +0.8 AP. The increments shrink, but accuracy has not saturated: bigger models keep improving. Compare this to CNNs, where ResNet-50 to ResNet-152 gives only +1.7 AP and larger ResNets show diminishing returns.
2. Speed remains competitive. ViTPose-H runs at 241 fps on an A100 with batch size 64. That's faster than HRFormer-B (158 fps) despite having roughly 15× more parameters. Why? Because ViTPose is a single-branch architecture operating at 1/16 resolution, while HRFormer maintains parallel multi-resolution branches, the finest of them at 1/4 resolution. Fewer branches, lower resolution → better hardware utilization.
3. The Pareto front shifts. No prior method matches ViTPose at any throughput level. At 944 fps, ViTPose-B (75.8 AP) beats HRNet-W48 (75.1 AP at 649 fps) — both better accuracy AND faster. At 241 fps, ViTPose-H (79.1 AP) crushes HRFormer-B (75.6 AP at 158 fps).
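You can check the Pareto claim mechanically from the 256×192 speed and accuracy figures quoted in this article. A model is on the front if no other model is at least as fast and at least as accurate, with one of the two strictly better. The function below is a generic sketch, not code from the paper.

```python
# (name, fps, AP on COCO val) at 256x192, as quoted in this article.
models = [
    ("SimpleBaseline-R152", 829, 73.5),
    ("HRNet-W48", 649, 75.1),
    ("TokenPose-L/D24", 602, 75.8),
    ("TransPose-H/A6", 309, 75.8),
    ("HRFormer-B", 158, 75.6),
    ("ViTPose-B", 944, 75.8),
    ("ViTPose-L", 411, 78.3),
    ("ViTPose-H", 241, 79.1),
]


def pareto_front(entries):
    """Names of entries not dominated in both speed and accuracy."""
    front = []
    for name, fps, ap in entries:
        dominated = any(
            f >= fps and a >= ap and (f > fps or a > ap)
            for _, f, a in entries
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(models))
```

With these numbers, only the three ViTPose models survive; every prior method in the list is both slower and no more accurate than some ViTPose variant.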
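When you go from ViT-B to ViT-H, you're increasing depth (12 → 32 layers), width (768 → 1280 channels per token), and total capacity (86M → 632M parameters).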
Each additional layer lets tokens refine their understanding of the image. For pose estimation, this matters most for occluded keypoints: a wrist hidden behind a torso requires deep reasoning about body structure. More layers = better reasoning = more accurate localization of hard keypoints.
ViTPose demonstrates flexibility along four axes. Each one challenges a conventional assumption about how pose estimation models should be trained.
The standard recipe: pre-train the backbone on ImageNet-1K (1.3M images), then finetune on COCO for pose. But what if you don't have (or don't want to use) ImageNet?
| Pre-training Data | Volume | AP |
|---|---|---|
| ImageNet-1K | 1.3M | 75.8 |
| COCO only (cropped persons) | 150K | 74.5 |
| COCO + AI Challenger (cropped) | 500K | 75.8 |
| COCO + AI Challenger (no crop) | 300K | 75.8 |
With COCO + AI Challenger, ViTPose matches ImageNet pre-training with less than half the data. Even COCO alone (150K images, roughly 9× fewer than ImageNet) only drops 1.3 AP. The MAE pre-training learns useful representations from whatever images you give it, and domain-specific data is actually more data-efficient than generic ImageNet images.
| Input Size | Tokens | AP |
|---|---|---|
| 224 × 224 | 196 | 74.9 |
| 256 × 192 | 192 | 75.8 |
| 384 × 288 | 432 | 76.9 |
| 576 × 432 | 972 | 77.8 |
Performance scales smoothly with resolution. An interesting detail: 256×256 (256 tokens) gives the same 75.8 AP as 256×192 (192 tokens). Why? The average person bounding box in COCO has a height-to-width ratio of about 4:3, so the 256×192 input already matches the typical person shape and the extra width of a square input mostly adds background.
Full self-attention on high-resolution features (1/8 instead of 1/16 by using stride-8 patch embedding) gives the best results but requires 36GB memory even with FP16. ViTPose explores alternatives:
| Attention Type | Memory | AP |
|---|---|---|
| Full attention (1/8) | 36.1 GB | 77.4 |
| Window (8×8) | 21.2 GB | 66.4 |
| Window + Shift | 21.2 GB | 76.4 |
| Window + Pool | 22.9 GB | 76.4 |
| Window + Shift + Pool | 22.9 GB | 76.8 |
| Window (16×12) + Shift + Pool | 26.8 GB | 77.1 |
Pure window attention is catastrophic (−11 AP) because no cross-window communication means the model can't reason about distant keypoints. But adding shift-window (from Swin) or pooling-window restores most of the performance at roughly 40% less memory. The two mechanisms are complementary: combined, they approach full attention quality (76.8 vs 77.4) at drastically reduced cost.
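A quick count of query-key pairs shows where the savings come from, and also why pure windows lose so much context. This sketch assumes a 256×192 input at 1/8 resolution (a 32×24 token grid); the text above does not state which input size the table uses, so treat the absolute numbers as illustrative.

```python
def attention_pairs_full(grid_h, grid_w):
    """Every token attends to every token: N^2 query-key pairs."""
    tokens = grid_h * grid_w
    return tokens * tokens


def attention_pairs_windowed(grid_h, grid_w, win_h, win_w):
    """Tokens attend only within their own non-overlapping window."""
    if grid_h % win_h or grid_w % win_w:
        raise ValueError("grid must be divisible by the window size")
    num_windows = (grid_h // win_h) * (grid_w // win_w)
    return num_windows * (win_h * win_w) ** 2


grid_h, grid_w = 256 // 8, 192 // 8  # 32 x 24 tokens (assumed input size)
full = attention_pairs_full(grid_h, grid_w)
win_8x8 = attention_pairs_windowed(grid_h, grid_w, 8, 8)
win_16x12 = attention_pairs_windowed(grid_h, grid_w, 16, 12)
print(full, win_8x8, win_16x12)
```

With 8×8 windows each token sees only 64 of the 768 tokens, which is why shifting or pooling is needed to pass information between windows. The measured memory in the table shrinks far less than the pair count because attention maps are only one part of the activation footprint.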
| Finetuning Strategy | AP | Δ |
|---|---|---|
| Full finetuning (all parameters) | 75.8 | — |
| Freeze MHSA, train FFN + decoder | 75.1 | −0.7 |
| Freeze FFN, train MHSA + decoder | 72.8 | −3.0 |
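The partial-finetuning rows above boil down to choosing which parameters receive gradients. Here is a framework-free sketch of that selection. The parameter names follow a common `blocks.N.attn` / `blocks.N.mlp` naming pattern, which is an assumption for illustration and may not match the actual ViTPose checkpoint keys.

```python
def is_trainable(param_name, frozen_module=None):
    """Decide whether a parameter is updated during finetuning.

    frozen_module: "attn" freezes MHSA, "mlp" freezes the FFN,
    None means full finetuning. Names are hypothetical.
    """
    if frozen_module is None:
        return True
    return f".{frozen_module}." not in param_name


# Hypothetical parameter names for one block plus the decoder.
names = [
    "backbone.blocks.0.attn.qkv.weight",
    "backbone.blocks.0.attn.proj.weight",
    "backbone.blocks.0.mlp.fc1.weight",
    "backbone.blocks.0.mlp.fc2.weight",
    "decoder.predict.weight",
]

freeze_mhsa = [n for n in names if is_trainable(n, "attn")]
freeze_ffn = [n for n in names if is_trainable(n, "mlp")]
print(freeze_mhsa)
print(freeze_ffn)
```

In PyTorch, a filter like this would typically be applied by setting `requires_grad` to `False` on the frozen parameters before building the optimizer.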
Here's a practical problem: you have multiple pose datasets (COCO, AI Challenger, MPII), each with different keypoint definitions, different numbers of keypoints, and different annotation styles. Most methods train separate models for each dataset. Can we train one model on all of them?
ViTPose's decoder is so lightweight that adding extra decoders for additional datasets is nearly free. The strategy:
| Training Data | AP on COCO val | Δ |
|---|---|---|
| COCO only | 75.8 | — |
| COCO + AI Challenger | 77.0 | +1.2 |
| COCO + AIC + MPII | 77.1 | +1.3 |
Adding AI Challenger (which has 350K+ labeled instances) gives a big +1.2 AP boost. Adding MPII (only 40K instances) gives another +0.1 AP despite being much smaller. The shared backbone learns better features when trained on diverse data, even though the target evaluation is COCO-only.
The extra computational cost is minimal: the three decoders together add less than 2% to the total FLOPs because the backbone dominates the compute. With multi-dataset training, ViTPose-H reaches 79.5 AP on COCO val — a +0.4 AP improvement for almost no extra cost.
A remarkable detail: after multi-dataset training, ViTPose is evaluated directly on each dataset's val set without any dataset-specific finetuning. On OCHuman (a benchmark of heavily occluded people), ViTPose-G achieves 92.8 AP, nearly 19 AP above the previous state-of-the-art (MIPNet at 74.1 AP). The plain ViT backbone handles occlusion naturally through global self-attention, without any occlusion-specific modules.
ViTPose-H is great, but at 632M parameters it's too large for some deployment scenarios. Can we transfer the knowledge of the large model into a smaller one? Standard knowledge distillation (KD) works, but ViTPose introduces a clever addition: the knowledge token.
The baseline approach: train the student to match the teacher's heatmap outputs.
This alone gives +0.2 AP when transferring from ViTPose-L (teacher) to ViTPose-B (student): 75.8 → 76.0. Modest but consistent.
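Written out, the output-distillation objective described above is just an MSE between two sets of heatmaps. The symbol names below ($K_s$ for the student's heatmaps, $K_t$ for the teacher's, and the subscript on the loss) are assumptions chosen to match the notation used elsewhere in this article.

```latex
% Output distillation: the student's heatmaps K_s are pulled toward
% the frozen teacher's heatmaps K_t for the same input image.
\mathcal{L}_{od} = \mathrm{MSE}(K_s, K_t)
```

It has the same form as the training loss $\mathrm{MSE}(K_{pred}, K_{gt})$, with the teacher's prediction taking the place of the ground truth.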
Here is the novel idea. We add a single learnable token t to the teacher's input, alongside the visual tokens. Then:
The key equation for optimizing the knowledge token:

$$t^* = \arg\min_t \, \mathrm{MSE}(T(\{t; X\}), K_{gt})$$

where $T$ is the frozen teacher and $K_{gt}$ is the ground-truth heatmaps.
This is the fascinating part. The token t* is a single vector (dimension 1024 for ViT-L) that, when added to the teacher's input, modulates its attention to improve predictions. Through self-attention, every visual token can attend to t*, effectively receiving a "hint" that biases the teacher toward more accurate outputs.
When this same token is given to the student, it provides a similar hint. Think of it as a compressed summary of the teacher's expertise — a single token that captures what the teacher "wishes" it could tell the student about how to process human body images.
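The mechanics of "freeze the teacher, optimize only the token" can be shown with a deliberately tiny toy. Everything here is an assumption made for illustration: the teacher is a fixed scalar linear map rather than a ViT, the token is one number rather than a 1024-dimensional vector, and plain gradient descent stands in for the real optimizer.

```python
TEACHER_WEIGHT = 0.8  # frozen: never updated below


def teacher(token, x):
    """Toy frozen teacher: output depends on the input and the extra token."""
    return TEACHER_WEIGHT * (x + token)


def token_loss(token, inputs, targets):
    """MSE between the teacher's outputs and the ground truth."""
    return sum(
        (teacher(token, x) - y) ** 2 for x, y in zip(inputs, targets)
    ) / len(inputs)


def optimize_token(inputs, targets, steps=200, lr=0.1):
    """Gradient descent on the token only; the teacher stays fixed."""
    token = 0.0
    for _ in range(steps):
        grad = sum(
            2 * (teacher(token, x) - y) * TEACHER_WEIGHT
            for x, y in zip(inputs, targets)
        ) / len(inputs)
        token -= lr * grad
    return token


inputs = [1.0, 2.0, 3.0]
targets = [1.2, 2.0, 2.8]
token = optimize_token(inputs, targets)
print(token, token_loss(0.0, inputs, targets), token_loss(token, inputs, targets))
```

The learned token lowers the frozen teacher's error without touching its weights, which is the same role $t^*$ plays before it is handed to the student.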
| Method | Teacher | Student AP | Δ |
|---|---|---|---|
| Baseline (no distillation) | — | 75.8 | — |
| Output distillation only | ViTPose-L | 76.0 | +0.2 |
| Knowledge token only | ViTPose-L | 76.3 | +0.5 |
| Output + Knowledge token | ViTPose-L | 76.6 | +0.8 |
The knowledge token alone (+0.5 AP) is more effective than output distillation alone (+0.2 AP). Combined, they give +0.8 AP with negligible extra memory (one additional token = one extra row and column in the attention matrix). The two methods are complementary: output distillation aligns predictions, while the knowledge token provides structural guidance.
| Model | Backbone | Resolution | Speed (fps) | AP |
|---|---|---|---|---|
| SimpleBaseline | ResNet-152 | 256×192 | 829 | 73.5 |
| HRNet | HRNet-W48 | 256×192 | 649 | 75.1 |
| HRNet | HRNet-W48 | 384×288 | 309 | 76.3 |
| TokenPose-L/D24 | HRNet-W48 | 256×192 | 602 | 75.8 |
| TransPose-H/A6 | HRNet-W48 | 256×192 | 309 | 75.8 |
| HRFormer-B | HRFormer-B | 384×288 | 78 | 77.2 |
| ViTPose-B | ViT-B | 256×192 | 944 | 75.8 |
| ViTPose-B* | ViT-B | 256×192 | 944 | 77.1 |
| ViTPose-L* | ViT-L | 256×192 | 411 | 78.7 |
| ViTPose-H* | ViT-H | 256×192 | 241 | 79.5 |
* = multi-dataset training
Key observations:
Using ViTAE-G (1B parameters), 576×432 resolution, multi-dataset training, and a stronger person detector:
| Method | AP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|
| UDP++ (17-model ensemble, 2020 COCO winner) | 80.8 | 94.9 | 88.1 | 77.4 | 85.7 |
| ViTPose (single model) | 80.9 | 94.8 | 88.1 | 77.5 | 85.9 |
| ViTPose+ (3-model ensemble) | 81.1 | 95.0 | 88.2 | 77.8 | 86.0 |
OCHuman is one of the hardest pose benchmarks: heavily overlapping, occluded people. Prior methods top out around 74 AP. ViTPose-G hits 92.8 AP, an improvement of nearly 19 points. Why? Global self-attention naturally handles occlusion. When a wrist is behind a torso, the wrist token can attend to visible body parts (head, other hand, feet) to infer the hidden location. No occlusion-specific module needed.
ViTPose sits at the intersection of two trends:
| Concept | Formula | What It Means |
|---|---|---|
| Patch embedding | $F_0 = \mathrm{PatchEmbed}(X) \in \mathbb{R}^{(H/d) \times (W/d) \times C}$ | Image → token grid (d=16 default) |
| Transformer layer | $F'_{i+1} = F_i + \mathrm{MHSA}(\mathrm{LN}(F_i))$, then $F_{i+1} = F'_{i+1} + \mathrm{FFN}(\mathrm{LN}(F'_{i+1}))$ | Self-attention, then feed-forward, each with its own residual |
| Classic decoder | $K = \mathrm{Conv}(\mathrm{Deconv}(\mathrm{Deconv}(F_{out})))$ | 2× upsample twice + predict 17 heatmaps |
| Simple decoder | $K = \mathrm{Conv}(\mathrm{Bilinear}(\mathrm{ReLU}(F_{out})))$ | ReLU, 4× bilinear upsample, then predict |
| Training loss | $\mathcal{L} = \mathrm{MSE}(K_{pred}, K_{gt})$ | Match predicted heatmaps to Gaussian GT |
| Knowledge token | $t^* = \arg\min_t \, \mathrm{MSE}(T(\{t; X\}), K_{gt})$ | Optimize token against frozen teacher |