Xu, Zhang, Zhang, Tao — University of Sydney & JD Explore Academy, NeurIPS 2022

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Plain ViT + two deconv layers = state-of-the-art pose estimation. No fancy modules. No hierarchical features. No domain-specific tricks. Just a transformer and a lightweight decoder.

Prerequisites: Vision Transformers (ViT) + Heatmap-based pose estimation

Chapter 0: The Problem

You want to detect human body keypoints — wrists, elbows, shoulders, knees, ankles — from a cropped image of a person. This is human pose estimation, and it powers everything from sports analytics to sign language recognition to AR avatar tracking.

By 2022, the state-of-the-art methods all share a common pattern: they use increasingly complex architectures designed specifically for this task.

Every method adds more domain-specific complexity. Multi-resolution branches. Special keypoint tokens. CNN-transformer hybrids. Cross-attention modules. Each new paper is more elaborate than the last.

The question nobody asked: What if none of this complexity is necessary? What if a plain vision transformer — the simplest possible structure, with zero pose-specific design — could match or beat all of these methods? Not as a starting point for further improvements, but as the final answer?

This is exactly what ViTPose demonstrates. And the results are shocking: a plain ViT-H with just two deconvolution layers achieves 79.1 AP on COCO val — beating HRFormer-B (75.6 AP) while running 50% faster (241 fps vs 158 fps). The simplest model wins.

What is the common pattern in pre-ViTPose pose estimation methods?

Chapter 1: The Key Insight

ViTPose's thesis is radical in its simplicity: a plain, non-hierarchical vision transformer with MAE pre-training has such strong feature representations that a trivial decoder is all you need for state-of-the-art pose estimation.

Let's unpack why this works by considering what happens inside a ViT during self-attention. Every token (patch) attends to every other token. After 12+ layers of this global communication, each token "knows about" the entire image. Compare this to a CNN, where the receptive field grows slowly — a ResNet-50 needs deep stacking just to let distant pixels influence each other.

The Decoder Experiment That Proves the Point

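The authors test two decoders: a classic decoder (two deconvolution blocks followed by a 1×1 convolution) and a simple decoder (a ReLU, 4× bilinear upsampling, and a single 3×3 convolution).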

Here is what happens when you swap from classic to simple decoder:

| Backbone | Classic Decoder AP | Simple Decoder AP | Δ |
|---|---|---|---|
| ResNet-50 | 71.8 | 53.1 | −18.7 |
| ResNet-152 | 73.5 | 55.3 | −18.2 |
| ViT-B | 75.8 | 75.5 | −0.3 |
| ViT-L | 78.3 | 78.2 | −0.1 |
| ViT-H | 79.1 | 78.9 | −0.2 |

This is the key result of the entire paper. ResNets lose about 18 AP (roughly 25% relative) when you simplify the decoder. ViTs lose 0.1 to 0.3 AP (under 0.5% relative). The ViT backbone has already done the heavy lifting: its features are so close to linearly separable that almost any decoder works. The CNN features are not, so the decoder has to do substantial non-linear processing to produce good heatmaps.
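The absolute and relative drops quoted above can be recomputed directly from the table. The following is a small arithmetic sketch; the dictionary and function names are illustrative and not from the paper.

```python
# (classic AP, simple AP) per backbone, copied from the decoder table above.
DECODER_AP = {
    "ResNet-50": (71.8, 53.1),
    "ResNet-152": (73.5, 55.3),
    "ViT-B": (75.8, 75.5),
    "ViT-L": (78.3, 78.2),
    "ViT-H": (79.1, 78.9),
}


def decoder_drop(classic_ap, simple_ap):
    """Return (absolute AP drop, relative drop in percent), rounded to 0.1."""
    drop = classic_ap - simple_ap
    return round(drop, 1), round(100.0 * drop / classic_ap, 1)


for name, (classic, simple) in DECODER_AP.items():
    absolute, relative = decoder_drop(classic, simple)
    print(f"{name:>10}: -{absolute} AP ({relative}% relative)")
```

The ResNet rows land around 25% relative, while every ViT row stays below 0.5%.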

This finding explains why all those prior methods needed complex decoders: they were compensating for weak backbone features. With a strong enough backbone, complexity becomes overhead.

The Four Properties

ViTPose demonstrates four surprisingly strong properties of plain ViTs for pose estimation:

  1. Simplicity: No domain-specific architecture needed
  2. Scalability: Performance improves consistently from 100M to 1B parameters
  3. Flexibility: Works with different pre-training data, resolutions, attention types, and finetuning strategies
  4. Transferability: Knowledge from large models transfers to small ones via a learnable token

When switching from the classic decoder to the simple decoder, ViT-L loses 0.1 AP while ResNet-152 loses 18.2 AP. What does this tell us?

Chapter 2: The Architecture

Let's trace a person image through the entire ViTPose pipeline. The simplicity is striking.

Step 1: Person Detection (External)

ViTPose follows the top-down paradigm: first detect people with a person detector, then estimate keypoints for each cropped person instance. The input is a single cropped person image $x \in \mathbb{R}^{H \times W \times 3}$, typically 256×192 pixels.

Step 2: Patch Embedding

The image is split into non-overlapping patches of size d×d (default d=16). Each patch is linearly projected to a C-dimensional token:

$$F_0 = \mathrm{PatchEmbed}(x) \in \mathbb{R}^{\frac{H}{d} \times \frac{W}{d} \times C}$$

For 256×192 input with d=16: (256/16) × (192/16) = 16 × 12 = 192 tokens, each of dimension C (768 for ViT-B, 1024 for ViT-L, 1280 for ViT-H).
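The token arithmetic above is easy to check in code. This is a minimal sketch; `token_grid` and `EMBED_DIM` are illustrative names, and the divisibility check is an assumption about how you would want to guard the input size.

```python
def token_grid(height, width, patch=16):
    """Return (rows, cols, num_tokens) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("input size must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


# Token dimension C per backbone, as listed in the text.
EMBED_DIM = {"ViT-B": 768, "ViT-L": 1024, "ViT-H": 1280}

rows, cols, num_tokens = token_grid(256, 192)
for name, dim in EMBED_DIM.items():
    print(f"{name}: {rows}x{cols} grid = {num_tokens} tokens of dimension {dim}")
```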

Step 3: Transformer Encoder (The Backbone)

Each of N transformer layers applies the standard two-step update:

$$F'_{i+1} = F_i + \mathrm{MHSA}(\mathrm{LN}(F_i))$$
$$F_{i+1} = F'_{i+1} + \mathrm{FFN}(\mathrm{LN}(F'_{i+1}))$$

That's it. No cross-attention. No multi-resolution branches. No feature pyramid. Just self-attention and feed-forward, repeated N times. The spatial resolution stays constant throughout — every transformer layer operates on the same 16×12 grid.

Data flow: Input [256, 192, 3] → patch embed [16, 12, C] → 12/24/32 transformer layers → Fout [16, 12, C]. The backbone outputs features at stride 16 (1/16 resolution). Every token has attended to every other token for N layers, so every spatial position "knows" about the whole image.

Step 4: Decoder (Heatmap Prediction)

The features Fout need to be upsampled and converted to K=17 keypoint heatmaps. The classic decoder:

  1. Fout [16, 12, C]: reshape to 2D spatial map
  2. Deconv block 1: 4×4 deconv, stride 2 → BN → ReLU → [32, 24, 256]
  3. Deconv block 2: 4×4 deconv, stride 2 → BN → ReLU → [64, 48, 256]
  4. 1×1 conv → K [64, 48, 17] heatmaps (stride 4 resolution)
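The shapes in the steps above follow from the standard transposed-convolution size formula. The sketch below traces them without any deep learning library; it assumes padding 1, no output padding, and dilation 1 (the padding value is not stated in the steps, but it is what makes each block exactly double the resolution).

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one axis.

    Assumes dilation 1 and no output padding.
    """
    return (size - 1) * stride - 2 * padding + kernel


def classic_decoder_shapes(height=256, width=192, patch=16, keypoints=17):
    """Trace [H, W, channels] through backbone output, two deconvs, and the 1x1 conv."""
    h, w = height // patch, width // patch
    shapes = [("backbone", (h, w, "C"))]
    for block in (1, 2):
        h, w = deconv_out(h), deconv_out(w)
        shapes.append((f"deconv{block}", (h, w, 256)))
    shapes.append(("heatmaps", (h, w, keypoints)))
    return shapes


for stage, shape in classic_decoder_shapes():
    print(stage, shape)
```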

Each heatmap is a 64×48 spatial probability map for one keypoint. The predicted keypoint location is the argmax of each heatmap (with sub-pixel refinement via UDP post-processing).
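To make the argmax step concrete, here is a deliberately simplified decoder for a single heatmap. It is not the UDP post-processing mentioned above: it only finds the peak and scales it by the stride of 4 back to input pixels, ignoring sub-pixel refinement. The function name is illustrative.

```python
def decode_keypoint(heatmap, stride=4):
    """Return (x, y, score) of the heatmap peak in input-image pixels.

    heatmap is a list of rows, each a list of floats. No sub-pixel
    refinement is applied; this is plain argmax times the stride.
    """
    best_score, best_x, best_y = float("-inf"), 0, 0
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_score:
                best_score, best_x, best_y = value, x, y
    return best_x * stride, best_y * stride, best_score


# A 64x48 heatmap that is zero everywhere except one peak.
heatmap = [[0.0] * 48 for _ in range(64)]
heatmap[20][30] = 0.9
print(decode_keypoint(heatmap))
```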

Step 5: Training

Ground-truth heatmaps are generated by placing a 2D Gaussian at each annotated keypoint location. The loss is MSE between predicted and ground-truth heatmaps:

$$L = \mathrm{MSE}(K_{pred}, K_{gt}) = \frac{1}{K} \sum_{k=1}^{17} \left\| K_{pred}^{(k)} - K_{gt}^{(k)} \right\|^2$$
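As an illustration of the target construction and loss, the sketch below builds one Gaussian ground-truth heatmap and compares predictions against it. The value `sigma=2.0` is an assumption (the text does not state it), the Gaussian is left unnormalized so the peak equals 1, and this `mse` averages over pixels of a single heatmap rather than over keypoints.

```python
import math


def gaussian_heatmap(cx, cy, height=64, width=48, sigma=2.0):
    """Unnormalized 2D Gaussian centered at heatmap coordinates (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def mse(pred, target):
    """Mean squared error over all pixels of one heatmap."""
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / (len(pred) * len(pred[0]))


target = gaussian_heatmap(30, 20)
near_miss = gaussian_heatmap(31, 20)   # prediction off by one heatmap pixel
far_miss = gaussian_heatmap(40, 20)    # prediction off by ten heatmap pixels
print(mse(target, target), mse(near_miss, target), mse(far_miss, target))
```

The loss is zero for a perfect prediction and grows as the predicted peak moves away from the annotated location.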

Backbone is initialized with MAE pre-trained weights. Training uses AdamW with learning rate 5e-4, layer-wise learning rate decay, and stochastic drop path. 210 epochs total, with LR decay at epochs 170 and 200.
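The schedule described above can be sketched in a few lines. Two values here are assumptions, since the text does not give them: the step factor `gamma=0.1` and the layer-wise decay rate `decay=0.75`. The layer indexing is also a simplified convention (last layer gets the full rate), not necessarily the one used in the original training code.

```python
def step_lr(epoch, base_lr=5e-4, milestones=(170, 200), gamma=0.1):
    """Learning rate at a given epoch with step decay at each milestone.

    gamma is an assumed decay factor.
    """
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops


def layerwise_lr(layer_index, num_layers, lr, decay=0.75):
    """Layer-wise LR decay: earlier layers get smaller rates.

    layer_index is 0-based; the last layer (num_layers - 1) keeps the full lr.
    decay is an assumed rate.
    """
    return lr * decay ** (num_layers - 1 - layer_index)


for epoch in (0, 169, 170, 200, 209):
    print(epoch, step_lr(epoch))
print([round(layerwise_lr(i, 12, 1.0), 3) for i in (0, 6, 11)])
```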

What spatial resolution do ViTPose features maintain throughout the entire transformer backbone?

Chapter 3: The Two Decoders

The decoder comparison is ViTPose's most important experiment. It reveals what kind of features ViTs learn versus CNNs. Let's look at both decoders in detail and understand why they tell such different stories for different backbones.

Classic Decoder

# Classic decoder: 2 deconv blocks + prediction
import torch.nn as nn


class ClassicDecoder(nn.Module):
    def __init__(self, in_channels, num_keypoints=17):
        super().__init__()  # must run before submodules are registered
        # Block 1: learned 2x upsampling (stride 16 -> stride 8)
        self.deconv1 = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU()
        )
        # Block 2: learned 2x upsampling (stride 8 -> stride 4)
        self.deconv2 = nn.Sequential(
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU()
        )
        # 1x1 conv: one output channel per keypoint heatmap
        self.predict = nn.Conv2d(256, num_keypoints, 1)

    def forward(self, x):
        # x: [B, C, H/16, W/16]
        x = self.deconv1(x)   # [B, 256, H/8, W/8]
        x = self.deconv2(x)   # [B, 256, H/4, W/4]
        return self.predict(x)  # [B, num_keypoints, H/4, W/4]

Simple Decoder

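# Simple decoder: ReLU + bilinear upsample + conv
import torch.nn as nn
import torch.nn.functional as F


class SimpleDecoder(nn.Module):
    def __init__(self, in_channels, num_keypoints=17):
        super().__init__()  # must run before submodules are registered
        self.predict = nn.Conv2d(in_channels, num_keypoints, 3, padding=1)

    def forward(self, x):
        # x: [B, C, H/16, W/16]
        x = F.relu(x)  # ReLU comes first, matching K = Conv(Bilinear(ReLU(Fout)))
        x = F.interpolate(x, scale_factor=4, mode='bilinear', align_corners=False)  # [B, C, H/4, W/4]
        return self.predict(x)  # [B, num_keypoints, H/4, W/4]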

The simple decoder has essentially one learnable layer — a 3×3 convolution that maps C channels to 17 keypoint heatmaps. The bilinear upsample is parameter-free. This is as close to "linear probing" as a decoder can get.
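To put a number on "one learnable layer", the sketch below counts parameters for both decoders exactly as they are written in the snippets above (PyTorch default biases included). Real implementations may drop biases or use other settings, so treat the totals as illustrative; the helper names are hypothetical.

```python
def conv_params(c_in, c_out, kernel, bias=True):
    """Parameters of a (transposed) 2D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def bn_params(channels):
    """Learnable scale and shift of a BatchNorm layer."""
    return 2 * channels


def classic_decoder_params(c_in, keypoints=17, mid=256):
    return (
        conv_params(c_in, mid, 4) + bn_params(mid)    # deconv block 1
        + conv_params(mid, mid, 4) + bn_params(mid)   # deconv block 2
        + conv_params(mid, keypoints, 1)              # 1x1 prediction conv
    )


def simple_decoder_params(c_in, keypoints=17):
    return conv_params(c_in, keypoints, 3)            # single 3x3 conv


for name, dim in {"ViT-B": 768, "ViT-L": 1024, "ViT-H": 1280}.items():
    print(name, classic_decoder_params(dim), simple_decoder_params(dim))
```

For C = 768 (ViT-B) this gives 4,200,209 parameters for the classic decoder versus 117,521 for the simple one, roughly a 36× difference.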

Why CNNs Collapse With the Simple Decoder

ResNet-152 drops from 73.5 to 55.3 AP (−18.2). Why?

CNN features at the final layer are high-level and abstract but spatially coarse. They encode "there's a person here" but not "the left wrist is at pixel (134, 87)." The classic decoder's two deconv blocks with learnable parameters perform non-trivial spatial refinement — they learn to "unpack" the abstract features back into precise spatial locations. Without this learned upsampling, you just get blurry bilinear interpolation of features that were never designed to be spatially precise.

Why ViTs Don't Need It

ViT features are fundamentally different. Through N layers of global self-attention, every token encodes information about every other token's position. The features at each spatial location already contain precise, globally-informed positional information. Bilinear upsampling of these features preserves this precision because the information is encoded in the channel dimensions, not just the spatial layout.

Another way to see it: In a CNN, spatial information is stored spatially — the position of a feature activation IS its meaning. In a ViT, spatial information is stored in the embedding dimensions — each token knows where it is and where everything else is, regardless of its position in the grid. That's why you can upsample ViT features with bilinear interpolation and lose almost nothing — you're only interpolating the spatial grid, while the rich positional information in the embeddings is preserved.

This is also why AP50 (AP at the loose OKS threshold of 0.50) shows almost no difference between decoders for ViTs (90.7 vs 90.6 for ViT-B). The keypoints are already in roughly the right place, so the decoder only matters for fine-grained localization.

Why do CNNs lose 18 AP with the simple decoder while ViTs lose only 0.3 AP?

Chapter 4: Scalability — The Pareto Front

Because ViTPose's architecture is so simple, scaling it is trivial: just use a bigger ViT. No need to redesign multi-resolution branches, adjust cross-attention modules, or rebalance feature pyramid weights. You literally swap the backbone and retrain.

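| Model | Backbone | Layers | Dim | Params | Speed (fps) | AP |
|---|---|---|---|---|---|---|
| ViTPose-B | ViT-B | 12 | 768 | 86M | 944 | 75.8 |
| ViTPose-L | ViT-L | 24 | 1024 | 307M | 411 | 78.3 |
| ViTPose-H | ViT-H | 32 | 1280 | 632M | 241 | 79.1 |
| ViTPose-G | ViTAE-G | | | ~1B | | 80.9* |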

* ViTPose-G uses larger input (576×432), multi-dataset training, and a better detector. Shown for completeness.

Three observations from this table:

1. Consistent gains across scale. B→L gives +2.5 AP and L→H gives +0.8 AP. The increments shrink, but accuracy has not plateaued: bigger models keep improving. For comparison, going from ResNet-50 to ResNet-152 gives only +1.7 AP, and larger ResNets show diminishing returns.

2. Speed remains competitive. ViTPose-H runs at 241 fps on an A100 with batch size 64. That's faster than HRFormer-B (158 fps) despite having 15× more parameters. Why? Because ViTPose has a single-branch architecture operating at 1/16 resolution, while HRFormer maintains four parallel branches at 1/4 resolution. Fewer branches, lower resolution → better hardware utilization.

3. The Pareto front shifts. No prior method matches ViTPose at any throughput level. At 944 fps, ViTPose-B (75.8 AP) beats HRNet-W48 (75.1 AP at 649 fps) — both better accuracy AND faster. At 241 fps, ViTPose-H (79.1 AP) crushes HRFormer-B (75.6 AP at 158 fps).
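The Pareto claim can be checked mechanically. The sketch below uses the single-dataset (throughput, AP) pairs from this lesson's scaling table and COCO val comparison, and keeps only models that no other model beats on both axes. The function name and dictionary are illustrative.

```python
# (throughput in fps, AP on COCO val), single-dataset training only.
MODELS = {
    "SimpleBaseline R-152": (829, 73.5),
    "HRNet-W48 256x192": (649, 75.1),
    "HRNet-W48 384x288": (309, 76.3),
    "TokenPose-L/D24": (602, 75.8),
    "TransPose-H/A6": (309, 75.8),
    "HRFormer-B 384x288": (78, 77.2),
    "ViTPose-B": (944, 75.8),
    "ViTPose-L": (411, 78.3),
    "ViTPose-H": (241, 79.1),
}


def pareto_front(models):
    """Names of models not dominated in both speed and accuracy, fastest first."""
    front = []
    for name, (fps, ap) in models.items():
        dominated = any(
            f >= fps and a >= ap and (f > fps or a > ap)
            for other, (f, a) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: -models[n][0])


print(pareto_front(MODELS))
```

With these numbers the front consists of the three ViTPose models only; every other entry is matched or beaten on both speed and AP by one of them.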

Why plain ViTs are hardware-friendly: Modern GPUs are optimized for large, uniform matrix multiplications. A plain ViT is essentially a sequence of identical matmuls (QKV projections and FFNs) with no branching or irregular data movement. HRNet's four parallel branches with repeated cross-resolution fusion create irregular memory access patterns that underutilize GPU tensor cores. Simplicity isn't just elegant — it's fast.

What Scaling Actually Changes

When you go from ViT-B to ViT-H, you're increasing depth (12 → 32 layers), width (768 → 1280 channels per token), and parameter count (86M → 632M).

Each additional layer lets tokens refine their understanding of the image. For pose estimation, this matters most for occluded keypoints: a wrist hidden behind a torso requires deep reasoning about body structure. More layers = better reasoning = more accurate localization of hard keypoints.

ViTPose-H has 15× more parameters than HRFormer-B but runs 50% faster. Why?

Chapter 5: Flexibility

ViTPose demonstrates flexibility along four axes. Each one challenges a conventional assumption about how pose estimation models should be trained.

Axis 1: Pre-training Data

The standard recipe: pre-train the backbone on ImageNet-1K (1.3M images), then finetune on COCO for pose. But what if you don't have (or don't want to use) ImageNet?

| Pre-training Data | Volume (images) | AP |
|---|---|---|
| ImageNet-1K | 1.3M | 75.8 |
| COCO only (cropped persons) | 150K | 74.5 |
| COCO + AI Challenger (cropped) | 500K | 75.8 |
| COCO + AI Challenger (no crop) | 300K | 75.8 |

With COCO + AI Challenger, ViTPose matches ImageNet pre-training using less than half the data (300K to 500K images versus 1.3M). Even COCO alone (150K images, roughly 9× fewer than ImageNet) only drops 1.3 AP. MAE pre-training learns useful representations from whatever images you give it, and in this comparison domain-specific data is more data-efficient than generic ImageNet images.

What this means in practice: If you're building a pose estimator for a specific domain (medical imaging, sports, industrial), you can pre-train on your own unlabeled data without ever touching ImageNet. This eliminates a data dependency that has been assumed necessary since 2012.

Axis 2: Input Resolution

| Input Size | Tokens | AP |
|---|---|---|
| 224 × 224 | 196 | 74.9 |
| 256 × 192 | 192 | 75.8 |
| 384 × 288 | 432 | 76.9 |
| 576 × 432 | 972 | 77.8 |

Performance scales smoothly with resolution. An interesting detail: 256×256 (256 tokens) gives the same 75.8 AP as 256×192 (192 tokens). Why? The average person instance in COCO has roughly a 4:3 height-to-width aspect ratio, so the 256×192 input already fits it tightly, and the extra columns of a square input mostly cover background padding.
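The padding argument can be quantified with a small geometric sketch. It assumes an idealized person box with exactly a 4:3 height-to-width ratio that is scaled to fit the network input without distortion; `box_coverage` is an illustrative helper, not part of any library.

```python
def box_coverage(box_h, box_w, in_h, in_w):
    """Fraction of the input covered by a box scaled to fit, aspect ratio kept."""
    scale = min(in_h / box_h, in_w / box_w)
    return (box_h * scale) * (box_w * scale) / (in_h * in_w)


# Idealized 4:3 (height:width) person box.
print(box_coverage(4, 3, 256, 192))  # rectangular input
print(box_coverage(4, 3, 256, 256))  # square input
```

Under this assumption the box fills the whole 256×192 input but only three quarters of the 256×256 one, so a quarter of the square input's tokens would see padding.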

Axis 3: Attention Type

Full self-attention on high-resolution features (1/8 instead of 1/16 by using stride-8 patch embedding) gives the best results but requires 36GB memory even with FP16. ViTPose explores alternatives:

| Attention Type | Memory | AP |
|---|---|---|
| Full attention (1/8) | 36.1 GB | 77.4 |
| Window (8×8) | 21.2 GB | 66.4 |
| Window + Shift | 21.2 GB | 76.4 |
| Window + Pool | 22.9 GB | 76.4 |
| Window + Shift + Pool | 22.9 GB | 76.8 |
| Window (16×12) + Shift + Pool | 26.8 GB | 77.1 |

Pure window attention is catastrophic (−11 AP) because no cross-window communication means the model can't reason about distant keypoints. But adding shift-window (from Swin) or pooling-window restores most of the performance at 40% less memory. The two mechanisms are complementary — combined, they approach full attention quality (76.8 vs 77.4) at drastically reduced cost.
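The trade-off above can be illustrated by counting query-key pairs. The sketch assumes a 256×192 input at stride 8 (a 32×24 token grid); the text does not state the input size for this ablation, so that is an assumption. It also shows why pure window attention isolates distant keypoints: tokens in different windows never form a pair.

```python
def full_attention_pairs(rows, cols):
    """Number of query-key pairs when every token attends to every token."""
    tokens = rows * cols
    return tokens * tokens


def window_attention_pairs(rows, cols, win_h, win_w):
    """Number of query-key pairs with non-overlapping windows."""
    if rows % win_h or cols % win_w:
        raise ValueError("grid must be divisible by the window size")
    windows = (rows // win_h) * (cols // win_w)
    return windows * (win_h * win_w) ** 2


def same_window(a, b, win_h, win_w):
    """True if (row, col) positions a and b fall into the same window."""
    return a[0] // win_h == b[0] // win_h and a[1] // win_w == b[1] // win_w


rows, cols = 256 // 8, 192 // 8
print(full_attention_pairs(rows, cols))
print(window_attention_pairs(rows, cols, 8, 8))
print(same_window((0, 0), (31, 23), 8, 8))  # head-region vs. foot-region token
```

The pair count drops by a factor of 12 here, while measured memory in the table drops far less, since attention scores are only one part of the memory footprint.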

Axis 4: Partial Finetuning

| Finetuning Strategy | AP | Δ |
|---|---|---|
| Full finetuning (all parameters) | 75.8 | |
| Freeze MHSA, train FFN + decoder | 75.1 | −0.7 |
| Freeze FFN, train MHSA + decoder | 72.8 | −3.0 |

MHSA is task-agnostic, FFN is task-specific. Freezing the attention modules costs only 0.7 AP: the self-attention patterns learned during MAE pre-training (token similarity, spatial relationships) transfer directly to pose estimation. But freezing FFN costs 3.0 AP, because the feed-forward layers need to be adapted for keypoint-specific feature transformation. This finding reveals the division of labor inside a ViT: attention routes information, FFN transforms it.
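Partial finetuning comes down to choosing which parameters receive gradients. The sketch below does the selection on parameter names only. The names are hypothetical, loosely modeled on common ViT implementations where attention lives under `attn` and the feed-forward network under `mlp`; in PyTorch the excluded parameters would typically have `requires_grad` set to `False`.

```python
def trainable_parameters(names, freeze):
    """Return the parameter names that stay trainable.

    freeze is the module tag to freeze, e.g. "attn" (MHSA) or "mlp" (FFN).
    """
    marker = f".{freeze}."
    return [name for name in names if marker not in name]


# Hypothetical parameter names for a one-block model plus decoder.
NAMES = [
    "patch_embed.proj.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.attn.proj.weight",
    "blocks.0.mlp.fc1.weight",
    "blocks.0.mlp.fc2.weight",
    "decoder.predict.weight",
]

print(trainable_parameters(NAMES, "attn"))  # freeze MHSA, train the rest
print(trainable_parameters(NAMES, "mlp"))   # freeze FFN, train the rest
```

Whether the patch embedding is also frozen in the experiment above is not stated here, so this sketch leaves it trainable.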

Freezing MHSA costs only 0.7 AP but freezing FFN costs 3.0 AP. What does this tell us about the roles of these modules?

Chapter 6: Multi-Dataset Training

Here's a practical problem: you have multiple pose datasets (COCO, AI Challenger, MPII), each with different keypoint definitions, different numbers of keypoints, and different annotation styles. Most methods train separate models for each dataset. Can we train one model on all of them?

The Architecture Trick

ViTPose's decoder is so lightweight that adding extra decoders for additional datasets is nearly free. The strategy:

  1. Sample batch: randomly mix images from COCO, AIC, MPII
  2. Forward all through shared ViT backbone → Fout
  3. Route each image to its dataset-specific decoder
  4. Compute per-dataset losses, sum, backpropagate
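The routing in the steps above can be sketched without a deep learning framework. The example groups the samples of a mixed batch by dataset and sums one mean loss per dataset. The per-sample losses are placeholder numbers, the function names are illustrative, and the keypoint counts for AIC (14) and MPII (16) are stated from general knowledge of those datasets rather than from this lesson.

```python
from collections import defaultdict

# Output channels of each dataset-specific decoder (keypoints per dataset).
HEADS = {"coco": 17, "aic": 14, "mpii": 16}


def route_batch(dataset_labels):
    """Map each dataset name to the batch indices that belong to it."""
    groups = defaultdict(list)
    for index, label in enumerate(dataset_labels):
        if label not in HEADS:
            raise KeyError(f"no decoder registered for dataset {label!r}")
        groups[label].append(index)
    return dict(groups)


def total_loss(per_sample_losses, dataset_labels):
    """Sum of per-dataset mean losses, one term per dataset in the batch."""
    groups = route_batch(dataset_labels)
    return sum(
        sum(per_sample_losses[i] for i in indices) / len(indices)
        for indices in groups.values()
    )


labels = ["coco", "aic", "coco", "mpii"]
print(route_batch(labels))
print(total_loss([1.0, 2.0, 3.0, 4.0], labels))
```

Because only the small decoders differ per dataset, the backbone forward pass is shared by the whole mixed batch.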

The Results

| Training Data | AP on COCO val | Δ |
|---|---|---|
| COCO only | 75.8 | |
| COCO + AI Challenger | 77.0 | +1.2 |
| COCO + AIC + MPII | 77.1 | +1.3 |

Adding AI Challenger (which has 350K+ labeled instances) gives a big +1.2 AP boost. Adding MPII (only 40K instances) gives another +0.1 AP despite being much smaller. The shared backbone learns better features when trained on diverse data, even though the target evaluation is COCO-only.

Why this works: Different datasets capture different poses, environments, and body configurations. AI Challenger includes many Asian athletes in competitive sports. MPII contains diverse everyday activities. The backbone learns a more general understanding of human bodies by seeing all this variation, even though the decoders are dataset-specific.

The extra computational cost is minimal: the three decoders together add less than 2% to the total FLOPs because the backbone dominates the compute. With multi-dataset training, ViTPose-H reaches 79.5 AP on COCO val — a +0.4 AP improvement for almost no extra cost.

Cross-Dataset Transfer Without Finetuning

A remarkable detail: after multi-dataset training, ViTPose is evaluated directly on each dataset's val set without any dataset-specific finetuning. On OCHuman (a benchmark of heavily occluded people), ViTPose-G achieves 92.8 AP, which is 18.7 AP above the previous state-of-the-art (MIPNet at 74.1 AP). The plain ViT backbone handles occlusion through global self-attention, without any occlusion-specific modules.

Why is multi-dataset training nearly free in terms of compute for ViTPose?

Chapter 7: Knowledge Distillation — The Knowledge Token

ViTPose-H is great, but at 632M parameters it's too large for some deployment scenarios. Can we transfer the knowledge of the large model into a smaller one? Standard knowledge distillation (KD) works, but ViTPose introduces a clever addition: the knowledge token.

Standard Output Distillation

The baseline approach: train the student to match the teacher's heatmap outputs.

$$L_{od} = \mathrm{MSE}(K_{student}, K_{teacher})$$

This alone gives +0.2 AP when transferring from ViTPose-L (teacher) to ViTPose-B (student): 75.8 → 76.0. Modest but consistent.

The Knowledge Token Trick

Here is the novel idea. We add a single learnable token t to the teacher's input, alongside the visual tokens. Then:

  1. Take the well-trained teacher (ViTPose-L). Freeze ALL its weights.
  2. Append a random learnable token t to the patch tokens: input = {t; X}
  3. Train ONLY the token t for a few epochs to minimize MSE(Teacher({t; X}), Kgt)
  4. Freeze the optimized t*. Append it to the student's patch tokens during training.
  5. Train the student normally: L = MSE(Student({t*; X}), Kgt)

The key equation for optimizing the knowledge token:

$$t^* = \arg\min_t \mathrm{MSE}(T(\{t; X\}), K_{gt})$$

where T is the frozen teacher and Kgt is the ground-truth heatmaps.
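The optimization pattern (update only an extra input while all model weights stay frozen) can be shown with a deliberately tiny toy. Everything here is an assumption made for illustration: the "teacher" is a scalar linear function rather than a transformer, the token is a single number, and the gradient is written out by hand instead of using autograd.

```python
def teacher(token, x, w_x=2.0, w_t=0.5):
    """Frozen toy teacher: w_x and w_t are fixed and never updated."""
    return w_x * x + w_t * token


def token_mse(token, data):
    """Mean squared error of the teacher's outputs for a given token."""
    return sum((teacher(token, x) - y) ** 2 for x, y in data) / len(data)


def fit_knowledge_token(data, steps=200, lr=0.5, w_t=0.5):
    """Gradient descent on the token only; the teacher's weights are untouched."""
    token = 0.0
    for _ in range(steps):
        # d/d(token) of the MSE, derived by hand for this linear toy.
        grad = sum(
            2.0 * (teacher(token, x, w_t=w_t) - y) * w_t for x, y in data
        ) / len(data)
        token -= lr * grad
    return token


# Targets carry a constant offset of 1.0 that the frozen weights cannot express.
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
best = fit_knowledge_token(data)
print(best, token_mse(0.0, data), token_mse(best, data))
```

In this toy the token converges to 2.0, the value that lets the frozen teacher absorb the offset, and the error falls from 1.0 to essentially zero. The real knowledge token plays the analogous role through self-attention rather than a linear term.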

What Does the Knowledge Token Encode?

This is the fascinating part. The token t* is a single vector (dimension 1024 for ViT-L) that, when added to the teacher's input, modulates its attention to improve predictions. Through self-attention, every visual token can attend to t*, effectively receiving a "hint" that biases the teacher toward more accurate outputs.

When this same token is given to the student, it provides a similar hint. Think of it as a compressed summary of the teacher's expertise — a single token that captures what the teacher "wishes" it could tell the student about how to process human body images.

Why this is different from a class token: The CLS token in ViT learns to aggregate global information for classification. The knowledge token t* is optimized to improve the teacher's predictions — it encodes task-specific priors about human pose that the student model couldn't learn on its own from its smaller capacity.

Results: Combining Both Methods

| Method | Teacher | Student AP | Δ |
|---|---|---|---|
| Baseline (no distillation) | | 75.8 | |
| Output distillation only | ViTPose-L | 76.0 | +0.2 |
| Knowledge token only | ViTPose-L | 76.3 | +0.5 |
| Output + Knowledge token | ViTPose-L | 76.6 | +0.8 |

The knowledge token alone (+0.5 AP) is more effective than output distillation alone (+0.2 AP). Combined, they give +0.8 AP with negligible extra memory (one additional token = one extra row in the attention matrix). The two methods are complementary: output distillation aligns predictions, while the knowledge token provides structural guidance.

How is the knowledge token t* trained?

Chapter 8: Experiments

COCO Val Set: The Main Comparison

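| Model | Backbone | Resolution | Speed (fps) | AP |
|---|---|---|---|---|
| SimpleBaseline | ResNet-152 | 256×192 | 829 | 73.5 |
| HRNet | HRNet-W48 | 256×192 | 649 | 75.1 |
| HRNet | HRNet-W48 | 384×288 | 309 | 76.3 |
| TokenPose-L/D24 | HRNet-W48 | 256×192 | 602 | 75.8 |
| TransPose-H/A6 | HRNet-W48 | 256×192 | 309 | 75.8 |
| HRFormer-B | HRFormer-B | 384×288 | 78 | 77.2 |
| ViTPose-B | ViT-B | 256×192 | 944 | 75.8 |
| ViTPose-B* | ViT-B | 256×192 | 944 | 77.1 |
| ViTPose-L* | ViT-L | 256×192 | 411 | 78.7 |
| ViTPose-H* | ViT-H | 256×192 | 241 | 79.5 |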

* = multi-dataset training

Key observations:

  1. ViTPose-B matches TokenPose and TransPose at the same AP (75.8) while being 1.5–3× faster. Those methods use HRNet-W48 + elaborate transformer modules; ViTPose uses a plain ViT + two deconv layers.
  2. ViTPose-B* (77.1) approaches HRFormer-B (77.2) with 12× higher throughput (944 vs 78 fps). Multi-dataset training closes the gap for free at inference time.
  3. ViTPose-H* (79.5 AP) sets a new state-of-the-art among methods evaluated on the val set with single-model inference.

COCO Test-Dev: The Ultimate Benchmark

Using ViTAE-G (1B parameters), 576×432 resolution, multi-dataset training, and a stronger person detector:

| Method | AP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|
| UDP++ (17-model ensemble, 2020 COCO winner) | 80.8 | 94.9 | 88.1 | 77.4 | 85.7 |
| ViTPose (single model) | 80.9 | 94.8 | 88.1 | 77.5 | 85.9 |
| ViTPose+ (3-model ensemble) | 81.1 | 95.0 | 88.2 | 77.8 | 86.0 |

A single ViTPose model beats a 17-model ensemble. UDP++ won the 2020 COCO Keypoint Challenge by ensembling 17 models. ViTPose-G surpasses it with a single model (80.9 vs 80.8 AP). This is the clearest demonstration that scaling a simple architecture outperforms ensembling complex ones.

OCHuman: Extreme Occlusion

OCHuman is one of the hardest pose benchmarks, full of heavily overlapping, occluded people. Prior methods top out around 74 AP. ViTPose-G hits 92.8 AP, an improvement of nearly 19 points. Why? Global self-attention naturally handles occlusion. When a wrist is behind a torso, the wrist token can attend to visible body parts (head, other hand, feet) to infer the hidden location. No occlusion-specific module needed.

How does a single ViTPose model compare to the 2020 COCO Keypoint Challenge winner (UDP++)?

Chapter 9: Connections

Where ViTPose Fits

ViTPose sits at the intersection of two trends:

Key Equations Cheat Sheet

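| Concept | Formula | What It Means |
|---|---|---|
| Patch embedding | $F_0 = \mathrm{PatchEmbed}(x) \in \mathbb{R}^{\frac{H}{d} \times \frac{W}{d} \times C}$ | Image → token grid (d=16 default) |
| Transformer layer | $F'_{i+1} = F_i + \mathrm{MHSA}(\mathrm{LN}(F_i))$, then $F_{i+1} = F'_{i+1} + \mathrm{FFN}(\mathrm{LN}(F'_{i+1}))$ | Self-attention + feed-forward, each with a residual |
| Classic decoder | $K = \mathrm{Conv}(\mathrm{Deconv}(\mathrm{Deconv}(F_{out})))$ | 2× upsample twice + predict 17 heatmaps |
| Simple decoder | $K = \mathrm{Conv}(\mathrm{Bilinear}(\mathrm{ReLU}(F_{out})))$ | 4× bilinear upsample + predict |
| Training loss | $L = \mathrm{MSE}(K_{pred}, K_{gt})$ | Match predicted heatmaps to Gaussian GT |
| Knowledge token | $t^* = \arg\min_t \mathrm{MSE}(T(\{t; X\}), K_{gt})$ | Optimize token against frozen teacher |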

What the Paper Doesn't Say

The broader lesson: ViTPose is less about pose estimation and more about the power of not designing. When your backbone is strong enough (thanks to scale + self-supervised pre-training), the best architecture is the simplest one. This principle has since been validated across object detection (ViTDet), segmentation (SAM), and dense prediction generally.

What is a key dependency that makes ViTPose's "simplicity" possible?