High-resolution human-centric vision transformers (0.4B-5B) pretrained on 1 billion human images with unified MAE + contrastive learning. SOTA on pose, segmentation, normals, pointmaps, and albedo.
You want to build a single vision backbone that handles every task involving humans: estimating where their 308 body, face, and hand keypoints are; segmenting every body part down to individual fingers; predicting the 3D surface normal at each pixel; reconstructing the full 3D shape as a pointmap; and recovering the intrinsic skin and clothing color under arbitrary lighting.
Each of these tasks has a different relationship with image information. Pose estimation cares about semantics — recognizing that a blob of pixels is an elbow, regardless of texture or color. Surface normals care about low-level detail — the subtle shading gradients that reveal curvature. Albedo needs appearance fidelity — the actual pixel colors, stripped of lighting effects.
This creates a fundamental tension in how you pretrain your backbone.
Masked Image Modeling (MIM) — exemplified by MAE — masks 75% of an image's patches and trains the model to reconstruct the missing pixels. This forces the encoder to learn fine-grained spatial relationships: texture patterns, color distributions, edge orientations. MAE-pretrained models excel at dense prediction tasks (segmentation, normals) but struggle at semantic understanding (is this a hand or a foot?).
Contrastive Learning (CL) — exemplified by DINO and DINOv2 — trains the model to produce similar embeddings for different augmented views of the same image. This creates representations rich in high-level semantics (identity, pose, action). But aggressive augmentations (random crops, color jitter) deliberately destroy low-level appearance cues. DINO features are great for classification but throw away the texture and color details that normals and albedo require.
The original Sapiens (v1) used MAE-only pretraining. It captured pixel-level structure well, but its semantic understanding was limited — especially for tasks like pose estimation that need the model to "know" which body part it is looking at.
Sapiens2's insight is to unify MAE and contrastive learning in a single pretraining objective — but in a very specific way that avoids the pitfalls of prior hybrids.
The combined loss is deceptively simple: a weighted sum of the pixel-space MAE reconstruction loss and the contrastive loss.
But the devil is in the details. Prior hybrids like iBOT and DINOv2 also combined MIM with contrastive learning. They failed for human-centric tasks because they operated entirely in latent space — predicting cluster tokens or teacher embeddings instead of actual pixels. This, combined with aggressive color and geometric augmentations, caused representation drift: the features gradually lost the appearance information (color, texture) that tasks like albedo estimation and surface normals depend on.
Anchor 1 — MAE in pixel space: The reconstruction loss operates on actual pixel values, not latent embeddings. This forces the encoder to preserve fine-grained appearance information at every layer. You cannot predict exact pixel colors if your features have drifted away from the low-level signal.
Anchor 2 — No color augmentation on global views: The contrastive loss uses a student-teacher framework with multiple augmented views. But critically, Sapiens2 does NOT apply color augmentation (jitter, grayscale) to the global views used for MAE reconstruction. This preserves the color distribution that is essential for albedo and segmentation. Local crops still get augmented for robustness.
The practical result: a single pretrained backbone where early-layer features capture texture and color (useful for normals, albedo), middle-layer features capture spatial structure (useful for segmentation), and deep-layer features capture semantics (useful for pose). All five downstream tasks can tap the same encoder at different depths.
Self-supervised pretraining is only as good as the data it trains on. Sapiens2 introduces Humans-1B: one billion high-quality human images curated from a pool of four billion web images. No task labels. No bounding boxes. No keypoint annotations. Just humans, in diverse settings, at high resolution.
Starting from 4B web images, the pipeline applies six sequential filters.
The result: 1 billion images spanning a wide range of ethnicities, age groups, body types, clothing styles, activities, and environments. This is the largest human-centric pretraining dataset ever assembled, more than 3x the size of Sapiens v1's 300M images. And crucially: no task-specific labels and no human-specific priors are injected during pretraining. The model learns human structure purely from self-supervised reconstruction and view matching.
This is the heart of Sapiens2. Every image passes through two parallel objectives simultaneously: pixel-level reconstruction and cross-view contrastive matching.
Given an input image x, the system generates V augmented views {x1, x2, ..., xV}. Two are "global" views (large crops, 224x224) and the rest are "local" views (smaller crops, 96x96). For each view, 75% of patches are randomly masked — the standard MAE ratio.
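The masking step can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the helper name `random_patch_mask` is hypothetical, and the 16x16 patch size for a 224x224 view (giving 14 x 14 = 196 patches) is an assumption.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return a boolean array of length num_patches; True marks a masked patch."""
    rng = rng if rng is not None else np.random.default_rng()
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

# Assumed setup: a 224x224 global view with 16x16 patches -> 196 patches.
rng = np.random.default_rng(0)
mask = random_patch_mask(196, rng=rng)
tokens = rng.normal(size=(196, 8))   # toy patch embeddings
visible = tokens[~mask]              # only these would enter the encoder
print(mask.sum(), visible.shape)
```

With a 75% ratio, 147 of the 196 patches are masked and 49 remain visible, which is where the encoder's compute savings come from.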
For each view, only the visible 25% of patches enter the encoder. After encoding, mask tokens are scattered back to their original positions. A lightweight decoder reconstructs the full image. The loss is MSE over masked patches only:
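A form consistent with the symbol definitions that follow, assuming the squared error is averaged over the masked patches of each view and then over the V views (this normalization is an assumption, not something the text confirms):

$$
\mathcal{L}_{\text{MAE}} = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|M_v|} \sum_{p \in M_v} \left\lVert \hat{x}_p - \tilde{x}_p \right\rVert_2^2
$$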
Where Mv is the set of masked patches in view v, x̃p is the true pixel patch, and x̂p is the predicted reconstruction. Computing loss only on masked patches forces the encoder to infer missing structure from visible context — not just copy nearby pixels.
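As a rough sketch of "loss on masked patches only", the snippet below computes a per-patch mean squared error and averages it over the masked positions. The function name `masked_mse` and the per-patch averaging are illustrative assumptions.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only.

    pred, target: (num_patches, patch_dim) arrays of pixel values.
    mask: (num_patches,) boolean array, True where the patch was masked.
    """
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()

target = np.zeros((4, 3))
pred = np.array([[1.0, 1.0, 1.0],    # masked, error 1
                 [0.0, 0.0, 0.0],    # masked, error 0
                 [5.0, 5.0, 5.0],    # visible, ignored
                 [9.0, 9.0, 9.0]])   # visible, ignored
mask = np.array([True, True, False, False])
loss = masked_mse(pred, target, mask)   # (1 + 0) / 2 = 0.5
```

The large errors on the two visible patches do not affect the result, so the model gets no credit for copying pixels it was already given.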
The [CLS] token from each view is extracted and projected through a small head into K logits, then softmaxed into a probability distribution. A student network processes the views; an EMA teacher (τ = 0.992) provides targets. The contrastive loss is cross-entropy between student and teacher distributions across different views:
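A form consistent with the definitions that follow, assuming the cross-entropy is averaged over the pairs in S and summed over the K output dimensions (the averaging is an assumption):

$$
\mathcal{L}_{\text{CL}} = -\frac{1}{|S|} \sum_{(i,j) \in S} \sum_{k=1}^{K} q_j^{(k)} \log p_i^{(k)}
$$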
Where S is the set of cross-view pairs (student view i matched to teacher view j, i ≠ j), pi is the student's softmax distribution, and qj is the teacher's (sharpened) distribution. This forces the encoder to produce [CLS] representations that are invariant to viewpoint and crop — capturing what is in the image, not where.
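The cross-view objective and the EMA teacher update can be sketched as follows. This is a simplified illustration under stated assumptions: the temperatures (0.1 for the student, 0.04 for the teacher) are typical DINO-style values rather than values from the text, the teacher is assumed to see only the global views, teacher centering is omitted, and τ is read as the EMA momentum.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_view_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Average cross-entropy H(q_j, p_i) over all view pairs with i != j.

    student_logits: (V, K) for all views; teacher_logits: (G, K) for the
    first G (global) views. A lower teacher temperature sharpens q.
    """
    p = softmax(student_logits, t_student)
    q = softmax(teacher_logits, t_teacher)
    total, pairs = 0.0, 0
    for i in range(p.shape[0]):
        for j in range(q.shape[0]):
            if i == j:
                continue                     # never match a view to itself
            total += -(q[j] * np.log(p[i] + 1e-12)).sum()
            pairs += 1
    return total / pairs

def ema_update(teacher, student, tau=0.992):
    """Teacher weights track the student as an exponential moving average."""
    return tau * teacher + (1.0 - tau) * student

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(2, 8))
uniform_loss = cross_view_loss(np.zeros((4, 8)), teacher_logits)  # student is uniform
new_teacher = ema_update(np.array([1.0]), np.array([0.0]))
```

When the student outputs a uniform distribution, the loss equals log K whatever the teacher predicts, which is a handy sanity check for an implementation.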
Here is where Sapiens2 departs from DINOv2. Global views receive geometric augmentations (random crop, flip) but NO color augmentations (no jitter, no grayscale, no solarize). Local views receive both. Why? Because the MAE branch reconstructs pixels from global views. If you jitter the colors of the global view, the MAE loss forces the encoder to learn color-jittered representations — destroying the color fidelity needed for albedo and segmentation.
The total loss combines both branches:
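Written out, one plausible form is the following; which of the two terms λ scales is an assumption, since only its balancing role is stated:

$$
\mathcal{L} = \mathcal{L}_{\text{MAE}} + \lambda \, \mathcal{L}_{\text{CL}}
$$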
Where λ balances the two objectives. In practice, the gradients from both losses flow through the same encoder, creating features that simultaneously encode pixel-level detail (for reconstruction) and semantic content (for view matching).
Sapiens2 is a plain Vision Transformer with several modern upgrades. The family spans four sizes, from 0.4B to 5.1B parameters.
| Model | Params | Hidden | Layers | Heads | FLOPs |
|---|---|---|---|---|---|
| Sapiens2-0.4B | 0.4B | 1024 | 24 | 16 | 1.26T |
| Sapiens2-0.8B | 0.8B | 1280 | 32 | 16 | 2.59T |
| Sapiens2-1B | 1.5B | 1536 | 40 | 24 | 4.72T |
| Sapiens2-5B | 5.1B | 2432 | 56 | 32 | 15.7T |
Grouped Query Attention (GQA): The middle layers use GQA — sharing key/value heads across multiple query heads. This reduces memory and compute while maintaining representation quality. GQA is applied to the middle third of layers where attention patterns are most redundant.
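A minimal NumPy sketch of the key/value sharing in GQA is shown below. The head counts (16 query heads, 4 key/value heads) are illustrative assumptions, not the configuration of any Sapiens2 model.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (Hq, N, D); k, v: (Hkv, N, D), with Hq divisible by Hkv."""
    hq, n, d = q.shape
    hkv = k.shape[0]
    assert hq % hkv == 0
    group = hq // hkv
    # Each key/value head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 10, 8))   # 16 query heads
k = rng.normal(size=(4, 10, 8))    # 4 shared key heads
v = rng.normal(size=(4, 10, 8))    # 4 shared value heads
out = grouped_query_attention(q, k, v)
kv_reduction = q.shape[0] // k.shape[0]   # key/value tensors are 4x smaller
```

The output still has one slice per query head, so downstream layers are unchanged; only the key/value projections and their activations shrink.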
SwiGLU FFN: The standard MLP (Linear → GELU → Linear) is replaced with SwiGLU (Linear × Swish(Linear) → Linear). This gated activation consistently outperforms GELU across model sizes. Because SwiGLU uses three weight matrices instead of two, the FFN hidden dimension is set to 8/3 × hidden_dim so that its parameter count matches that of a conventional MLP with a 4 × hidden_dim expansion.
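The sketch below shows a bias-free SwiGLU block and checks the parameter-matching arithmetic. The 4x expansion of the baseline MLP is the usual ViT convention and is assumed here; the weight names are hypothetical.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated FFN: (Swish(x W_gate) * (x W_up)) W_down, biases omitted."""
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

# Parameter count comparison for hidden_dim d = 1024.
d = 1024
mlp_params = 2 * d * (4 * d)        # two matrices, 4x expansion
hidden = int(8 * d / 3)             # 2730
swiglu_params = 3 * d * hidden      # gate, up, and down matrices

# Toy forward pass.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6))
y = swiglu_ffn(x, rng.normal(size=(6, 16)), rng.normal(size=(6, 16)),
               rng.normal(size=(16, 6)))
```

For d = 1024 the two counts are 8,388,608 and 8,386,560, a difference of about 0.02% that comes only from rounding 8/3 × d down to an integer.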
QK-Norm: Query and key projections are normalized before computing attention scores. This prevents attention logit explosion in deep networks (56 layers for the 5B model), stabilizing training without learning rate warmup tricks.
RMSNorm: Layer normalization is replaced with RMSNorm throughout. RMSNorm drops the mean-centering step and rescales activations by their root mean square only. This is ~10-15% faster than LayerNorm with equivalent quality.
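QK-Norm and RMSNorm can be illustrated together. The sketch below assumes QK-Norm is implemented by applying RMS normalization (without a learned gain) to queries and keys; the text does not say which normalization is used, so treat this as one possible variant.

```python
import numpy as np

def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root mean square over the last axis; no mean subtraction."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    y = x / rms
    return y if gain is None else y * gain

def qk_norm_scores(q, k):
    """Attention logits from normalized queries and keys. q, k: (N, D)."""
    d = q.shape[-1]
    return rms_norm(q) @ rms_norm(k).T / np.sqrt(d)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16)) * 1000.0   # deliberately huge activations
k = rng.normal(size=(5, 16)) * 1000.0
scores = qk_norm_scores(q, k)
raw = q @ k.T / np.sqrt(16)
```

After normalization each vector has norm close to sqrt(D), so every logit is bounded by sqrt(D) in magnitude (4 here) however large the raw activations become, whereas the unnormalized logits grow with the square of the activation scale.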
PixelShuffle decoder: The MAE decoder upsamples using PixelShuffle rather than transposed convolutions. PixelShuffle rearranges channel dimensions into spatial dimensions, producing sharper reconstructions without checkerboard artifacts.
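The channel-to-space rearrangement can be written as a reshape and a transpose. This sketch follows the commonly used layout in which output pixel (h·r + i, w·r + j) of channel c is read from input channel c·r² + i·r + j; the function is an illustration, not the model's decoder.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C * r * r, H, W) into (C, H * r, W * r)."""
    c_in, h, w = x.shape
    assert c_in % (r * r) == 0
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)    # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

x = np.arange(4 * 2 * 2).reshape(4, 2, 2)   # 4 channels on a 2x2 grid
y = pixel_shuffle(x, 2)                     # 1 channel on a 4x4 grid
print(y[0, :2, :2])
```

Each 2x2 output block is filled from the four input channels at one spatial location, so no output pixel receives overlapping kernel contributions, which is the usual explanation for the absence of checkerboard artifacts.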
FLOPs grow faster than parameters due to increased sequence length at higher resolutions.
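Human-centric tasks demand high resolution. A 4K image (3840x2160) with 16x16 patches produces a 240x135 grid, or 32,400 tokens. Standard self-attention scales as O(n²): that is roughly one billion query-key pairs per head, and on the order of a trillion multiply-adds per layer once the feature dimension is included. That is completely impractical, even at inference time.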
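The token arithmetic is easy to verify directly; the helper name below is hypothetical.

```python
def num_tokens(width, height, patch=16):
    """Number of non-overlapping patch tokens for an image."""
    return (width // patch) * (height // patch)

n = num_tokens(3840, 2160)   # 240 * 135
pairs = n * n                # query-key pairs per head for full attention
print(n, pairs)
```

This prints 32400 and 1049760000, so full attention at 4K compares about a billion token pairs per head before the feature dimension is even counted.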
Sapiens2 solves this with a two-stage attention scheme that processes local structure first, then global context.
The first K layers use windowed self-attention. The token grid is partitioned into non-overlapping spatial windows (e.g., 14x14 patches each). Each window attends only within itself. This is O(n · w²) instead of O(n²), where w is the window size. These layers learn local structure: texture patterns, edge orientations, color gradients — information that is inherently local.
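Window partitioning amounts to a reshape of the token grid. The sketch below assumes the grid dimensions are exact multiples of the window size; a real 240x135 grid would need padding, which is not shown.

```python
import numpy as np

def window_partition(tokens, window):
    """Split an (H, W, D) token grid into (num_windows, window * window, D)."""
    h, w, d = tokens.shape
    assert h % window == 0 and w % window == 0
    x = tokens.reshape(h // window, window, w // window, window, d)
    x = x.transpose(0, 2, 1, 3, 4)          # (rows, cols, window, window, d)
    return x.reshape(-1, window * window, d)

grid = np.arange(28 * 28 * 4, dtype=float).reshape(28, 28, 4)
windows = window_partition(grid, 14)        # 4 windows of 196 tokens each

# Attention cost per layer, counted in query-key pairs.
n = grid.shape[0] * grid.shape[1]
local_pairs = n * 14 * 14                   # each token attends within its window
global_pairs = n * n                        # each token attends to every token
```

Attention is then applied independently to each row of `windows`, so the cost grows linearly with the number of tokens instead of quadratically.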
After the local stage, a pooling operation downsamples the token grid by stride ω, which reduces the sequence length by a factor of ω². For example, with ω=2 and 32,000 input tokens, pooling produces 8,000 tokens. The [CLS] token guides the pooling to preserve the most informative spatial positions.
The remaining L layers operate on the downsampled token sequence with full global self-attention. Now at 8,000 tokens (post-pooling), global attention is feasible. These layers learn long-range relationships: the left hand relates to the right shoulder; the person's pose informs which way their torso faces; the background context helps resolve ambiguous body configurations.
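The effect of pooling on sequence length and global attention cost can be sketched as below. Plain average pooling is used as a stand-in: the [CLS]-guided pooling is not specified in enough detail here to reproduce, so this is an assumption made only to show the shapes and the arithmetic.

```python
import numpy as np

def avg_pool_tokens(grid, stride):
    """Average-pool an (H, W, D) token grid by `stride` along both axes."""
    h, w, d = grid.shape
    assert h % stride == 0 and w % stride == 0
    x = grid.reshape(h // stride, stride, w // stride, stride, d)
    return x.mean(axis=(1, 3))

grid = np.random.default_rng(0).normal(size=(8, 8, 4))
pooled = avg_pool_tokens(grid, stride=2)     # 64 tokens -> 16 tokens

n_before = 32_000                            # token count used in the text
n_after = n_before // 2 ** 2                 # stride 2 -> 8_000 tokens
cost_ratio = n_before ** 2 / n_after ** 2    # quadratic cost shrinks by stride ** 4
```

With stride 2 the sequence is 4x shorter, and because global attention is quadratic in sequence length, its cost drops by a factor of 16.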
The decoder receives the global features, upsamples back to the local resolution, and outputs at 2K resolution (half the input). For tasks requiring the full 4K, a simple bilinear upsampling is applied at post-training time.
After pretraining, the backbone is frozen. Five lightweight task-specific heads are trained independently, each converting the universal features into a specific output format. This is the "foundation model" paradigm: one backbone, many tasks.
Predict 2D coordinates for 308 keypoints: 25 body, 40 hands (20 per hand), and 243 face landmarks. The head is a simple deconvolution stack that produces per-keypoint heatmaps. Loss: MSE with Online Hard Example Mining (OHEM) — the loss focuses on the hardest 30% of keypoints per image (occluded joints, ambiguous poses), preventing the model from coasting on easy visible keypoints.
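A minimal sketch of OHEM over keypoint heatmaps follows. It assumes "hardest" means highest per-keypoint MSE within a single image and that the kept losses are simply averaged; the function name is hypothetical.

```python
import numpy as np

def ohem_keypoint_mse(pred, target, keep_ratio=0.3):
    """Average the heatmap MSE over the hardest `keep_ratio` of keypoints.

    pred, target: (K, H, W) heatmaps for one image.
    """
    per_keypoint = ((pred - target) ** 2).mean(axis=(1, 2))   # (K,)
    k = max(1, int(round(len(per_keypoint) * keep_ratio)))
    hardest = np.sort(per_keypoint)[-k:]                      # k largest losses
    return hardest.mean()

# Toy case: 10 keypoints whose per-keypoint losses are exactly 0, 1, ..., 9.
target = np.zeros((10, 2, 2))
pred = np.sqrt(np.arange(10.0))[:, None, None] * np.ones((10, 2, 2))
loss = ohem_keypoint_mse(pred, target)      # mean of 7, 8, 9
```

In the toy case only the three largest losses contribute, so the seven easy keypoints produce no gradient for this image.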
Classify each pixel into one of 29 body parts (head, torso, upper arm left, forearm left, hand left, etc.). The head is a standard segmentation decoder. Loss: cross-entropy + Dice loss. Dice loss optimizes a per-class region-overlap score closely related to IoU, which keeps class imbalance from dominating (tiny parts like fingers get equal weight to large parts like torso).
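The class-balancing behavior of Dice loss shows up in a small example. This is a generic soft Dice formulation averaged over classes, assumed for illustration rather than taken from the paper.

```python
import numpy as np

def soft_dice_loss(probs, onehot, eps=1e-6):
    """1 - mean per-class Dice score. probs, onehot: (C, H, W)."""
    inter = (probs * onehot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2))
    dice = (2.0 * inter + eps) / (denom + eps)
    return 1.0 - dice.mean()

# Two classes on a 4x4 image: class 1 is a single pixel (a "finger").
labels = np.zeros((4, 4), dtype=int)
labels[0, 0] = 1
onehot = np.stack([labels == 0, labels == 1]).astype(float)

perfect = soft_dice_loss(onehot, onehot)

# Predicting background everywhere is 93.75% pixel-accurate...
all_background = np.stack([np.ones((4, 4)), np.zeros((4, 4))])
missed_small_part = soft_dice_loss(all_background, onehot)
```

Missing the one-pixel class costs about 0.52 in Dice loss even though pixel accuracy is 15/16, because each class contributes equally to the average.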
Predict the 3D position (X, Y, Z) of each pixel in a focal-normalized coordinate frame. This produces a dense 3D reconstruction of the person from a single image. Loss: L2 + gradient loss. The gradient loss penalizes differences in spatial derivatives — ensuring smooth surfaces without discontinuities at patch boundaries.
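A gradient loss compares finite differences of the prediction and the target. The sketch below uses an L1 penalty on forward differences, which is an assumption; the exact norm and difference operator are not stated.

```python
import numpy as np

def spatial_gradients(x):
    """Forward differences along rows and columns of an (H, W, C) map."""
    dy = x[1:, :, :] - x[:-1, :, :]
    dx = x[:, 1:, :] - x[:, :-1, :]
    return dy, dx

def gradient_loss(pred, target):
    pdy, pdx = spatial_gradients(pred)
    tdy, tdx = spatial_gradients(target)
    return np.abs(pdy - tdy).mean() + np.abs(pdx - tdx).mean()

rng = np.random.default_rng(0)
target = rng.normal(size=(6, 6, 3))          # toy pointmap (X, Y, Z per pixel)
shifted = target + 0.5                       # same shape, constant offset
l2 = ((shifted - target) ** 2).mean()        # 0.25
grad = gradient_loss(shifted, target)        # effectively 0
```

A constant offset is invisible to the gradient term but penalized by L2, while a seam between neighboring patches is the opposite case, which is why the two terms are combined.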
Predict the surface normal direction (3D unit vector) at each pixel. Normals encode local surface curvature: flat surfaces have uniform normals; wrinkles, folds, and muscle definition create rapidly varying normals. Loss: (1 - cos θ) + L2 + gradient loss. The cosine term directly optimizes angular accuracy; L2 ensures magnitude stability; gradient loss preserves sharp creases.
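The cosine term can be sketched as follows; normalizing the prediction inside the loss is an implementation assumption.

```python
import numpy as np

def normal_angular_loss(pred, target, eps=1e-8):
    """Mean of (1 - cos theta) between predicted and true normals, (H, W, 3)."""
    p = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cos = (p * t).sum(axis=-1)
    return (1.0 - cos).mean()

target = np.zeros((4, 4, 3))
target[..., 2] = 1.0                         # normals facing the camera
orthogonal = np.zeros((4, 4, 3))
orthogonal[..., 1] = 1.0                     # 90 degrees off everywhere

same = normal_angular_loss(target, target)          # ~0
right_angle = normal_angular_loss(orthogonal, target)   # ~1
scaled = normal_angular_loss(3.0 * target, target)      # ~0: magnitude ignored
```

Because the cosine term ignores vector length, a prediction three times too long still scores zero here, which is the gap the L2 term in the combined loss covers.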
Predict the intrinsic diffuse color at each pixel — what the surface looks like under uniform white light, stripping away shadows, specular highlights, and ambient lighting. Loss: L2 + gradient + mean-color alignment. The mean-color alignment term prevents the model from predicting a globally shifted color palette (e.g., everything too warm or too cool).
How do you measure whether a pretrained backbone has actually learned universal human representations? You probe it: freeze the backbone, attach a minimal linear head, and evaluate on each task. If the backbone is good, even a linear probe should perform well — because the features already encode the right information.
For each of the five tasks (pose, segmentation, normals, pointmap, albedo), the paper trains a single linear layer on top of frozen backbone features. No decoder stack. No skip connections. No fine-tuning. Just a matrix multiply from feature dimension to output dimension. This is the harshest possible test of feature quality.
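A linear probe is a single learned matrix applied to frozen features. The toy sketch below fits that matrix in closed form with least squares on synthetic data, as a stand-in for training the linear layer with gradient descent; the feature and output sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))    # frozen per-pixel backbone features (toy)
true_w = rng.normal(size=(16, 3))
targets = feats @ true_w              # e.g. a 3-channel output, noise-free

# Fit the probe: one matrix from feature dimension to output dimension.
w, *_ = np.linalg.lstsq(feats, targets, rcond=None)
pred = feats @ w
max_err = np.abs(pred - targets).max()
```

The probe recovers the targets exactly here only because the toy targets are linear in the features by construction; on a real backbone, the size of the residual is precisely what the probe measures.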
Sapiens2-5B is compared against the best available pretrained vision backbones, including DINOv3-7B and PE-H.
The results are especially dramatic for tasks that need both semantics and detail. On normals, linear probing of Sapiens2 features achieves angular errors competitive with full fine-tuned models. On segmentation, the linear probe mIoU exceeds many full models. This means the features are so well-organized that a linear separator can carve them into 29 body parts with high accuracy.
In linear probing, Sapiens2 leads on every task despite DINOv3 having 37% more parameters.
Sapiens2 achieves state-of-the-art across all five human-centric tasks, with massive improvements over Sapiens v1 in every category.
| Task | Metric | Sapiens v1 | Sapiens2-5B | Improvement |
|---|---|---|---|---|
| Pose | mAP | 78.3 | 82.3 | +4.0 |
| Segmentation | mIoU | 58.2 | 82.5 | +24.3 |
| Normals | Angular error | 12.3° | 6.73° | -45.6% |
| Pointmap | MPJPE | — | SOTA | Beats VGGT, MoGe, DUSt3R |
| Albedo | MAE / PSNR | — | 0.012 / 32.6 | New task (no v1 baseline) |
The +24.3 mIoU improvement on segmentation is the most dramatic result in the paper. Why such a massive gain? Segmentation requires recognizing which body part each pixel belongs to — a fundamentally semantic task. MAE-only pretraining (v1) learned spatial structure but not semantic categories. Adding contrastive learning teaches the model that an elbow is an elbow regardless of skin color, sleeve length, or viewing angle. The CL branch provides exactly the semantic feature organization that segmentation needs.
The 45.6% reduction in angular error is equally telling. Surface normals depend on subtle shading gradients — low-level appearance cues. But they also need semantic context: the model must know it is looking at a shoulder to correctly predict the curvature, because shoulders have characteristic geometry. MAE gives the shading sensitivity; CL gives the body-part awareness. Neither alone achieves 6.73°.
Across all tasks, performance improves monotonically with model size. The 5B model consistently outperforms the 1B model (1.5B parameters), which outperforms the 0.8B. There is no sign of diminishing returns, suggesting that even larger Sapiens2 models would continue to improve.
MAE (He et al., 2022): The masked autoencoder that Sapiens2 uses as its pixel-space reconstruction objective. MAE showed that masking 75% of patches creates an effective self-supervised objective for ViTs. Sapiens2 inherits this directly but pairs it with contrastive learning to add semantics.
DINO / DINOv2 (Caron et al., 2021; Oquab et al., 2023): The self-distillation framework that Sapiens2 adapts for its CL branch. DINOv2 also combined MIM with contrastive learning (via iBOT), but operated in latent space with aggressive augmentations. Sapiens2 departs by anchoring in pixel space and controlling augmentations.
Sapiens v1 (Khirodkar et al., 2024): The direct predecessor. MAE-only pretraining on 300M human images. Sapiens2 keeps the human-centric focus but upgrades the pretraining (unified loss), dataset (1B images), architecture (GQA, SwiGLU), and resolution (4K hierarchical attention).
iBOT (Zhou et al., 2022): Combined MIM with self-distillation via masked image tokens in latent space. Closest prior hybrid to Sapiens2, but its latent-space MIM + color augmentations caused representation drift for human-centric tasks.
PE-H (Human Pose Encoder): A specialized encoder trained specifically for pose estimation. Sapiens2 outperforms it on pose AND all other tasks, demonstrating the advantage of a general-purpose backbone over task-specific training.
DINOv3-7B: The latest large-scale CL backbone. Despite 37% more parameters, it underperforms Sapiens2-5B on dense probing across all tasks — confirming that domain-specific data (Humans-1B) and pixel-space MAE matter more than raw scale.
Human digitization: Pointmap + normals + albedo from a single image creates a complete 3D human model suitable for relighting, animation, and AR/VR telepresence.
Fitness and health: 308-keypoint pose estimation enables precise body measurement, exercise form analysis, and rehabilitation monitoring.
Fashion and retail: Part segmentation + albedo enables virtual try-on with accurate material appearance.
Vision Transformers — the ViT foundations that Sapiens2 builds on.
DINO — the self-supervised contrastive framework adapted for Sapiens2's CL branch.
Depth Anything V2 — another dense prediction specialist; complementary approach using synthetic data.
Vision Banana — image generators as vision learners; an alternative path to unified representations.