Sapiens 2 — Veanors

Chapter 0: The Problem

You want to build a single vision backbone that handles every task involving humans: estimating where their 308 body, face, and hand keypoints are; segmenting every body part down to individual fingers; predicting the 3D surface normal at each pixel; reconstructing the full 3D shape as a pointmap; and recovering the intrinsic skin and clothing color under arbitrary lighting.

Each of these tasks has a different relationship with image information. Pose estimation cares about semantics — recognizing that a blob of pixels is an elbow, regardless of texture or color. Surface normals care about low-level detail — the subtle shading gradients that reveal curvature. Albedo needs appearance fidelity — the actual pixel colors, stripped of lighting effects.

This creates a fundamental tension in how you pretrain your backbone.

Two pretraining philosophies, two failure modes

Masked Image Modeling (MIM), exemplified by MAE, masks 75% of an image's patches and trains the model to reconstruct the missing pixels. This forces the encoder to learn fine-grained spatial relationships: texture patterns, color distributions, edge orientations. MAE-pretrained models excel at detail-sensitive dense prediction (surface normals, crisp segmentation boundaries) but struggle with semantic understanding (is this a hand or a foot?).

Contrastive Learning (CL) — exemplified by DINO and DINOv2 — trains the model to produce similar embeddings for different augmented views of the same image. This creates representations rich in high-level semantics (identity, pose, action). But aggressive augmentations (random crops, color jitter) deliberately destroy low-level appearance cues. DINO features are great for classification but throw away the texture and color details that normals and albedo require.

The original Sapiens (v1) used MAE-only pretraining. It captured pixel-level structure well, but its semantic understanding was limited — especially for tasks like pose estimation that need the model to "know" which body part it is looking at.

The core tension: Human-centric vision needs BOTH pixel fidelity (for normals, albedo, segmentation boundaries) AND semantic understanding (for pose, part recognition, action). MAE gives you one, contrastive learning gives you the other. Single-objective models had to pick a side. Prior hybrids like iBOT and DINOv2 combined them in latent space, but aggressive augmentations caused "representation drift", eroding the appearance cues that human-centric tasks critically need.

MIM vs CL Tradeoff

Interactive figure: scores for each pretraining method on five human-centric tasks. MAE learns low-level structure (texture, edges), CL learns high-level semantics (identity, pose). Neither alone covers all tasks.

Why does MAE-only pretraining struggle with pose estimation?

Because MAE models are too small
Because MAE learns low-level pixel reconstruction but not the semantic understanding needed to identify which body part a patch belongs to
Because MAE cannot process high-resolution images

Chapter 1: The Key Insight

Sapiens2's insight is to unify MAE and contrastive learning in a single pretraining objective — but in a very specific way that avoids the pitfalls of prior hybrids.

The combined loss is deceptively simple:

L = L_MAE + λ · L_CL

But the devil is in the details. Prior hybrids like iBOT and DINOv2 also combined MIM with contrastive learning. They failed for human-centric tasks because they operated entirely in latent space — predicting cluster tokens or teacher embeddings instead of actual pixels. This, combined with aggressive color and geometric augmentations, caused representation drift: the features gradually lost the appearance information (color, texture) that tasks like albedo estimation and surface normals depend on.

Sapiens2's two anchors

Anchor 1 — MAE in pixel space: The reconstruction loss operates on actual pixel values, not latent embeddings. This forces the encoder to preserve fine-grained appearance information at every layer. You cannot predict exact pixel colors if your features have drifted away from the low-level signal.

Anchor 2 — No color augmentation on global views: The contrastive loss uses a student-teacher framework with multiple augmented views. But critically, Sapiens2 does NOT apply color augmentation (jitter, grayscale) to the global views used for MAE reconstruction. This preserves the color distribution that is essential for albedo and segmentation. Local crops still get augmented for robustness.

The key departure from DINOv2/iBOT: Reconstruct in PIXEL space, not latent space. Pixels are the anchor that prevents representation drift. While CL organizes features semantically (elbow vs. knee), MAE reconstruction keeps those same features grounded in actual appearance (the skin color, the fabric texture, the shadow direction). Neither objective alone provides both. Together, with careful augmentation control, they do.

Prior hybrids

MIM in latent space + CL with aggressive augmentations → representation drift → appearance cues lost

↓

Sapiens2

MAE in PIXEL space + CL with no color aug on global views → pixel anchor prevents drift → both appearance and semantics preserved

The practical result: a single pretrained backbone where early-layer features capture texture and color (useful for normals, albedo), middle-layer features capture spatial structure (useful for segmentation), and deep-layer features capture semantics (useful for pose). All five downstream tasks can tap the same encoder at different depths.

How does Sapiens2 prevent the "representation drift" that plagued prior MIM+CL hybrids?

By reconstructing in PIXEL space (not latent) and avoiding color augmentation on global views, anchoring features to actual appearance while CL organizes them semantically
By using a larger model
By training for more epochs

Chapter 2: Humans-1B

Self-supervised pretraining is only as good as the data it trains on. Sapiens2 introduces Humans-1B: one billion high-quality human images curated from a pool of four billion web images. No task labels. No bounding boxes. No keypoint annotations. Just humans, in diverse settings, at high resolution.

The multi-stage curation pipeline

Starting from 4B web images, the pipeline applies six sequential filters:

Stage 1: Human Detection

Run a person bounding-box detector. Keep only images where at least one human is detected with high confidence. This eliminates landscapes, product photos, and other non-human content.

↓

Stage 2: Head Pose

Estimate head pose for detected humans. Filter out extreme back-of-head or heavily occluded viewpoints. Ensures face and upper-body visibility for learning anatomical structure.

↓

Stage 3: Aesthetic Scoring

Score images for visual quality and realism. Remove low-resolution, heavily compressed, watermarked, or cartoon/anime images. The backbone should learn from photographic quality.

↓

Stage 4: CLIP Features

Extract CLIP embeddings. Score alignment with human-centric text prompts. Remove images where humans are incidental background elements rather than the primary subject.

↓

Stage 5: Text Overlay

Detect and remove images with significant text overlays (memes, ads, stock photo watermarks). OCR-heavy images confuse patch reconstruction during MAE pretraining.

↓

Stage 6: Dedup + Balance

Near-duplicate removal using perceptual hashing, then cluster-balanced sampling: group images by pose, activity, and appearance clusters, then sample uniformly to prevent over-representation of common poses (e.g., standing front-facing).

Why cluster-balanced sampling matters: Without balancing, the dataset would be dominated by front-facing portrait photos (selfies, ID photos, social media). The model would underperform on rare but important poses: side views, crouching, hands above head, lying down. Cluster-balanced sampling ensures the pretraining distribution covers the full space of human configurations.
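As a rough illustration of the balancing idea, the sketch below draws from clusters in round-robin order so that every cluster contributes equally until it runs out. The function name `cluster_balanced_sample`, the toy cluster labels, and the round-robin strategy are assumptions made for this example; the source does not specify how clusters are formed or how the uniform sampling is implemented.

```python
import random
from collections import Counter, defaultdict

def cluster_balanced_sample(items, cluster_of, num_samples, seed=0):
    """Pick num_samples items, cycling over clusters so each contributes equally.

    items: list of image ids. cluster_of: dict mapping id -> cluster label.
    Hypothetical helper: round-robin is one simple way to sample uniformly
    over clusters instead of uniformly over images.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for item in items:
        buckets[cluster_of[item]].append(item)
    for members in buckets.values():
        rng.shuffle(members)              # random order inside each cluster

    labels = sorted(buckets)
    sample = []
    while len(sample) < num_samples and labels:
        for label in list(labels):
            if len(sample) == num_samples:
                break
            sample.append(buckets[label].pop())
            if not buckets[label]:        # cluster exhausted, stop visiting it
                labels.remove(label)
    return sample

# Toy pool: 90 front-facing portraits, 5 crouching, 5 lying down.
pool = [f"img{i}" for i in range(100)]
clusters = {name: ("front" if i < 90 else "crouch" if i < 95 else "lying")
            for i, name in enumerate(pool)}
picked = cluster_balanced_sample(pool, clusters, num_samples=12)
counts = Counter(clusters[name] for name in picked)
```

In this toy pool, random sampling would return mostly front-facing images, while the balanced draw returns four images from each of the three clusters.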

The result: 1 billion images spanning every ethnicity, age group, body type, clothing style, activity, and environment. This is the largest human-centric pretraining dataset ever assembled, more than 3x the size of Sapiens v1's 300M images. And crucially: no task-specific labels and no human-specific priors are injected during pretraining. The model learns human structure purely from self-supervised reconstruction and view matching.

Why does Humans-1B use cluster-balanced sampling instead of random sampling?

Because random sampling is slower
Because without balancing, common poses (like front-facing portraits) would dominate and the model would underperform on rare but important poses like side views, crouching, or hands above head
Because it reduces dataset size

Chapter 3: Pretraining

This is the heart of Sapiens2. Every image passes through two parallel objectives simultaneously: pixel-level reconstruction and cross-view contrastive matching.

Step 1: Generate views and mask

Given an input image x, the system generates V augmented views {x₁, x₂, ..., x_V}. Two are "global" views (large crops, 224x224) and the rest are "local" views (smaller crops, 96x96). For each view, 75% of patches are randomly masked — the standard MAE ratio.
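To make the masking arithmetic concrete, here is a minimal sketch of random patch masking at the 75% ratio. The 16x16 patch size is an assumption at this point in the text (it is the size quoted later for 4K inputs), and `random_mask` is a hypothetical helper name.

```python
import random

def random_mask(image_size, patch_size=16, mask_ratio=0.75, seed=0):
    """Split patch indices of a square image into (masked, visible) lists."""
    grid = image_size // patch_size          # patches per side
    num_patches = grid * grid
    num_masked = round(num_patches * mask_ratio)
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)       # uniform random choice of patches
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return masked, visible

global_masked, global_visible = random_mask(224)   # 14 x 14 = 196 patches
local_masked, local_visible = random_mask(96)      # 6 x 6 = 36 patches
```

Under these assumptions a 224x224 global view keeps 49 of 196 patches visible, and a 96x96 local view keeps 9 of 36.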

Step 2: MAE branch (pixel reconstruction)

For each view, only the visible 25% of patches enter the encoder. After encoding, mask tokens are scattered back to their original positions. A lightweight decoder reconstructs the full image. The loss is MSE over masked patches only:

L_MAE = (1/V) Σ_v (1/|M_v|) Σ_{p ∈ M_v} ||x̃_p − x̂_p||²

Where M_v is the set of masked patches in view v, x̃_p is the true pixel patch, and x̂_p is the predicted reconstruction. Computing loss only on masked patches forces the encoder to infer missing structure from visible context — not just copy nearby pixels.
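The formula can be read off directly as code. The sketch below is a plain-Python rendering of L_MAE under the assumption that each patch is a flat list of pixel values; `mae_loss` and the toy data are illustrative names and numbers, not taken from the paper.

```python
def mae_loss(views):
    """Masked-patch reconstruction loss averaged over views.

    views: list of (target_patches, predicted_patches, masked_indices).
    Each patch is a flat list of pixel values. Visible patches are ignored.
    """
    total = 0.0
    for target, predicted, masked in views:
        view_loss = 0.0
        for p in masked:
            # squared L2 distance between true and reconstructed patch
            view_loss += sum((t - y) ** 2 for t, y in zip(target[p], predicted[p]))
        total += view_loss / len(masked)     # 1/|M_v|
    return total / len(views)                # 1/V

# One toy view: 4 patches of 2 pixels each, patches 1 and 3 masked.
target = [[0, 0], [1, 1], [2, 2], [3, 3]]
predicted = [[9, 9], [1, 2], [9, 9], [3, 1]]   # large errors on visible patches 0 and 2
loss = mae_loss([(target, predicted, [1, 3])])
```

The errors on the visible patches 0 and 2 do not contribute: only patch 1 (squared error 1) and patch 3 (squared error 4) count, giving a loss of 2.5.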

Step 3: CL branch (semantic matching)

The [CLS] token from each view is extracted and projected through a small head into K logits, then softmaxed into a probability distribution. A student network processes the views; an EMA teacher (τ = 0.992) provides targets. The contrastive loss is cross-entropy between student and teacher distributions across different views:

L_CL = (1/|S|) Σ_{(i,j) ∈ S} H(q_j, p_i)

Where S is the set of cross-view pairs (student view i matched to teacher view j, i ≠ j), p_i is the student's softmax distribution, and q_j is the teacher's (sharpened) distribution. This forces the encoder to produce [CLS] representations that are invariant to viewpoint and crop — capturing what is in the image, not where.
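A small sketch of the cross-view loss and the EMA teacher update follows. The temperatures (0.1 for the student, 0.04 for teacher sharpening) are assumptions borrowed from common DINO-style setups, not values stated in the source, and the convention that the teacher sees only the first few views is likewise an assumption. Only τ = 0.992 comes from the text.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(q, p, eps=1e-12):
    """H(q, p) = -sum_k q_k log p_k."""
    return -sum(qk * math.log(pk + eps) for qk, pk in zip(q, p))

def cl_loss(student_logits, teacher_logits, student_temp=0.1, teacher_temp=0.04):
    """Average H(q_j, p_i) over all pairs of different views (i != j).

    student_logits: one K-dim logit vector per view.
    teacher_logits: logit vectors for the views the teacher saw, assumed to be
    the first len(teacher_logits) views (an assumption of this sketch).
    """
    pairs = [(i, j) for j in range(len(teacher_logits))
             for i in range(len(student_logits)) if i != j]
    total = 0.0
    for i, j in pairs:
        p_i = softmax(student_logits[i], student_temp)
        q_j = softmax(teacher_logits[j], teacher_temp)   # low temperature sharpens
        total += cross_entropy(q_j, p_i)
    return total / len(pairs)

def ema_update(teacher_params, student_params, tau=0.992):
    """Teacher weights follow the student as an exponential moving average."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher_params, student_params)]

agree = cl_loss([[5.0, 0.0], [5.0, 0.0]], [[5.0, 0.0], [5.0, 0.0]])
disagree = cl_loss([[0.0, 5.0], [0.0, 5.0]], [[5.0, 0.0], [5.0, 0.0]])
```

When the student's distribution for one view matches the teacher's for another view, the loss is near zero; when they disagree it is large, which is what pushes [CLS] features toward view invariance.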

The critical augmentation detail

Here is where Sapiens2 departs from DINOv2. Global views receive geometric augmentations (random crop, flip) but NO color augmentations (no jitter, no grayscale, no solarize). Local views receive both. Why? Because the MAE branch reconstructs pixels from global views. If you jitter the colors of the global view, the MAE loss forces the encoder to learn color-jittered representations — destroying the color fidelity needed for albedo and segmentation.

This single design choice — no color aug on global views — is what separates Sapiens2 from DINOv2. DINOv2 aggressively color-jitters all views because it operates in latent space where color doesn't matter. Sapiens2 reconstructs actual pixels where color matters critically. The MAE loss acts as a "color anchor" that prevents the contrastive objective from washing out appearance information.
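The augmentation policy described here is simple enough to write down as a lookup. In this sketch the operation names are plain strings rather than calls into any real augmentation library, and the number of local views (`num_local=6`) is an assumed value; the source only says that two views are global and the rest are local.

```python
GEOMETRIC = ("random_crop", "horizontal_flip")
COLOR = ("color_jitter", "grayscale", "solarize")

def augmentations_for(view_kind):
    """Return the augmentation names applied to a 'global' or 'local' view."""
    if view_kind == "global":
        return GEOMETRIC              # no color ops, so pixels stay a faithful target
    if view_kind == "local":
        return GEOMETRIC + COLOR      # local crops get both kinds
    raise ValueError(f"unknown view kind: {view_kind!r}")

def view_plan(num_local=6):
    """Two 224x224 global views plus num_local 96x96 local views."""
    plan = [{"kind": "global", "size": 224, "ops": augmentations_for("global")}
            for _ in range(2)]
    plan += [{"kind": "local", "size": 96, "ops": augmentations_for("local")}
             for _ in range(num_local)]
    return plan

plan = view_plan()
```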

Pretraining Pipeline

Interactive figure comparing "MAE Only" (Sapiens v1), "CL Only" (DINOv2), and "Combined" (Sapiens2): how each method processes an image and what features it learns. The PCA visualization shows learned features: texture-only vs semantic-only vs both.

The total loss combines both branches:

L = L_MAE + λ · L_CL

Where λ balances the two objectives. In practice, the gradients from both losses flow through the same encoder, creating features that simultaneously encode pixel-level detail (for reconstruction) and semantic content (for view matching).

Why does Sapiens2 avoid color augmentation on global views?

To reduce training time
Because the MAE branch reconstructs actual pixels from global views, so color jitter would force the encoder to learn distorted color representations, destroying the appearance fidelity needed for albedo and segmentation
Because color augmentation causes training instability

Chapter 4: Architecture

Sapiens2 is a plain Vision Transformer with several modern upgrades. The family spans four sizes, from 0.4B to 5.1B parameters.

Model family

Model	Params	Hidden	Layers	Heads	FLOPs
Sapiens2-0.4B	0.4B	1024	24	16	1.26T
Sapiens2-0.8B	0.8B	1280	32	16	2.59T
Sapiens2-1B	1.5B	1536	40	24	4.72T
Sapiens2-5B	5.1B	2432	56	32	15.7T

Architecture upgrades over Sapiens v1

Grouped Query Attention (GQA): The middle layers use GQA — sharing key/value heads across multiple query heads. This reduces memory and compute while maintaining representation quality. GQA is applied to the middle third of layers where attention patterns are most redundant.

SwiGLU FFN: The standard MLP (Linear → GELU → Linear) is replaced with SwiGLU (Linear × Swish(Linear) → Linear). This gated activation consistently outperforms GELU across model sizes. The FFN hidden dimension is set to 8/3 × hidden_dim to match parameter count.

QK-Norm: Query and key projections are normalized before computing attention scores. This prevents attention logit explosion in deep networks (56 layers for the 5B model), stabilizing training without learning rate warmup tricks.

RMSNorm: Layer normalization is replaced with RMSNorm throughout. RMSNorm drops the mean-centering step and only rescales each feature vector by its root mean square. This is ~10-15% faster than LayerNorm with equivalent quality.

PixelShuffle decoder: The MAE decoder upsamples using PixelShuffle rather than transposed convolutions. PixelShuffle rearranges channel dimensions into spatial dimensions, producing sharper reconstructions without checkerboard artifacts.

Why these specific upgrades? Each addresses a scaling bottleneck. GQA shrinks the key/value projections and activations, which eases memory pressure when 4K inputs produce tens of thousands of tokens (the attention score matrix itself still scales as O(n²)). SwiGLU improves parameter efficiency so deeper models learn faster. QK-Norm prevents training instability in 56-layer networks. RMSNorm saves wall-clock time at every layer. PixelShuffle improves reconstruction quality without extra parameters. Together, they enable the 5B model to train stably at 4K resolution.
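Two of these upgrades are easy to check numerically. The sketch below shows why an FFN width of 8/3 × hidden_dim keeps SwiGLU at the same weight count as a standard 4x MLP, and what RMSNorm computes. Biases are ignored, the helper names are made up for this example, and the learnable gain in `rms_norm` defaults to ones.

```python
import math

def mlp_params(d, expansion=4):
    """Standard FFN: Linear(d -> 4d) then Linear(4d -> d), biases ignored."""
    return 2 * d * (expansion * d)

def swiglu_params(d):
    """SwiGLU FFN: gate, value, and output projections at width 8d/3."""
    hidden = 8 * d // 3
    return 3 * d * hidden

def swish(z):
    return z / (1.0 + math.exp(-z))           # z * sigmoid(z)

def swiglu_gate(value, gate):
    """Elementwise gating: value * swish(gate), applied before the output Linear."""
    return [v * swish(g) for v, g in zip(value, gate)]

def rms_norm(x, gain=None, eps=1e-6):
    """Rescale x by its root mean square. No mean is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

d = 1536                                       # hidden size of the 1.5B-parameter model
same_budget = mlp_params(d) == swiglu_params(d)
normed = rms_norm([3.0, 4.0])
constant = rms_norm([1.0, 1.0])                # stays near [1, 1]
```

For d = 1536 the 8/3 rule gives a width of exactly 4096 and both FFN variants hold 18,874,368 weights. The `constant` example shows the difference from LayerNorm, which would map a constant vector to zeros after centering.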

Model Family Scaling

Model sizes compared by parameter count and FLOPs. In the table above, FLOPs grow roughly in step with parameters: both increase about 12x from the 0.4B to the 5B model.

Why is GQA applied specifically to the middle layers of Sapiens2?

Because the middle layers have the most redundant attention patterns, so sharing key/value heads across query heads saves memory and compute without losing quality
Because middle layers are the slowest
Because GQA only works in middle layers

Chapter 5: 4K Hierarchical Attention

Human-centric tasks demand high resolution. A 4K image (3840x2160) with 16x16 patches produces 240 × 135 = 32,400 tokens. Standard self-attention scales as O(n²): that is about 1 billion query-key pairs per layer, and on the order of a trillion multiply-adds once the hidden dimension is factored in. That is completely impractical, even at inference time.
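The token count is quick to verify. This sketch only counts patches and query-key pairs; it makes no claim about actual FLOPs, which also depend on the hidden size and head layout.

```python
def token_count(width, height, patch=16):
    """Number of non-overlapping patch tokens for an image."""
    return (width // patch) * (height // patch)

n = token_count(3840, 2160)     # 240 columns x 135 rows
pairs = n * n                   # query-key pairs in one dense attention layer
```

Here `n` is 32,400 and `pairs` is 1,049,760,000, so each dense attention layer would score roughly a billion token pairs.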

Sapiens2 solves this with a two-stage attention scheme that processes local structure first, then global context.

Stage 1: Windowed self-attention (first K layers)

The first K layers use windowed self-attention. The token grid is partitioned into non-overlapping spatial windows (e.g., 14x14 patches each). Each window attends only within itself. This is O(n · w²) instead of O(n²), where w is the window side length in patches, so each token attends to the w² tokens in its own window. These layers learn local structure: texture patterns, edge orientations, color gradients. That information is inherently local.
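A sketch of the window bookkeeping and the resulting cost follows. It assumes a 240x135 token grid (4K input, 16x16 patches) and 14x14 windows; since 240 and 135 are not multiples of 14, a real implementation would need padding or uneven edge windows, which this example does not model. The function names are hypothetical.

```python
def window_id(row, col, grid_w, window=14):
    """Index of the non-overlapping window that contains token (row, col)."""
    windows_per_row = -(-grid_w // window)     # ceiling division
    return (row // window) * windows_per_row + (col // window)

def windowed_pairs(n_tokens, window=14):
    """Each token attends to the window*window tokens in its own window."""
    return n_tokens * window * window

def global_pairs(n_tokens):
    return n_tokens ** 2

n = 240 * 135                                  # 32,400 tokens
local_cost = windowed_pairs(n)                 # n * w^2
dense_cost = global_pairs(n)                   # n^2
savings = dense_cost / local_cost              # equals n / w^2
```

With these numbers the windowed layers score about 6.35 million pairs instead of about 1.05 billion, a reduction of roughly 165x.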

CLS-guided pooling

After the local stage, a pooling operation downsamples the token grid by stride ω. This reduces the sequence length by a factor of ω². For example, with ω=2 and 32,000 input tokens, pooling produces 8,000 tokens. The [CLS] token guides the pooling to preserve the most informative spatial positions.
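The text does not spell out how the [CLS] token guides the pooling, so the following is only one plausible reading: inside each ω×ω block, tokens are averaged with softmax weights given by their dot product with [CLS]. The function `cls_guided_pool` and this weighting scheme are assumptions for illustration, not the paper's operator.

```python
import math

def cls_guided_pool(tokens, cls, stride=2):
    """Downsample an H x W grid of feature vectors by `stride` per axis.

    Hypothetical scheme: each stride x stride block becomes one token, a
    weighted average where weights are softmax(token . cls).
    """
    height, width = len(tokens), len(tokens[0])
    pooled = []
    for r in range(0, height - height % stride, stride):
        row = []
        for c in range(0, width - width % stride, stride):
            block = [tokens[r + dr][c + dc]
                     for dr in range(stride) for dc in range(stride)]
            scores = [sum(a * b for a, b in zip(tok, cls)) for tok in block]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            norm = sum(weights)
            row.append([sum(w * tok[k] for w, tok in zip(weights, block)) / norm
                        for k in range(len(cls))])
        pooled.append(row)
    return pooled

# 4 x 6 grid of 2-dim tokens; a zero [CLS] vector reduces to plain averaging.
grid = [[[0.0, 0.0] for _ in range(6)] for _ in range(4)]
grid[1][1] = [4.0, 0.0]
out = cls_guided_pool(grid, cls=[0.0, 0.0], stride=2)
```

Whatever the weighting, the shape arithmetic matches the text: stride 2 turns 24 tokens into 6, a reduction by ω² = 4.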

Stage 2: Global self-attention (remaining L layers)

The remaining L layers operate on the downsampled token sequence with full global self-attention. Now at 8,000 tokens (post-pooling), global attention is feasible. These layers learn long-range relationships: the left hand relates to the right shoulder; the person's pose informs which way their torso faces; the background context helps resolve ambiguous body configurations.

Why this hierarchy works for MAE: Masking is applied AFTER the local windowed stage. Since each local window is self-contained, masked patches in one window don't "leak" information from another. At the global stage, the already-downsampled tokens interact freely. This means MAE reconstruction uses local-window context for texture and global context for structure — exactly the right inductive bias for humans.

The decoder receives the global features, upsamples back to the local resolution, and outputs at 2K resolution (half the input). For tasks requiring the full 4K, a simple bilinear upsampling is applied at post-training time.

Hierarchical Attention

Interactive figure: the first K layers use local windows (highlighted boxes). After pooling, the remaining L layers use global attention (all-to-all connections).

Why is masking applied after the local windowed attention stage rather than before?

To speed up training
Because local windows are self-contained, so masking after them ensures no information leakage across windows, preserving the MAE reconstruction challenge at the global stage
Because the model cannot handle masks during local attention

Chapter 6: Post-Training

After pretraining, the backbone is frozen. Five lightweight task-specific heads are trained independently, each converting the universal features into a specific output format. This is the "foundation model" paradigm: one backbone, many tasks.

Task 1: Pose estimation (308 keypoints)

Predict 2D coordinates for 308 keypoints: 25 body, 40 hands (20 per hand), and 243 face landmarks. The head is a simple deconvolution stack that produces per-keypoint heatmaps. Loss: MSE with Online Hard Example Mining (OHEM) — the loss focuses on the hardest 30% of keypoints per image (occluded joints, ambiguous poses), preventing the model from coasting on easy visible keypoints.
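A minimal sketch of heatmap MSE with hard-example mining, assuming each keypoint heatmap is a flat list of values and that "hardest 30%" means the keypoints with the largest per-keypoint MSE in the image. The name `ohem_mse` and the toy data are illustrative.

```python
def ohem_mse(pred_heatmaps, target_heatmaps, keep_ratio=0.3):
    """Mean of the per-keypoint MSE over the hardest keep_ratio of keypoints."""
    per_keypoint = []
    for pred, target in zip(pred_heatmaps, target_heatmaps):
        mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
        per_keypoint.append(mse)
    k = max(1, round(len(per_keypoint) * keep_ratio))
    hardest = sorted(per_keypoint, reverse=True)[:k]   # largest losses first
    return sum(hardest) / k

# Ten toy keypoints with one-pixel heatmaps and errors 0, 1, ..., 9.
preds = [[float(i)] for i in range(10)]
targets = [[0.0] for _ in range(10)]
loss = ohem_mse(preds, targets)     # keeps the 3 hardest: errors 9, 8, 7
```

With 308 keypoints and a 30% ratio, this rule would keep the 92 hardest keypoints per image.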

Task 2: Part segmentation (29 classes)

Classify each pixel into one of 29 body parts (head, torso, upper arm left, forearm left, hand left, etc.). The head is a standard segmentation decoder. Loss: cross-entropy + Dice loss. Dice loss optimizes a region-overlap score closely related to IoU and is averaged per class, so class imbalance does not dominate (tiny parts like fingers get equal weight to large parts like torso).
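The class-balancing effect of Dice is easiest to see in code. This is a generic soft Dice loss averaged over classes, written for flat per-class probability lists; the smoothing constant `eps` and the per-class averaging are common choices assumed here rather than details given in the source.

```python
def soft_dice_loss(probs, onehot, eps=1e-6):
    """1 - mean Dice over classes.

    probs[c][i]: predicted probability of class c at pixel i.
    onehot[c][i]: 1 if pixel i belongs to class c, else 0.
    """
    dice_scores = []
    for p_c, y_c in zip(probs, onehot):
        intersection = sum(p * y for p, y in zip(p_c, y_c))
        denominator = sum(p_c) + sum(y_c)
        dice_scores.append((2 * intersection + eps) / (denominator + eps))
    # every class counts once, no matter how many pixels it covers
    return 1.0 - sum(dice_scores) / len(dice_scores)

# Two classes over 6 pixels: a large "torso" region and a one-pixel "finger".
labels = [[1, 1, 1, 1, 1, 0],
          [0, 0, 0, 0, 0, 1]]
perfect = soft_dice_loss(labels, labels)
missed_finger = soft_dice_loss([[1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0]], labels)
```

Missing the single finger pixel costs about half of the maximum loss here, even though it is one pixel out of six, because the finger class carries the same weight as the torso class.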

Task 3: Pointmap (per-pixel XYZ)

Predict the 3D position (X, Y, Z) of each pixel in a focal-normalized coordinate frame. This produces a dense 3D reconstruction of the person from a single image. Loss: L2 + gradient loss. The gradient loss penalizes differences in spatial derivatives — ensuring smooth surfaces without discontinuities at patch boundaries.
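The gradient term can be sketched with forward differences on a single channel. The choice of forward differences and a squared penalty is an assumption; the source only says that differences in spatial derivatives are penalized.

```python
def spatial_gradients(img):
    """Forward differences along x and y for an H x W grid (list of lists)."""
    height, width = len(img), len(img[0])
    dx = [[img[r][c + 1] - img[r][c] for c in range(width - 1)] for r in range(height)]
    dy = [[img[r + 1][c] - img[r][c] for c in range(width)] for r in range(height - 1)]
    return dx, dy

def gradient_loss(pred, target):
    """Mean squared difference between predicted and target spatial gradients."""
    total, count = 0.0, 0
    for grad_pred, grad_target in zip(spatial_gradients(pred), spatial_gradients(target)):
        for row_p, row_t in zip(grad_pred, grad_target):
            for a, b in zip(row_p, row_t):
                total += (a - b) ** 2
                count += 1
    return total / count

flat = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
seam = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]      # step at a patch boundary
shifted = [[5.0, 5.0, 5.0, 5.0], [5.0, 5.0, 5.0, 5.0]]   # constant offset only
seam_penalty = gradient_loss(seam, flat)
offset_penalty = gradient_loss(shifted, flat)
```

A constant offset has zero gradient error (that is what the L2 term is for), while a seam that the target does not contain is penalized.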

Task 4: Surface normals (unit vectors)

Predict the surface normal direction (3D unit vector) at each pixel. Normals encode local surface curvature: flat surfaces have uniform normals; wrinkles, folds, and muscle definition create rapidly varying normals. Loss: (1 - cos θ) + L2 + gradient loss. The cosine term directly optimizes angular accuracy; L2 ensures magnitude stability; gradient loss preserves sharp creases.
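A sketch of the cosine and L2 terms for normals follows. The gradient term is omitted for brevity, and the equal weighting of the two terms is an assumption since the source gives no loss weights.

```python
import math

def normal_loss(pred, target, eps=1e-8):
    """Mean over pixels of (1 - cos theta) + squared L2 distance.

    pred, target: lists of 3-vectors. Targets are expected to be unit length.
    """
    total = 0.0
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_t = math.sqrt(sum(b * b for b in t))
        cosine = dot / (norm_p * norm_t + eps)         # angular agreement
        l2 = sum((a - b) ** 2 for a, b in zip(p, t))   # also penalizes wrong length
        total += (1.0 - cosine) + l2
    return total / len(pred)

up = [0.0, 0.0, 1.0]
side = [1.0, 0.0, 0.0]
aligned = normal_loss([up], [up])
perpendicular = normal_loss([side], [up])      # cos = 0, squared distance = 2
too_long = normal_loss([[0.0, 0.0, 2.0]], [up])
```

The `too_long` case shows why the L2 term is there: the direction is perfect, so the cosine term is about zero, but the prediction is not unit length and still incurs a loss of about 1.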

Task 5: Albedo (intrinsic color)

Predict the intrinsic diffuse color at each pixel — what the surface looks like under uniform white light, stripping away shadows, specular highlights, and ambient lighting. Loss: L2 + gradient + mean-color alignment. The mean-color alignment term prevents the model from predicting a globally shifted color palette (e.g., everything too warm or too cool).

Why lightweight heads work: The frozen backbone already encodes everything each task needs. Pose uses the deep semantic features to identify body parts. Normals use the early texture features to detect surface curvature. Albedo uses the preserved color information (thanks to the no-color-aug MAE anchor). The heads just decode what is already there. Training each head takes a fraction of the pretraining cost.

Why does the pose estimation head use OHEM (Online Hard Example Mining)?

To reduce GPU memory usage
To focus the loss on the hardest keypoints (occluded joints, ambiguous poses) and prevent the model from coasting on easy visible keypoints
Because 308 keypoints is too many to train at once

Chapter 7: Dense Probing

How do you measure whether a pretrained backbone has actually learned universal human representations? You probe it: freeze the backbone, attach a minimal linear head, and evaluate on each task. If the backbone is good, even a linear probe should perform well — because the features already encode the right information.

The probing protocol

For each of the five tasks (pose, segmentation, normals, pointmap, albedo), the paper trains a single linear layer on top of frozen backbone features. No decoder stack. No skip connections. No fine-tuning. Just a matrix multiply from feature dimension to output dimension. This is the harshest possible test of feature quality.
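To show how little capacity a linear probe has, the sketch below applies one weight matrix and bias to each frozen token feature. The shapes and numbers are toy values; in practice the feature dimension would be the backbone's hidden size and the output dimension would depend on the task (for example 29 for part segmentation).

```python
def linear_probe(features, weight, bias):
    """Apply the same linear map to every token feature.

    features: N vectors of dimension D (frozen backbone outputs).
    weight: C rows of dimension D. bias: C values. Returns N vectors of dimension C.
    """
    return [[sum(w * f for w, f in zip(row, feat)) + b
             for row, b in zip(weight, bias)]
            for feat in features]

features = [[1.0, 2.0], [3.0, 4.0]]                  # N=2 tokens, D=2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # C=3 outputs
bias = [0.0, 0.0, 1.0]
outputs = linear_probe(features, weight, bias)
```

The only trainable parameters are `weight` and `bias`, so any task performance has to come from information already present in the features.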

The competition

Sapiens2-5B is compared against the best available pretrained vision backbones:

DINOv2-G (1.1B params) — the gold standard for self-supervised ViT features
DINOv3-7B (7B params) — the latest scale-up, ~37% more parameters than Sapiens2-5B
PE-H — a human-specialized encoder trained on pose datasets
Sapiens v1 (MAE-only) — the direct predecessor

Sapiens2-5B beats ALL baselines on EVERY task in dense probing. This is remarkable because DINOv3-7B has 37% more parameters. The advantage comes from the unified pretraining: MAE provides the low-level features that dense probing needs, while CL provides the semantic organization that helps the linear probe find the right mapping. Neither MAE-only nor CL-only backbones achieve this.

The results are especially dramatic for tasks that need both semantics and detail. On normals, linear probing of Sapiens2 features achieves angular errors competitive with fully fine-tuned models. On segmentation, the linear probe mIoU exceeds many full models. This means the features are so well-organized that a linear separator can carve them into 29 body parts with high accuracy.

Dense Probing Results

Interactive figure: backbone quality across tasks via linear probing. Higher bars = better features. Sapiens2 leads on every task despite DINOv3 having 37% more parameters.

Why does dense probing with a linear head (no fine-tuning) provide a fair comparison of backbone quality?

Because it requires more compute
Because a linear probe can only succeed if the features already encode the right information: there's no decoder capacity to compensate for poor features, making it the harshest test of backbone quality
Because it is faster to train

Chapter 8: Results

Sapiens2 achieves state-of-the-art across all five human-centric tasks, with massive improvements over Sapiens v1 on every task that has a v1 baseline.

Head-to-head: Sapiens v1 vs v2

Task	Metric	Sapiens v1	Sapiens2-5B	Improvement
Pose	mAP	78.3	82.3	+4.0
Segmentation	mIoU	58.2	82.5	+24.3
Normals	Angular error	12.3°	6.73°	-45.6%
Pointmap	MPJPE	—	SOTA	Beats VGGT, MoGe, DUSt3R
Albedo	MAE / PSNR	—	0.012 / 32.6	New task (no v1 baseline)

The segmentation jump explained

The +24.3 mIoU improvement on segmentation is the most dramatic result in the paper. Why such a massive gain? Segmentation requires recognizing which body part each pixel belongs to — a fundamentally semantic task. MAE-only pretraining (v1) learned spatial structure but not semantic categories. Adding contrastive learning teaches the model that an elbow is an elbow regardless of skin color, sleeve length, or viewing angle. The CL branch provides exactly the semantic feature organization that segmentation needs.

Normals improvement explained

The 45.6% reduction in angular error is equally telling. Surface normals depend on subtle shading gradients — low-level appearance cues. But they also need semantic context: the model must know it is looking at a shoulder to correctly predict the curvature, because shoulders have characteristic geometry. MAE gives the shading sensitivity; CL gives the body-part awareness. Neither alone achieves 6.73°.

Scaling behavior

Across all tasks, performance improves monotonically with model size. The 5B model consistently outperforms the 1.5B, which outperforms the 0.8B. There is no sign of diminishing returns, suggesting that even larger Sapiens2 models would continue to improve.

The unified pretraining payoff: Every task improved from the same pretraining change (adding CL to MAE). Tasks that needed more semantics (segmentation: +24.3 mIoU) improved more than tasks already well-served by MAE (normals: 45.6% better). But even pose, which was already decent with MAE-only, gained +4 mAP from better semantic features. This confirms that the two objectives are genuinely complementary, not competing.

Why did segmentation improve by +24.3 mIoU (the largest gain) when adding contrastive learning?

Because segmentation classifies each pixel into a body part, a fundamentally semantic task. MAE-only pretraining learned spatial structure but not semantic categories; adding CL teaches the model that an elbow is an elbow regardless of appearance
Because the segmentation head was redesigned
Because of higher resolution inputs

Chapter 9: Connections

What Sapiens2 built on

MAE (He et al., 2022): The masked autoencoder that Sapiens2 uses as its pixel-space reconstruction objective. MAE showed that masking 75% of patches creates an effective self-supervised objective for ViTs. Sapiens2 inherits this directly but pairs it with contrastive learning to add semantics.

DINO / DINOv2 (Caron et al., 2021; Oquab et al., 2023): The self-distillation framework that Sapiens2 adapts for its CL branch. DINOv2 also combined MIM with contrastive learning (via iBOT), but operated in latent space with aggressive augmentations. Sapiens2 departs by anchoring in pixel space and controlling augmentations.

Sapiens v1 (Khirodkar et al., 2024): The direct predecessor. MAE-only pretraining on 300M human images. Sapiens2 keeps the human-centric focus but upgrades the pretraining (unified loss), dataset (1B images), architecture (GQA, SwiGLU), and resolution (4K hierarchical attention).

Related approaches

iBOT (Zhou et al., 2022): Combined MIM with self-distillation via masked image tokens in latent space. Closest prior hybrid to Sapiens2, but its latent-space MIM + color augmentations caused representation drift for human-centric tasks.

PE-H (Human Pose Encoder): A specialized encoder trained specifically for pose estimation. Sapiens2 outperforms it on pose AND all other tasks, demonstrating the advantage of a general-purpose backbone over task-specific training.

DINOv3-7B: The latest large-scale CL backbone. Despite 37% more parameters, it underperforms Sapiens2-5B on dense probing across all tasks, which suggests that domain-specific data (Humans-1B) and pixel-space MAE matter more than raw scale for human-centric tasks.

What Sapiens2 enables

Human digitization: Pointmap + normals + albedo from a single image creates a complete 3D human model suitable for relighting, animation, and AR/VR telepresence.

Fitness and health: 308-keypoint pose estimation enables precise body measurement, exercise form analysis, and rehabilitation monitoring.

Fashion and retail: Part segmentation + albedo enables virtual try-on with accurate material appearance.

Sapiens2's lasting insight: When your downstream tasks span the full spectrum from pixel-level to semantic, you need a pretraining objective that covers that same spectrum. Anchoring in pixel space while organizing semantically is the recipe. The 1B human-centric dataset shows that domain-specific data at scale can outcompete domain-general data at even larger scale.

Cheat sheet

Core recipe

L = L_MAE + λ · L_CL — reconstruct pixels while matching views across a student-teacher framework

Key numbers

1B images • 0.4B-5.1B params • 1K-4K resolution • 308 keypoints • 29 body parts • 5 tasks

Architecture

ViT + GQA + SwiGLU + QK-Norm + RMSNorm + PixelShuffle • 4K hierarchical attention (windowed → pool → global)

Critical detail

No color augmentation on global views — the pixel-space MAE anchor that prevents representation drift

Impact

SOTA on ALL 5 human-centric tasks • +24.3 mIoU segmentation • 45.6% lower normal error • Beats DINOv3-7B, which has 37% more params

Explore further

Vision Transformers — the ViT foundations that Sapiens2 builds on.

DINO — the self-supervised contrastive framework adapted for Sapiens2's CL branch.

Depth Anything V2 — another dense prediction specialist; complementary approach using synthetic data.

Vision Banana — image generators as vision learners; an alternative path to unified representations.

What advantage does Sapiens2's domain-specific Humans-1B dataset provide over DINOv3's larger but domain-general training set?

Domain-specific data at scale means every training image teaches the model about human structure, making features more efficient for human-centric tasks even with fewer parameters: Sapiens2-5B beats DINOv3-7B even though DINOv3-7B has 37% more parameters
Domain-specific data is cheaper to collect
Domain-specific data trains faster