671B parameters, 37B activated per token. Multi-head Latent Attention compresses the KV cache. DeepSeekMoE routes tokens without auxiliary losses. FP8 training from scratch. All for $5.57M and zero loss spikes.
You want to build a frontier language model — one that rivals GPT-4o and Claude 3.5 Sonnet on math, code, and reasoning. The obvious approach is to scale up a dense Transformer: more layers, more heads, more parameters. But you hit two walls immediately.
Wall 1: Training cost. GPT-4 reportedly cost over $100M to train. Llama-3.1 405B required 30.8M GPU hours. At $2/GPU-hour, that's $61.6M just for pre-training. These numbers make frontier models a game only a handful of companies can play.
Wall 2: Inference cost. A 405B dense model activates every parameter on every token. The KV cache alone — storing all the keys and values from previous tokens for attention — grows linearly with context length and model width. For a 405B model serving long contexts, you need expensive multi-GPU setups just to hold the model in memory, let alone run it fast.
DeepSeek-V3 attacks both walls simultaneously. It is a Mixture-of-Experts (MoE) model with 671B total parameters but only 37B activated per token. It introduces Multi-head Latent Attention (MLA) to compress the KV cache by an order of magnitude. And it pioneers FP8 mixed-precision training to halve the memory and compute per operation. The result: a model that matches GPT-4o on major benchmarks, trained for 2.788M H800 GPU hours — roughly $5.57M.
Zero loss spikes. No rollbacks. This is the report of how they did it.
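The cost figures above are simple products of GPU hours and an hourly price. A minimal sketch of that arithmetic, assuming the same flat $2 per GPU-hour for both models (the function name is illustrative):

```python
def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float = 2.0) -> float:
    """Rental-style estimate: GPU hours times an assumed flat hourly price."""
    return gpu_hours * usd_per_gpu_hour

llama_cost = training_cost_usd(30.8e6)      # Llama-3.1 405B pre-training
deepseek_cost = training_cost_usd(2.788e6)  # DeepSeek-V3 full training run
ratio = llama_cost / deepseek_cost

print(f"Llama-3.1 405B: ${llama_cost / 1e6:.1f}M")
print(f"DeepSeek-V3:    ${deepseek_cost / 1e6:.3f}M")
print(f"Ratio:          {ratio:.1f}x")
```

The estimate covers only the rented compute for the stated runs; it says nothing about research, ablations, or data costs.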
[Interactive figure: FLOPs per token for a dense model, which activates every parameter, versus an MoE model, which activates only a fraction, as total parameters grow. The gap widens with scale.]
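The dense-versus-MoE gap can be approximated with a common rule of thumb: a forward pass costs about 2 FLOPs per active parameter per token. That rule is an assumption here, not something the report states, and it ignores attention over the context. A rough sketch:

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule of thumb: ~2 FLOPs (one multiply, one add) per active parameter."""
    return 2.0 * active_params

dense_405b = forward_flops_per_token(405e9)  # dense: every parameter is active
moe_v3 = forward_flops_per_token(37e9)       # DeepSeek-V3: 37B of 671B active
dense_671b = forward_flops_per_token(671e9)  # hypothetical dense model of equal size

active_fraction = 37e9 / 671e9
print(f"active fraction of DeepSeek-V3: {active_fraction:.1%}")
print(f"dense 405B vs V3:  {dense_405b / moe_v3:.1f}x more FLOPs per token")
print(f"dense 671B vs V3:  {dense_671b / moe_v3:.1f}x more FLOPs per token")
```

Under this approximation the compute gap is simply the ratio of active parameter counts, which is why it grows as total parameters rise while the activated count stays fixed.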
DeepSeek-V3's design philosophy is a triple decomposition: separate capacity from activation cost (MoE), separate cache size from attention quality (MLA), and separate numerical precision from training stability (FP8). Each decomposition attacks a different cost bottleneck.
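Two additional innovations make the system work in practice: an auxiliary-loss-free strategy for balancing load across experts, and a multi-token prediction (MTP) training objective.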
Standard multi-head attention (MHA) stores a key vector and a value vector for every token, at every layer, for every head. For DeepSeek-V3 with $n_h = 128$ heads and per-head dimension $d_h = 128$, that is 128 × 128 = 16,384 dimensions of keys plus 16,384 dimensions of values per token per layer. For a 128K context, this cache is enormous.
MLA's insight: the keys and values across all heads are highly redundant. Instead of storing them directly, compress them into a much smaller latent vector and reconstruct the full keys and values on the fly.
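Given the hidden state $h_t$ at position $t$, MLA first compresses it into a latent vector:

$$c_t^{KV} = W^{DKV} h_t, \qquad c_t^{KV} \in \mathbb{R}^{d_c}$$

where $d_c = 512$ is the KV compression dimension and $W^{DKV}$ is the learned down-projection matrix. Compare this to $d_h n_h = 128 \times 128 = 16{,}384$ for the keys alone in standard MHA. That is a 32x compression, or 64x once the values (another 16,384 dimensions) are counted.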
During inference, the KV cache stores only $c_t^{KV}$ (512 dims) plus a small RoPE key $k_t^R$ (64 dims) per token per layer. The full keys and values are reconstructed from the latent when needed:

$$k_t^C = W^{UK} c_t^{KV}, \qquad v_t^C = W^{UV} c_t^{KV}$$

The up-projection matrices $W^{UK}$ and $W^{UV}$ are learned parameters. They reconstruct the full-dimensional keys and values for all 128 heads from the 512-dim latent; $k_t^C$ and $v_t^C$ are the per-head vectors stacked into one 16,384-dim vector each.
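The compress-then-reconstruct idea can be sketched in a few lines of plain Python. The sizes below are toy values chosen for readability (the text uses a 512-dim latent, 128 heads, and 128 dims per head), the weights are random rather than learned, and the helper names are illustrative. The decoupled RoPE key is left out.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    """Random stand-in for a learned projection matrix."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumptions for illustration only).
d_model, d_c, n_heads, d_h = 16, 4, 2, 8
rng = random.Random(0)

W_down = rand_matrix(d_c, d_model, rng)         # hidden state -> latent
W_up_k = rand_matrix(n_heads * d_h, d_c, rng)   # latent -> keys for all heads
W_up_v = rand_matrix(n_heads * d_h, d_c, rng)   # latent -> values for all heads

h_t = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

c_kv = matvec(W_down, h_t)     # the only per-token vector the cache must keep
k_full = matvec(W_up_k, c_kv)  # reconstructed on the fly
v_full = matvec(W_up_v, c_kv)

# Split the flat key vector into one key per head.
k_heads = [k_full[i * d_h:(i + 1) * d_h] for i in range(n_heads)]

cached = len(c_kv)
reconstructed = len(k_full) + len(v_full)
print(f"cached dims per token: {cached}, reconstructed K+V dims: {reconstructed}")
```

Each head gets a different key because it reads a different slice of the up-projection output, even though every head starts from the same cached latent.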
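Rotary Position Embeddings (RoPE) rotate query and key vectors by a position-dependent angle, applied after projection. That position dependence is the problem: at inference, MLA avoids materializing full keys by folding the key up-projection into the query-side weights, but a rotation that changes with position would sit between the two matrices and block that merge. RoPE therefore cannot simply be applied to keys reconstructed from the latent without giving up the cache savings.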
MLA solves this by maintaining a decoupled RoPE key: a separate, small key vector $k_t^R \in \mathbb{R}^{64}$ that carries RoPE. The final key for head $i$ is a concatenation:

$$k_{t,i} = [\,k_{t,i}^C \,;\, k_t^R\,]$$

The content key $k_{t,i}^C$ is head $i$'s slice of the up-projected latent, so it differs per head even though the latent itself is shared. The RoPE key $k_t^R$ is shared across all heads (only 64 dims). So the total KV cache per token per layer is $d_c + d_h^R = 512 + 64 = 576$ dimensions, down from 32,768 in standard MHA, roughly a 57x reduction.
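To put the per-token numbers in context, the sketch below scales them to a full 128K-token cache. Two inputs are assumptions not stated in this text: 61 Transformer layers (taken from the DeepSeek-V3 report) and 2 bytes per cached value (BF16 storage).

```python
def kv_cache_bytes(dims_per_token_per_layer, num_layers, context_len, bytes_per_value=2):
    """Total cache size for one sequence, assuming a fixed width per token per layer."""
    return dims_per_token_per_layer * num_layers * context_len * bytes_per_value

N_HEADS, D_HEAD = 128, 128
D_C, D_ROPE = 512, 64
NUM_LAYERS = 61        # assumption: from the DeepSeek-V3 report, not from this text
CONTEXT = 128 * 1024   # 128K tokens

mha_dims = 2 * N_HEADS * D_HEAD  # keys + values for every head
mla_dims = D_C + D_ROPE          # latent + shared RoPE key

mha_gib = kv_cache_bytes(mha_dims, NUM_LAYERS, CONTEXT) / 2**30
mla_gib = kv_cache_bytes(mla_dims, NUM_LAYERS, CONTEXT) / 2**30

print(f"MHA: {mha_dims} dims/token/layer -> {mha_gib:.1f} GiB per 128K sequence")
print(f"MLA: {mla_dims} dims/token/layer -> {mla_gib:.2f} GiB per 128K sequence")
print(f"reduction: {mha_dims / mla_dims:.1f}x")
```

The reduction factor depends only on the per-token widths, so it holds at any context length; the absolute sizes are what change with the number of layers and the storage precision.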
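Queries get the same low-rank treatment to reduce activation memory during training:

$$c_t^Q = W^{DQ} h_t, \qquad q_t^C = W^{UQ} c_t^Q$$

Here $c_t^Q \in \mathbb{R}^{d'_c}$ is the query latent, with $d'_c = 1536$ in DeepSeek-V3.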
A separate RoPE query is produced and concatenated, matching the key structure. Then standard scaled dot-product attention proceeds as usual.
[Interactive figure: KV cache size per token as a function of the compression dimension $d_c$, comparing MLA (orange) with standard MHA (gray). The compression ratio grows as $d_c$ shrinks.]
In a standard Transformer, every token passes through the same feed-forward network (FFN). In a Mixture-of-Experts model, the FFN is replaced by many parallel "expert" FFNs, and a router decides which experts each token visits. Most tokens only activate a few experts, so compute per token stays small even as you add more experts (and more total parameters).
DeepSeek-V3's MoE layer has a distinctive structure:
| Component | Count | Purpose |
|---|---|---|
| Shared experts | 1 per layer | Always active. Captures common knowledge that every token needs. |
| Routed experts | 256 per layer | Specialized. Each token visits its top-8 ($K_r = 8$) by affinity score. |
| Router | 1 per layer | Computes token-expert affinity $s_{i,t} = \mathrm{sigmoid}(u_t^\top e_i)$ for routing. |
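The output for token $t$ is:

$$h'_t = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(u_t) + \sum_{i=1}^{N_r} g_{i,t}\, \mathrm{FFN}_i^{(r)}(u_t)$$

where $u_t$ is the layer input, $N_s = 1$ and $N_r = 256$ are the numbers of shared and routed experts, and $g_{i,t}$ is the normalized gating weight: for an expert in the top 8, the affinity score $s_{i,t}$ divided by the sum of the top-8 scores, and zero for every other expert. A crucial detail: DeepSeek-V3 uses sigmoid for affinities (not softmax over all experts). This means each expert's score is independent: changing one expert's affinity doesn't affect another's.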
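The routing rule described above (sigmoid affinities, top-k selection, gates normalized over the selected experts only) can be sketched as follows. The toy experts, the sizes, and the function names are illustrative assumptions; real experts are feed-forward networks.

```python
import math

def route(affinity_logits, top_k):
    """Map one token's logits (one per expert) to {expert index: gate weight}."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in affinity_logits]  # independent sigmoids
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}  # gates sum to 1 over the top-k

def moe_layer(u_t, shared_expert, routed_experts, affinity_logits, top_k):
    """Residual input + shared expert output + gated sum of the selected routed experts."""
    gates = route(affinity_logits, top_k)
    out = [u + s for u, s in zip(u_t, shared_expert(u_t))]
    for i, g in gates.items():
        out = [o + g * y for o, y in zip(out, routed_experts[i](u_t))]
    return out, gates

# Toy example: 4 routed experts, top-2 (the text uses 256 experts, top-8).
shared = lambda x: [0.5 * v for v in x]
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
out, gates = moe_layer([1.0, 2.0], shared, experts, [2.0, -1.0, 0.5, 0.0], top_k=2)

print("gates:", gates)
print("output:", out)
```

Only the two selected experts are ever called, which is where the compute saving comes from: the unselected experts contribute neither output nor FLOPs for this token.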
The biggest practical problem with MoE is load imbalance. If most tokens flock to a few popular experts, those experts become bottlenecks and the rest sit idle. The standard fix is an auxiliary loss that penalizes imbalance. But this loss fights the main training objective — it forces suboptimal routing to achieve balance, degrading model quality.
DeepSeek-V3's solution is elegant: dynamic bias terms. Each routed expert $i$ has a bias $b_i$ that is adjusted by a simple rule rather than learned by gradient descent. When the router decides which experts a token visits, it ranks them by $s_{i,t} + b_i$. But when it computes the actual gating weight (the scalar that multiplies the expert's output), it uses only $s_{i,t}$. The bias changes which experts are selected, but it never enters the gating value, so it contributes no gradient term of its own.
At the end of each training step, the system checks expert loads across the entire batch. Overloaded experts get $b_i$ decreased by $\gamma$. Underloaded experts get $b_i$ increased by $\gamma$. The bias nudges routing toward balance without ever adding a term to the loss function.
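A toy simulation makes the feedback loop concrete. Everything numeric here is an illustrative assumption (4 experts, top-1 routing, a step size of 0.01, synthetic scores with one artificially popular expert), and "overloaded" is taken to mean above the mean load for the step.

```python
import random

def select_experts(scores, bias, top_k):
    """Rank experts by score + bias; gate values use the raw scores only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias, loads, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(loads) / len(loads)
    new_bias = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            new_bias.append(b - gamma)
        elif load < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias

# Toy setup (all values are illustrative assumptions).
N_EXPERTS, TOP_K, GAMMA, TOKENS_PER_STEP, STEPS = 4, 1, 0.01, 256, 50
rng = random.Random(0)
bias = [0.0] * N_EXPERTS
history = []

for _ in range(STEPS):
    loads = [0] * N_EXPERTS
    for _ in range(TOKENS_PER_STEP):
        # Expert 0 is artificially popular: its scores are shifted up by 0.3.
        scores = [0.05 + 0.6 * rng.random() + (0.3 if i == 0 else 0.0)
                  for i in range(N_EXPERTS)]
        for i in select_experts(scores, bias, TOP_K):
            loads[i] += 1
    history.append(loads)
    bias = update_bias(bias, loads, GAMMA)

print("first step loads:", history[0])
print("last step loads: ", history[-1])
print("final biases:    ", [round(b, 2) for b in bias])
```

Because the update only looks at the sign of the imbalance, the biases keep oscillating by one step around the balanced point; a smaller step size trades slower convergence for smaller oscillations.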
[Interactive figure: 16 experts with top-2 routing. As batches of tokens are routed, the bias terms (orange bars) adjust to balance the load: overloaded experts have their bias decreased, underloaded experts have it increased.]
Standard language model training uses a next-token prediction objective: at each position t, predict token t+1. This gives one gradient signal per position. DeepSeek-V3 extends this to multi-token prediction (MTP): at position t, predict tokens t+1 and t+2.
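Why? The DeepSeek-V3 report gives two reasons: predicting more than one token densifies the training signal at each position, and it may encourage the model to pre-plan its representations for upcoming tokens.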
Unlike Gloeckle et al. (2024), who use parallel independent heads, DeepSeek-V3 predicts the additional tokens sequentially and keeps the full causal chain at each prediction depth.
Concretely, the MTP module at depth $k$ takes two inputs at position $t$: (1) the hidden state $h_t^{k-1}$ from the previous depth (the main model's output when $k = 1$), and (2) the embedding of the token at position $t+k$, which during training is the ground-truth token from the sequence. Both are normalized, concatenated, projected through a linear layer, and fed through a single Transformer layer. The output is then passed through the shared output head (the same as the main model's language model head) to produce logits for the token at position $t+k+1$.
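The target alignment is easy to get wrong by one position, so here is a small sketch of which token each depth is trained to predict and how the losses are combined. The weighting (main loss plus λ times the average MTP loss, with λ = 0.3) follows the DeepSeek-V3 report; the function names and the loss values are illustrative assumptions.

```python
def mtp_targets(tokens, num_extra):
    """For offset k (0 = ordinary next-token prediction), list (position, target) pairs.

    At position t, offset k targets the token at position t + 1 + k.
    """
    return {
        k: [(t, tokens[t + 1 + k]) for t in range(len(tokens) - 1 - k)]
        for k in range(num_extra + 1)
    }

def combined_loss(main_loss, mtp_losses, lam):
    """Main next-token loss plus lam times the average loss over the extra depths."""
    return main_loss + lam * sum(mtp_losses) / len(mtp_losses)

tokens = ["The", "cat", "sat", "on", "mat"]
targets = mtp_targets(tokens, num_extra=1)  # one extra token, as in DeepSeek-V3

print("next-token targets:", targets[0])
print("second-token targets:", targets[1])
print("combined loss:", combined_loss(2.0, [2.5], lam=0.3))  # made-up loss values
```

Each extra depth has one fewer usable position at the end of the sequence, since there is no token left to predict there.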
[Interactive figure: standard NTP (gray) predicts one token ahead; MTP (orange) predicts two. The figure shows what the model sees at position t and what it must predict.]
Most large model training uses BF16 (bfloat16): 16 bits per number, with 8 bits for exponent and 7 for mantissa. This gives good dynamic range but uses 2 bytes per parameter and per activation. DeepSeek-V3 is the first model at this scale to train with FP8 (8 bits): half the memory, nearly double the throughput.
But FP8 has a fundamental problem: with only 8 bits, you get either decent range (E5M2: 5 exponent bits and 2 mantissa bits, giving roughly the range of FP16 but almost no precision) or decent precision (E4M3: 4 exponent bits and 3 mantissa bits, giving better precision but limited range). Neither is great on its own.
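The range-versus-precision trade-off can be read directly off the bit layouts. The sketch below assumes the commonly used FP8 encodings: E5M2 reserves its top exponent code for infinities and NaN in the IEEE style, while E4M3 keeps that exponent usable and gives up only one bit pattern for NaN. The function names are illustrative.

```python
def fp8_max(exp_bits, man_bits, ieee_specials):
    """Largest finite value of a small float format with the given bit split."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_specials:
        # Top exponent code is reserved for inf/NaN (E5M2 behaves like a tiny FP16).
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2 - 2 ** -man_bits
    else:
        # Top exponent is usable; only its all-ones mantissa is NaN (E4M3 convention).
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2 - 2 ** -(man_bits - 1)
    return max_frac * 2 ** max_exp

def relative_step(man_bits):
    """Spacing between neighbouring values, relative to the start of a binade."""
    return 2 ** -man_bits

e4m3_max = fp8_max(4, 3, ieee_specials=False)
e5m2_max = fp8_max(5, 2, ieee_specials=True)

print(f"E4M3: max {e4m3_max:g}, relative step {relative_step(3):g}")
print(f"E5M2: max {e5m2_max:g}, relative step {relative_step(2):g}")
```

E5M2 reaches values over a hundred times larger, but its neighbouring values are twice as far apart, which is the trade-off the paragraph above describes.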
DeepSeek-V3's solution uses FP8 for the expensive parts and higher precision for the sensitive parts:
| Operation | Precision | Why |
|---|---|---|
| Linear layer forward (GEMM) | FP8 (E4M3) | Bulk of compute. FP8 Tensor Cores are 2x faster. |
| Linear layer backward (GEMM) | FP8 (E4M3) | Gradient matmuls also benefit from 2x speedup. |
| Accumulation in GEMM | FP32 | Tiny errors compound over 7168-dim dot products. Must accumulate in high precision. |
| Attention, normalization, routing | BF16 / FP32 | These are numerically sensitive. FP8 here causes instability. |
| Optimizer states (Adam moments) | BF16 | The report tracks the first and second moments in BF16 without observable degradation, which saves memory. |
| Master weights | FP32 | Updates are tiny; low precision would lose them. |
A single FP8 scale factor per tensor is too coarse: if one element is 1000x larger than the rest, the scale must accommodate it and the small values get crushed to zero. DeepSeek-V3 uses fine-grained quantization instead. For weights, each 128×128 block gets its own scale factor. For activations, each 1×128 tile (one token, 128 channels) gets its own scale. This preserves precision across the wildly varying magnitudes in real neural network tensors.
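The effect of a single outlier can be reproduced with a simulated E4M3 rounding step. This is a sketch in plain Python, not real FP8 arithmetic: `round_e4m3` is a hypothetical helper that snaps a value to an E4M3-like grid (3 mantissa bits, saturation at 448), and the data is synthetic. It uses 1×128 tiles on a flat vector, in the style described for activations.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value

def round_e4m3(x):
    """Round x to the nearest value on an E4M3-like grid (3 mantissa bits, saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    exp = max(math.frexp(a)[1] - 1, -6)  # floor(log2(a)), clamped at the smallest normal exponent
    step = 2.0 ** (exp - 3)              # 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)

def fake_quantize(values, scale):
    """Scale into the FP8 range, round, and scale back."""
    return [round_e4m3(v / scale) * scale for v in values]

def quantize_tensorwise(values):
    scale = max(abs(v) for v in values) / E4M3_MAX or 1.0
    return fake_quantize(values, scale)

def quantize_tilewise(values, tile=128):
    out = []
    for start in range(0, len(values), tile):
        chunk = values[start:start + tile]
        scale = max(abs(v) for v in chunk) / E4M3_MAX or 1.0
        out.extend(fake_quantize(chunk, scale))
    return out

def mean_rel_error(ref, approx):
    return sum(abs(a - r) / abs(r) for r, a in zip(ref, approx)) / len(ref)

# Synthetic data: one tile of small values, one tile containing a 1000x outlier.
small = [0.001 * (1 + 0.37 * (i % 7)) for i in range(128)]
big = [1.0] * 127 + [1000.0]
x = small + big

err_tensor = mean_rel_error(small, quantize_tensorwise(x)[:128])
err_tile = mean_rel_error(small, quantize_tilewise(x)[:128])
print(f"mean relative error on the small tile, tensor-wise scale: {err_tensor:.3f}")
print(f"mean relative error on the small tile, tile-wise scale:   {err_tile:.4f}")
```

With one scale for the whole vector, the outlier forces most of the small values to round to zero; with a scale per tile, the same values keep a relative error bounded by the format's half-step.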
[Interactive figure: quantization error of FP8 E4M3 compared with BF16. Block-wise scaling (orange) keeps the error small by using a local scale factor, while tensor-wise scaling (gray) loses precision on small values.]
Pre-training is where the model consumes raw text and learns the statistical structure of language. For DeepSeek-V3, this stage consumed 2,664K H800 GPU hours — 95.6% of the total training budget. Everything about it was optimized for stability and throughput.
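DeepSeek-V3 was pre-trained on 14.8 trillion tokens of diverse, high-quality data. The composition was intentionally tuned: compared with DeepSeek-V2, the corpus raises the share of mathematical and programming samples and extends multilingual coverage beyond English and Chinese.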
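With 2,048 H800 GPUs connected by InfiniBand and NVLink, the training framework combines pipeline parallelism (the DualPipe schedule) with expert parallelism and uses no tensor parallelism. The main optimizer settings were: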
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | Peak 2.2×10⁻⁴, cosine decay |
| Weight decay | 0.1 |
| Batch size ramp | 3072 → 15360 sequences |
| Sequence length | 4096 tokens (pre-training), extended to 32K then 128K |
| Gradient clipping | 1.0 |
After the main pre-training phase, context length was extended in two stages: first to 32K tokens, then to 128K. This used YaRN (Yet another RoPE extension) to rescale the rotary position embeddings. Each stage trained for a short additional period — 119K GPU hours total, a small fraction of the pre-training cost.
The pre-trained base model is a powerful text predictor, but it doesn't follow instructions or align with human preferences. Post-training transforms it into a helpful assistant through two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
DeepSeek-V3 was fine-tuned on 1.5 million instruction-response pairs spanning diverse domains.
During SFT, the MTP objective was maintained — the model continued predicting 2 tokens ahead, which helped the fine-tuned model retain the stronger representations learned during pre-training.
After SFT, RL further aligns the model using Group Relative Policy Optimization (GRPO). Unlike PPO, GRPO doesn't require a separate critic network. Instead, it samples a group of responses per prompt, scores each one, and uses the group's own statistics as the baseline: a response's advantage is its reward minus the group mean, divided by the group standard deviation.
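The group-relative baseline at the heart of GRPO fits in a few lines. This sketch covers only the advantage computation, not the clipped policy update or the KL penalty; the use of the population standard deviation, the small epsilon, and the reward values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, with made-up rewards.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward {r:.1f} -> advantage {a:+.3f}")
```

Responses that beat their group's average get a positive advantage and are reinforced; the group itself plays the role that a learned critic plays in PPO.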
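Two types of reward signal were used: rule-based rewards for tasks whose answers can be checked mechanically (for example, math problems with a fixed final answer, or code run against test cases), and model-based rewards from a learned reward model for free-form responses.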
DeepSeek-V3 was evaluated against both open-source models (Llama 3.1 405B, Qwen 2.5 72B) and closed-source frontier models (GPT-4o, Claude 3.5 Sonnet). The results position it as the strongest open-source model at the time of its release and competitive with the best closed-source systems.
Knowledge and reasoning:

| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 | Llama 405B |
|---|---|---|---|---|
| MMLU | 88.5 | 87.2 | 88.3 | 88.6 |
| MMLU-Pro | 75.9 | 72.6 | 78.0 | 73.3 |
| GPQA-Diamond | 59.1 | 49.9 | 65.0 | 51.1 |
| DROP (F1) | 91.6 | 83.7 | 88.3 | 88.7 |

Math:

| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 | Llama 405B |
|---|---|---|---|---|
| MATH-500 | 90.2 | 74.6 | 78.3 | 73.8 |
| AIME 2024 | 39.2 | 9.3 | 16.0 | 23.3 |
| GSM8K | 95.5 | 92.2 | 95.0 | 89.0 |

Code:

| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 | Llama 405B |
|---|---|---|---|---|
| HumanEval | 82.6 | 90.2 | 93.7 | 89.0 |
| LiveCodeBench | 40.5 | 34.2 | 38.8 | 28.5 |
| Codeforces (percentile) | 51.6 | 23.6 | 20.3 | 25.3 |
| SWE-bench Verified | 42.0 | 38.8 | 50.8 | 33.4 |
[Figure: performance across six key benchmarks. DeepSeek-V3 (orange) vs GPT-4o (gray) vs Claude 3.5 Sonnet (teal). Higher is better.]
DeepSeek-V3 sits at the intersection of several major research threads. Let's map where each innovation came from and where it leads.
Multi-Query Attention (MQA, Shazeer 2019) first proposed sharing a single K,V head across all query heads to reduce the cache. Grouped-Query Attention (GQA, Ainslie et al. 2023) relaxed this to groups of heads sharing K,V. MLA goes further: instead of sharing, it compresses all K,V into a joint latent. This is more expressive than sharing, because the up-projection can learn different K,V for each head from the same latent. MLA first appeared in DeepSeek-V2 and is carried over to V3.
Shazeer et al. (2017) introduced the modern sparsely gated MoE layer, and GShard (Lepikhin et al. 2020) scaled it to Transformers. Switch Transformers (Fedus et al. 2021) simplified routing to top-1. Mixtral (Jiang et al. 2024) showed that a sparse MoE can match strong dense models. DeepSeekMoE (Dai et al. 2024) introduced fine-grained experts (many small experts instead of a few large ones; V3 uses 256 per layer) and shared experts. The auxiliary-loss-free balancing in V3 is the latest evolution, targeting the quality-versus-balance trade-off that auxiliary-loss methods face.
DeepSeek-R1 is a separate model family focused on long chain-of-thought reasoning (trained with RL to produce step-by-step solutions). V3's post-training distills R1's reasoning patterns into a standard-length output model. This "best of both worlds" approach — R1-quality reasoning without R1-length outputs — is a major practical innovation.
Moonshot AI's Kimi K2 (2025) builds on many of the same ideas: MoE architecture, auxiliary-loss-free balancing (citing DeepSeek-V3), and MTP. K2 extends the approach with Muon optimizer and DAPO for RL, representing the next generation of this design family.
| Aspect | DeepSeek-V3 |
|---|---|
| Total parameters | 671B (37B activated per token) |
| Architecture | Transformer + MLA + DeepSeekMoE |
| Attention | MLA: $d_c = 512$, $d'_c = 1536$, 128 heads |
| MoE | 1 shared + 256 routed experts, top-8 |
| Load balancing | Auxiliary-loss-free (dynamic bias) |
| Training objective | NTP + MTP (λ=0.3, depth=1) |
| Training precision | FP8 mixed (linear layers) + BF16/FP32 |
| Pre-training tokens | 14.8T |
| Context length | 128K (extended from 4K via YaRN) |
| Training cost | 2.788M H800 GPU-hours ($5.57M) |
| Hardware | 2,048 H800 GPUs |
| Parallelism | Pipeline (DualPipe) + Expert (no tensor) |
| Training stability | Zero loss spikes, zero rollbacks |
| Post-training | SFT (1.5M pairs) + GRPO + R1 distillation |
| Key results | MATH-500: 90.2%, AIME 2024: 39.2%, Codeforces: 51.6th percentile |