DeepSeek-V3 — Veanors

Chapter 0: The Problem

You want to build a frontier language model — one that rivals GPT-4o and Claude 3.5 Sonnet on math, code, and reasoning. The obvious approach is to scale up a dense Transformer: more layers, more heads, more parameters. But you hit two walls immediately.

Wall 1: Training cost. GPT-4 reportedly cost over $100M to train. Llama-3.1 405B required 30.8M GPU hours. At $2/GPU-hour, that's $61.6M just for pre-training. These numbers make frontier models a game only a handful of companies can play.

Wall 2: Inference cost. A 405B dense model activates every parameter on every token. The KV cache alone — storing all the keys and values from previous tokens for attention — grows linearly with context length and model width. For a 405B model serving long contexts, you need expensive multi-GPU setups just to hold the model in memory, let alone run it fast.

The core tension: Capacity (total parameters) drives quality. But activation cost (parameters used per token) drives both training and inference expense. A 671B model that activates only 37B per token gets the capacity of a massive model at the cost of a medium one — if you can make the routing, attention, and training efficient enough.
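
The capacity-versus-activation trade-off can be put into rough numbers. The sketch below assumes the common rule of thumb of about 2 forward-pass FLOPs per active parameter per token; that constant is an assumption for illustration, not a figure from the technical report, and it ignores attention and routing overhead.

```python
# Rule of thumb (an assumption, not from the report): a forward pass costs
# about 2 FLOPs per *active* parameter per token.
FLOPS_PER_PARAM = 2

def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs for one token."""
    return FLOPS_PER_PARAM * active_params

dense_flops = forward_flops_per_token(405e9)  # dense: all 405B parameters active
moe_flops = forward_flops_per_token(37e9)     # MoE: 37B of 671B active
ratio = dense_flops / moe_flops

print(f"dense 405B: {dense_flops:.2e} FLOPs/token")
print(f"MoE 671B (37B active): {moe_flops:.2e} FLOPs/token")
print(f"dense / MoE: {ratio:.1f}x")
```

Under this approximation the ratio depends only on the active parameter counts, which is why the larger MoE model is cheaper per token.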

DeepSeek-V3 attacks both walls simultaneously. It is a Mixture-of-Experts (MoE) model with 671B total parameters but only 37B activated per token. It introduces Multi-head Latent Attention (MLA) to compress the KV cache by an order of magnitude. And it pioneers FP8 mixed-precision training to halve the memory and compute per operation. The result: a model that matches GPT-4o on major benchmarks, trained for 2.788M H800 GPU hours — roughly $5.57M.

Zero irrecoverable loss spikes. No rollbacks. This is the story of how they did it.

Dense vs. MoE Cost

Drag the slider to set total model parameters. Compare the FLOPs per token for a dense model (activates everything) vs. an MoE model (activates only a fraction). The gap grows dramatically with scale.


Why does a 671B MoE model cost less to run per token than a 405B dense model?

Because the MoE activates only 37B of its 671B parameters per token, while the dense model activates all 405B, so per-token compute is roughly 10x lower despite having more total parameters
Because MoE models use fewer layers
Because MoE models skip the attention mechanism

Chapter 1: The Key Insight

DeepSeek-V3's design philosophy is a triple decomposition: separate capacity from activation cost (MoE), separate cache size from attention quality (MLA), and separate numerical precision from training stability (FP8). Each decomposition attacks a different cost bottleneck.

MoE: Capacity without Cost

671B parameters give the model massive knowledge capacity. But only 37B are active per token (8 routed experts + 1 shared expert per layer). Training cost scales with activated params, not total params.

↓

MLA: Small Cache, Full Quality

Standard multi-head attention stores K and V for every head at every position, which makes the cache enormous. MLA compresses K and V into a low-dimensional latent vector c. The cache stores c (512 dims) instead of the full K and V (16,384 dims each) and reconstructs them on the fly during attention.

↓

FP8: Half the Bits, Same Quality

Most training uses BF16 (16-bit). DeepSeek-V3 trains in FP8 (8-bit) for the heavy linear layers. This halves memory and nearly doubles throughput. But FP8 has tiny dynamic range — requires careful block-wise quantization and high-precision accumulators.

↓

Result: $5.57M for a Frontier Model

2.788M H800 GPU hours total. Pre-training on 14.8T tokens took less than 2 months on 2,048 GPUs. Per trillion tokens: only 180K GPU-hours (3.7 days on the cluster).

Why these three together? MoE alone reduces per-token FLOPs but creates a communication bottleneck (routing tokens to remote experts). FP8 alone risks training instability at scale. MLA alone doesn't reduce training cost. Together, they compound: MoE cuts FLOPs, MLA cuts memory, FP8 cuts both — enabling a model that trains faster, serves cheaper, and performs better than any open-source competitor.

Two additional innovations make the system work in practice:

Auxiliary-loss-free load balancing: Traditional MoE models add a loss term to force tokens to spread evenly across experts. This hurts model quality. DeepSeek-V3 instead uses a dynamic bias trick — adjusting a per-expert bias at each training step to nudge routing toward balance, without ever touching the training objective.
Multi-token prediction (MTP): Instead of predicting just the next token, each position also predicts the token after that. This densifies the training signal (2x the gradient information per sequence) and forces the model to plan ahead in its representations.

Architecture summary: 61 Transformer layers | 128 attention heads | d_model = 7168 | MLA with d_c = 512 (KV compression), d'_c = 1536 (query compression), d_h^R = 64 (RoPE) | DeepSeekMoE: 1 shared expert + 256 routed experts per layer, top-8 routing | Sigmoid gating (not softmax) | Vocabulary: 129,280 tokens.

What are the three cost bottlenecks that DeepSeek-V3's architecture simultaneously attacks?

Latency, throughput, and accuracy
Per-token FLOPs (MoE), KV cache memory (MLA), and numerical precision cost (FP8): three independent bottlenecks addressed by three independent mechanisms
Data loading, gradient computation, and optimizer state

Chapter 2: Multi-head Latent Attention

Standard multi-head attention (MHA) stores a key vector and a value vector for every token at every layer and every head. For DeepSeek-V3 with 128 heads and d_h = 128, that is 128 × 128 = 16,384 dimensions of keys plus 16,384 dimensions of values per token per layer. For a 128K context, this cache is enormous.

MLA's insight: the keys and values across all heads are highly redundant. Instead of storing them directly, compress them into a much smaller latent vector and reconstruct the full keys and values on the fly.

The Compression

Given the hidden state h_t at position t, MLA first compresses it into a latent vector:

c_t^KV = W^DKV h_t ∈ R^d_c

where d_c = 512 is the KV compression dimension. Compare this to d_h·n_h = 128 × 128 = 16,384 for the keys alone in standard MHA: a 32x reduction, or 64x against keys and values together.

During inference, the KV cache stores only c_t^KV (512 dims) plus a small RoPE key k_t^R (64 dims) per token per layer. The full keys and values are reconstructed from the latent when needed:

k_t^C = W^UK c_t^KV ∈ R^(d_h·n_h)
v_t^C = W^UV c_t^KV ∈ R^(d_h·n_h)

The up-projection matrices W^UK and W^UV are learned parameters. They reconstruct the full-dimensional keys and values for all 128 heads from the tiny 512-dim latent.
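
The compress-then-reconstruct step can be sketched in a few lines of NumPy. The sizes below are deliberately tiny and the matrices are random stand-ins for learned weights, so this only illustrates the shapes and data flow, not the model's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes so the example runs instantly; DeepSeek-V3 uses
# d_model=7168, d_c=512, n_heads=128, d_head=128.
d_model, d_c, n_heads, d_head = 64, 8, 4, 16

# Random stand-ins for the learned projection matrices.
W_DKV = rng.standard_normal((d_c, d_model)) / np.sqrt(d_model)      # down-projection
W_UK = rng.standard_normal((n_heads * d_head, d_c)) / np.sqrt(d_c)  # key up-projection
W_UV = rng.standard_normal((n_heads * d_head, d_c)) / np.sqrt(d_c)  # value up-projection

h_t = rng.standard_normal(d_model)            # hidden state for one token

c_kv = W_DKV @ h_t                            # latent: the content that gets cached
k_c = (W_UK @ c_kv).reshape(n_heads, d_head)  # per-head content keys
v_c = (W_UV @ c_kv).reshape(n_heads, d_head)  # per-head values

print("cached latent:", c_kv.shape)
print("reconstructed keys:", k_c.shape, "values:", v_c.shape)
```

Every head reads the same cached latent, but each head gets its own rows of W^UK and W^UV, so the reconstructed keys and values still differ per head.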

Why RoPE Needs Special Treatment

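Rotary Position Embeddings (RoPE) are applied after projection, rotating query and key vectors based on their positions. That position dependence clashes with the latent compression: if RoPE were applied to the reconstructed keys, a position-specific rotation would sit between W^UK and the query, so W^UK could not be folded into the query projection at inference, and the full keys for every cached token would have to be recomputed at each step.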

MLA solves this by maintaining a decoupled RoPE key: a separate, small key vector k_t^R ∈ R⁶⁴ that carries RoPE. The final key is a concatenation:

k_t,i = [k_t,i^C ; k_t^R]

The content key k_t,i^C is head i's slice of the up-projected latent, so it differs per head even though every head reads the same cached c_t^KV. The RoPE key k_t^R is a single 64-dim vector shared across all heads. So the total KV cache per token per layer is d_c + d_h^R = 512 + 64 = 576 dimensions, down from 32,768 in standard MHA.

Full MLA data flow: h_t ∈ R⁷¹⁶⁸ → W^DKV → c_t^KV ∈ R⁵¹² (this is cached) | c_t^KV → W^UK → k_t^C ∈ R¹⁶³⁸⁴ (reconstructed, not cached) | c_t^KV → W^UV → v_t^C ∈ R¹⁶³⁸⁴ (reconstructed, not cached) | h_t → W^KR → RoPE → k_t^R ∈ R⁶⁴ (this is cached). Cache per token per layer: 512 + 64 = 576 floats. Standard MHA cache: 2 × 16,384 = 32,768 floats. Compression ratio: ~57x.
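
To see what the per-token numbers mean for a full context, the arithmetic can be carried through for one 128K-token sequence. The 2 bytes per cached value (BF16 storage) is an assumption for illustration; the dimensions are the ones given above.

```python
N_LAYERS = 61
N_HEADS, D_HEAD = 128, 128
D_C, D_ROPE = 512, 64
CONTEXT = 128 * 1024          # 128K tokens
BYTES_PER_VALUE = 2           # assumption: cache stored in BF16

mha_per_token_layer = 2 * N_HEADS * D_HEAD   # keys + values
mla_per_token_layer = D_C + D_ROPE           # latent + RoPE key

def cache_gib(values_per_token_layer: int) -> float:
    """KV cache size in GiB for one sequence at full context length."""
    total_bytes = values_per_token_layer * N_LAYERS * CONTEXT * BYTES_PER_VALUE
    return total_bytes / 2**30

mha_gib = cache_gib(mha_per_token_layer)
mla_gib = cache_gib(mla_per_token_layer)

print(f"standard MHA: {mha_gib:.1f} GiB per sequence")
print(f"MLA:          {mla_gib:.1f} GiB per sequence")
print(f"ratio:        {mha_gib / mla_gib:.1f}x")
```

The ratio is independent of context length and storage precision; those two assumptions only set the absolute sizes.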

Queries get the same low-rank treatment to reduce activation memory during training:

c_t^Q = W^DQ h_t ∈ R¹⁵³⁶ → W^UQ → q_t^C ∈ R¹⁶³⁸⁴

A separate RoPE query is produced and concatenated, matching the key structure. Then standard scaled dot-product attention proceeds as usual.

MLA Compression Visualizer

Drag the KV compression dimension d_c to see how cache size changes. The orange bar shows MLA's cache; the gray bar shows standard MHA. Watch the compression ratio explode as d_c shrinks.


The key trick for inference: During generation, the model only needs to load c_t^KV from the cache. The up-projection (W^UK and W^UV) can be absorbed into the query projection and output projection matrices respectively, so you never actually reconstruct the full K,V tensors. The attention is computed directly on the latent. This makes MLA not just memory-efficient but also compute-efficient during decoding.
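
The absorption works because matrix multiplication is associative: q · (W^UK c) equals (W^UKᵀ q) · c. A toy numerical check for a single head, ignoring the RoPE part of the key and using random stand-ins for the weights, might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d_c, d_head, n_cached = 8, 16, 5           # toy sizes

W_UK = rng.standard_normal((d_head, d_c))  # key up-projection for one head
q = rng.standard_normal(d_head)            # content query for the current token
C = rng.standard_normal((n_cached, d_c))   # cached latents, one row per past token

# Naive: reconstruct every key, then take dot products.
K = C @ W_UK.T                             # (n_cached, d_head)
scores_naive = K @ q

# Absorbed: fold W_UK into the query once, then score against the latents.
q_absorbed = W_UK.T @ q                    # (d_c,)
scores_absorbed = C @ q_absorbed

print(np.allclose(scores_naive, scores_absorbed))
```

The absorbed path does one small projection of the query instead of one up-projection per cached token, which is where the decoding savings come from.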

In MLA, what is stored in the KV cache per token per layer?

The compressed latent c_t^KV (512 dims) plus the decoupled RoPE key k_t^R (64 dims), a total of 576 dimensions instead of the 32,768 needed by standard MHA
The full key and value vectors for all 128 heads
Only the query vectors, since keys and values are recomputed

Chapter 3: DeepSeekMoE

In a standard Transformer, every token passes through the same feed-forward network (FFN). In a Mixture-of-Experts model, the FFN is replaced by many parallel "expert" FFNs, and a router decides which experts each token visits. Most tokens only activate a few experts, so compute per token stays small even as you add more experts (and more total parameters).

DeepSeek-V3's MoE layer has a distinctive structure:

Component	Count	Purpose
Shared experts	1 per layer	Always active. Captures common knowledge that every token needs.
Routed experts	256 per layer	Specialized. Each token visits its top-8 (K_r=8) by affinity score.
Router	1 per layer	Computes token-expert affinity s_i,t = sigmoid(u_t^T e_i) for routing.

The output for token t is:

h'_t = u_t + FFN_shared(u_t) + Σ_{i ∈ top-8} g_i,t · FFN_i(u_t)

where g_i,t is the normalized gating weight: the affinity score s_i,t divided by the sum of the top-8 scores. A crucial detail: DeepSeek-V3 uses sigmoid for affinities (not softmax over all experts). This means each expert's score is independent — changing one expert's affinity doesn't affect another's.
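
A minimal sketch of this layer for a single token follows. To stay short, each "expert" is a single random matrix rather than a real FFN, and the sizes are toy values; only the routing and gating logic is meant to mirror the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_routed, top_k = 16, 8, 2     # toy sizes; DeepSeek-V3 uses 256 routed experts, top-8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in "experts": each is a single weight matrix instead of a full FFN.
shared_W = rng.standard_normal((d, d)) / np.sqrt(d)
routed_W = rng.standard_normal((n_routed, d, d)) / np.sqrt(d)
centroids = rng.standard_normal((n_routed, d))   # e_i, one routing vector per expert

def moe_layer(u):
    s = sigmoid(centroids @ u)            # affinities s_i, independent per expert
    top = np.argsort(s)[-top_k:]          # indices of the top-k affinities
    g = s[top] / s[top].sum()             # gating weights, normalized over the selected
    routed = sum(g_i * (routed_W[i] @ u) for g_i, i in zip(g, top))
    return u + shared_W @ u + routed, top, g

u_t = rng.standard_normal(d)
h_out, chosen, gates = moe_layer(u_t)
print("experts chosen:", chosen, "gates:", gates.round(3))
```

Because the affinities come from a sigmoid, normalization happens only over the selected experts, which is why the gates sum to 1 over the top-k rather than over all experts.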

Auxiliary-Loss-Free Load Balancing

The biggest practical problem with MoE is load imbalance. If most tokens flock to a few popular experts, those experts become bottlenecks and the rest sit idle. The standard fix is an auxiliary loss that penalizes imbalance. But this loss fights the main training objective — it forces suboptimal routing to achieve balance, degrading model quality.

DeepSeek-V3's solution is elegant: dynamic bias terms. Each routed expert i has a bias b_i that is adjusted by a simple rule rather than learned by gradient descent. When the router decides which experts a token visits, it ranks them by s_i,t + b_i. But when it computes the actual gating weight (the scalar that multiplies the expert's output), it uses only s_i,t, so the bias never enters the layer's output or its gradients.

Routing decision: top-K of {s_i,t + b_i}
Gating weight: g_i,t = s_i,t / Σ_{j ∈ top-K} s_j,t

At the end of each training step, the system checks expert loads across the entire batch. Overloaded experts get b_i decreased by a fixed step γ (the bias update speed). Underloaded experts get b_i increased by γ. The bias nudges routing toward balance without ever adding a term to the loss function.
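
The update rule is simple enough to simulate. The sketch below uses 16 experts with top-2 routing and synthetic affinities that are skewed on purpose; the skew, the batch size, and the value of gamma are all illustrative assumptions, not settings from the report.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, n_tokens = 16, 2, 4096
gamma = 0.01                      # bias update speed (illustrative value)

# Skewed affinities: a fixed per-expert offset makes some experts more popular.
popularity = np.linspace(-1.0, 1.0, n_experts)
bias = np.zeros(n_experts)

def route(bias):
    """Route one synthetic batch and return the number of tokens per expert."""
    logits = rng.standard_normal((n_tokens, n_experts)) + popularity
    s = 1.0 / (1.0 + np.exp(-logits))                        # sigmoid affinities
    chosen = np.argsort(s + bias, axis=1)[:, -top_k:]        # bias affects selection only
    return np.bincount(chosen.ravel(), minlength=n_experts)

load_before = route(bias)
for _ in range(300):
    load = route(bias)
    # Overloaded experts: bias down by gamma; underloaded experts: bias up by gamma.
    bias -= gamma * np.sign(load - load.mean())
load_after = route(bias)

print("load std before:", load_before.std().round(1))
print("load std after: ", load_after.std().round(1))
```

Note that the simulation never computes gating weights from the biased scores; the bias only changes which experts are picked, matching the separation described above.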

Why this works better: Auxiliary losses create a tension: the model wants to route tokens to the best expert, but the loss penalizes uneven routing. The bias trick eliminates this tension. The gating weights (and thus gradients flowing through experts) only depend on true affinity scores. The bias only affects which experts are selected, not how much they contribute. Balance is achieved as a side effect of routing adjustments, not as a competing optimization objective. Ablation: the auxiliary-loss-free strategy improves over auxiliary-loss-based routing on MMLU (79.1 vs 78.6), MATH (66.2 vs 63.9), and HumanEval (57.3 vs 56.7) at the 228B parameter scale.

Expert Routing Simulator

16 experts, top-2 routing. Click "Route Tokens" to send a batch of tokens. Watch how the bias terms (orange bars) adjust to balance the load. Overloaded experts get their bias decreased; underloaded experts get it increased.


Node-limited routing: With 256 experts distributed across multiple nodes, letting a token reach any expert would require expensive cross-node communication. DeepSeek-V3 limits each token to at most M nodes (M = 4 in the technical report). Nodes are ranked by the sum of their highest K_r/M affinity scores, and the top-K_r experts are then chosen only from the M best nodes. This allows nearly full computation-communication overlap during training.
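
One way to picture node-limited routing for a single token is sketched below. It assumes the 256 experts are laid out evenly over 8 nodes, a cap of M = 4 nodes, and node scores computed as the sum of each node's top K_r/M affinities; the layout and the random affinities are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, experts_per_node = 8, 32      # 256 routed experts over 8 nodes (layout assumed)
top_k, max_nodes = 8, 4                # top-8 routing, at most M = 4 nodes per token

s = rng.random(n_nodes * experts_per_node)      # affinities for one token
per_node = s.reshape(n_nodes, experts_per_node)

# Score each node by the sum of its top (top_k / max_nodes) affinities.
k_per_node = top_k // max_nodes
node_score = np.sort(per_node, axis=1)[:, -k_per_node:].sum(axis=1)
allowed_nodes = np.argsort(node_score)[-max_nodes:]

# Mask out experts on all other nodes, then take the usual top-k.
masked = np.full_like(s, -np.inf)
for n in allowed_nodes:
    lo = n * experts_per_node
    masked[lo:lo + experts_per_node] = s[lo:lo + experts_per_node]
chosen = np.argsort(masked)[-top_k:]
nodes_used = np.unique(chosen // experts_per_node)

print("experts:", np.sort(chosen), "nodes used:", nodes_used)
```

The token still gets its eight experts, but they are guaranteed to live on at most four nodes, which bounds the cross-node traffic per token.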

Why does DeepSeek-V3 use bias terms for load balancing instead of an auxiliary loss?

Because the bias only affects which experts are selected (routing), not how much they contribute (gating weights), so it achieves balance without degrading the training objective, unlike auxiliary losses which force suboptimal routing
Because bias terms are cheaper to compute than auxiliary losses
Because auxiliary losses require a separate optimizer

Chapter 4: Multi-Token Prediction

Standard language model training uses a next-token prediction objective: at each position t, predict token t+1. This gives one gradient signal per position. DeepSeek-V3 extends this to multi-token prediction (MTP): at position t, predict tokens t+1 and t+2.

Why? Two reasons:

Denser training signal. Each position now generates two predictions instead of one. The model extracts more learning from the same data. On 14.8T tokens, this is like getting a significant fraction of additional training for free.
Better representations. To predict not just the next token but the one after, the model's hidden states must encode information about the further future. This forces the model to build richer, more forward-looking representations — the kind that help with complex reasoning.

How It Works

Unlike Gloeckle et al. (2024) who use parallel independent heads, DeepSeek-V3 uses sequential prediction with a causal chain. Here's the setup:

Depth 0: Main Model

The full Transformer processes input tokens t₁,...,t_T. Its output head predicts t₂,...,t_T+1 (standard next-token prediction). This produces hidden states h₁,...,h_T.

↓

Depth 1: MTP Module

An MTP module (one Transformer layer plus a projection, sharing the main model's embedding and output head) takes the embedding of t₂ (the ground-truth token during training) and concatenates it with h₁ to predict t₃. The causal chain: the representation used to predict t₂ feeds the prediction of t₃.

↓

Training Loss

L = L_NTP + λ · L_MTP. Both are cross-entropy losses. The MTP weight is λ = 0.3 for the first 10T tokens and 0.1 for the remaining 4.8T. At inference, the MTP modules can be discarded, leaving the main model to run on its own at zero extra cost.
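
The target alignment and the combined loss can be sketched as follows. Random logits stand in for the outputs of the main model and the MTP module, so the loss values themselves are meaningless; the point is which tokens each depth is scored against and how the two terms are weighted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 50, 10
lam = 0.3                               # MTP loss weight λ

tokens = rng.integers(0, vocab, size=seq_len)

# Depth 0 at position i targets token i+1; depth 1 targets token i+2.
ntp_targets = tokens[1:]                # predicted from positions 0 .. T-2
mtp_targets = tokens[2:]                # predicted from positions 0 .. T-3

def cross_entropy(logits, targets):
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Random logits stand in for the outputs of the main model and the MTP module.
ntp_logits = rng.standard_normal((len(ntp_targets), vocab))
mtp_logits = rng.standard_normal((len(mtp_targets), vocab))

loss_ntp = cross_entropy(ntp_logits, ntp_targets)
loss_mtp = cross_entropy(mtp_logits, mtp_targets)
loss = loss_ntp + lam * loss_mtp
print(f"L_NTP={loss_ntp:.3f}  L_MTP={loss_mtp:.3f}  L={loss:.3f}")
```

The depth-1 targets are one token shorter than the depth-0 targets because the last position has no token two steps ahead.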

Sequential vs. parallel MTP: Gloeckle et al. predict future tokens independently (one head per depth, no interaction between depths). DeepSeek-V3 maintains a complete causal chain: the prediction at depth k depends on the representation from depth k-1. This means the MTP module learns to refine its understanding sequentially: "if the next token is X, then the one after is probably Y." The MTP module can also be kept for speculative decoding at inference: it drafts the token after next, and the main model verifies the draft on its following step. The technical report cites an acceptance rate of 85–90% for the drafted second token, which yields roughly 1.8x the tokens per second (TPS).
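
The link between draft acceptance and decoding speed can be worked out directly. The sketch below assumes one drafted token per step and ignores any verification overhead, so it gives an upper bound; the acceptance rates are inputs, with 85–90% being the range cited in the technical report.

```python
def tokens_per_step(acceptance_rate: float) -> float:
    """Expected tokens emitted per main-model step with one drafted token."""
    # The main model's own token is always kept; the drafted token is
    # kept only when verification accepts it.
    return 1.0 + acceptance_rate

for p in (0.80, 0.85, 0.90):
    print(f"acceptance {p:.0%}: {tokens_per_step(p):.2f} tokens per step")
```

Under these assumptions the ideal gain at 85–90% acceptance is slightly above 1.8x, so a measured 1.8x is consistent with a small verification cost.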

Concretely, the MTP module at depth k takes two inputs for position t: (1) the hidden state h_t^(k-1) from the previous depth, and (2) the embedding of token t+k, which during training is the ground-truth token from the sequence. Each input is normalized with RMSNorm, the two are concatenated and projected back to the model dimension by a linear layer, and the result is fed through a single Transformer layer. The output then passes through the shared output head (the same one the main model uses) to produce logits for token t+k+1.

Ablation results: At the 15.7B scale (1.33T tokens), MTP improves: HumanEval 50.0 → 53.0, MBPP 60.2 → 64.0, MATH 46.4 → 49.0, GSM8K 74.0 → 77.4. At the 228.7B scale (540B tokens), MTP improves: HumanEval 57.3 → 60.4, MATH 62.2 → 66.2. Improvements are consistent across scales and benchmarks. At inference, the MTP module is dropped — the improvements come for free.

Multi-Token Prediction

Standard NTP (gray) predicts one token ahead. MTP (orange) predicts two. Toggle to see how the model sees the sequence at position t — and what it must predict.

Why can the MTP modules be discarded at inference without losing the quality improvements they provided during training?

Because the MTP modules share weights with the main model
Because the MTP objective is too noisy to help at inference time
Because MTP improved the main model's hidden representations during training. Those representations are permanently better, so the main model's next-token prediction benefits even after the MTP head is removed

Chapter 5: FP8 Training

Most large model training uses BF16 (bfloat16): 16 bits per number, with 8 bits for exponent and 7 for mantissa. This gives good dynamic range but uses 2 bytes per parameter and per activation. DeepSeek-V3 is the first model at this scale to train with FP8 (8 bits): half the memory, nearly double the throughput.

But FP8 has a fundamental problem: with only 8 bits, you get either decent range (E5M2: 5 exponent, 2 mantissa bits, giving the exponent range of FP16 but almost no precision) or decent precision (E4M3: 4 exponent, 3 mantissa bits, giving better precision but limited range). Neither is great on its own.
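
The range gap between the two formats follows from their bit layouts. The helper below is a hypothetical illustration that derives the largest finite value of each format, assuming the commonly used definitions in which E5M2 reserves its top exponent code for inf/NaN while E4M3 reserves only one bit pattern for NaN.

```python
def max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a small float format with bias 2^(e-1) - 1."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2 follows this rule).
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** -man_bits
    else:
        # E4M3 reserves only the all-ones mantissa at the top exponent for
        # NaN, so the top exponent is otherwise usable.
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 ** -(man_bits - 1)
    return max_mantissa * 2.0 ** max_exp

e4m3_max = max_finite(4, 3, ieee_like=False)
e5m2_max = max_finite(5, 2, ieee_like=True)

print("E4M3 max:", e4m3_max, " relative step: 2^-3")
print("E5M2 max:", e5m2_max, " relative step: 2^-2")
```

E4M3 trades more than two orders of magnitude of range for one extra mantissa bit, which is why per-block scale factors matter so much when it is used for every tensor.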

The Mixed-Precision Strategy

DeepSeek-V3's solution uses FP8 for the expensive parts and higher precision for the sensitive parts:

Operation	Precision	Why
Linear layer forward (GEMM)	FP8 (E4M3)	Bulk of compute. FP8 Tensor Cores are 2x faster.
Linear layer backward (GEMM)	FP8 (E4M3)	Gradient matmuls also benefit from 2x speedup.
Accumulation in GEMM	FP32	Tiny errors compound over 7168-dim dot products. Must accumulate in high precision.
Attention, normalization, routing	BF16 / FP32	These are numerically sensitive. FP8 here causes instability.
Optimizer states	BF16	AdamW first and second moments are tracked in BF16 without observable degradation.
Master weights, gradients	FP32	Updates are tiny; low precision would lose them.
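
The accumulation row can be illustrated with a stand-in experiment. NumPy has no FP8 type, so the sketch below uses float16 as the low-precision accumulator; this is an assumption chosen only to show the effect of accumulator width over a 7168-term dot product, not a model of Tensor Core behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 7168
a = rng.random(n).astype(np.float16)
b = rng.random(n).astype(np.float16)

# High-precision reference for the same inputs.
reference = float(np.dot(a.astype(np.float64), b.astype(np.float64)))

def dot_with_accumulator(a, b, acc_dtype):
    """Dot product whose running sum is rounded to acc_dtype after every add."""
    acc = acc_dtype(0.0)
    for x, y in zip(a, b):
        # Each product is formed in float32, then added in the accumulator's precision.
        acc = acc_dtype(acc + acc_dtype(np.float32(x) * np.float32(y)))
    return float(acc)

err_low = abs(dot_with_accumulator(a, b, np.float16) - reference)
err_fp32 = abs(dot_with_accumulator(a, b, np.float32) - reference)

print(f"reference: {reference:.2f}")
print(f"error with float16 accumulator: {err_low:.3f}")
print(f"error with float32 accumulator: {err_fp32:.6f}")
```

Once the running sum grows large, a narrow accumulator can no longer represent small increments, so individual products are partly or wholly rounded away.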

Block-Wise Quantization

A single FP8 scale factor per tensor is too coarse: if one element is orders of magnitude larger than the rest, the scale must accommodate it and the small values lose most of their precision or round to zero. DeepSeek-V3 uses fine-grained quantization instead: weights get one scale factor per 128×128 block, and activations get one per 1×128 tile (per token, per 128 channels). This preserves precision across the wildly varying magnitudes in real neural network tensors.
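
The effect of a local scale factor can be simulated. The sketch below rounds values to an approximate E4M3 grid (the `fake_e4m3` helper is a hypothetical simulation, since NumPy has no FP8 type) and compares one scale for a whole 1024-element vector against one scale per 128 elements, in the style of the 1×128 activation tiles. The data and the outlier are synthetic.

```python
import numpy as np

E4M3_MAX = 448.0

def fake_e4m3(x):
    """Round to a simulated E4M3 grid: 4 significant bits, smallest step 2**-9."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    _, e = np.frexp(x)                       # x = m * 2**e with 0.5 <= |m| < 1
    e = np.maximum(e, -5)                    # below 2**-6 the grid stops getting finer
    step = np.ldexp(1.0, e - 4)
    return np.clip(np.round(x / step) * step, -E4M3_MAX, E4M3_MAX)

def quantize(x, group):
    """Scale each group of `group` elements to the E4M3 range, round, and rescale."""
    g = x.reshape(-1, group)
    scale = E4M3_MAX / np.abs(g).max(axis=1, keepdims=True)
    return (fake_e4m3(g * scale) / scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.01, size=1024)
x[5] = 1000.0                                # a single large outlier

def mean_rel_error(x_hat):
    return np.mean(np.abs(x_hat - x) / np.abs(x))

err_tensor = mean_rel_error(quantize(x, group=1024))   # one scale for everything
err_block = mean_rel_error(quantize(x, group=128))     # one scale per 128 elements

print(f"per-tensor scale: mean relative error {err_tensor:.3f}")
print(f"per-128 scale:    mean relative error {err_block:.3f}")
```

With per-group scaling the outlier only degrades the one group that contains it; the other seven groups keep a scale matched to their own magnitudes.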

Why this was unprecedented: Earlier FP8 training work had been validated mainly on much smaller models. In the report's validation runs against BF16 baselines, the relative loss error of FP8 training stays below 0.25%. The trick is threefold: (1) fine-grained quantization with 128-element groups preserves local precision, (2) FP32 accumulation prevents dot-product error from compounding over the 7168 hidden dimension, and (3) only the linear GEMMs use FP8 while attention, normalization, and routing stay in BF16 or FP32. This selective approach captures most of the speedup (since linear layers dominate compute) while keeping the sensitive operations accurate.

Concrete savings: FP8 halves the storage for every tensor kept in that format (8 bits vs 16 bits per element). Combined with the other memory optimizations, this lets the 671B model train on 2,048 H800 GPUs without tensor parallelism. Without these savings, the memory footprint would require either tensor parallelism (expensive communication overhead) or GPUs with more memory. The theoretical throughput gain for the dominant linear operations is up to 2x over BF16.

FP8 vs BF16 Precision

See how FP8 E4M3 represents numbers compared to BF16. Drag the value slider to see quantization error. Note how block-wise scaling (orange) keeps error small by using a local scale factor, while tensor-wise scaling (gray) loses precision on small values.


Why does DeepSeek-V3 use FP32 accumulation inside FP8 matrix multiplications?

Because each dot product sums ~7168 FP8 multiplications, and tiny rounding errors compound over thousands of additions. FP32 accumulation prevents this drift from corrupting the result
Because FP8 cannot represent numbers larger than 256
Because the GPU hardware requires FP32 accumulation for all operations

Chapter 6: Pre-training

Pre-training is where the model consumes raw text and learns the statistical structure of language. For DeepSeek-V3, this stage consumed 2,664K H800 GPU hours — 95.6% of the total training budget. Everything about it was optimized for stability and throughput.

Data Pipeline

DeepSeek-V3 was pre-trained on 14.8 trillion tokens of diverse, high-quality data. The composition was intentionally tuned:

Emphasis on math and code data was increased compared to DeepSeek-V2, improving reasoning capabilities.
A document packing method was used to stitch multiple documents into each training sequence. According to the technical report, no cross-sample attention mask is applied during training, so tokens can attend across document boundaries within a pack.
The tokenizer uses Byte-level BPE with a vocabulary of 128K tokens, extended to 129,280 for efficiency (multiple of GPU tile sizes).

Training Infrastructure: DualPipe

With 2,048 H800 GPUs connected by InfiniBand and NVLink, the training framework uses:

Pipeline parallelism (16-way) with DualPipe: a novel scheduling algorithm that overlaps computation and communication for both the forward and backward passes. Standard pipeline parallelism wastes time in "bubbles" (idle GPUs waiting for data). DualPipe reduces bubbles to near zero by interleaving micro-batches from both ends of the pipeline simultaneously.
Expert parallelism (64-way) for the MoE layers: experts are distributed across nodes, and tokens are routed via all-to-all communication. Custom kernels fully utilize both InfiniBand (inter-node, 400 Gbps) and NVLink (intra-node, 160 GB/s).
No tensor parallelism. Thanks to FP8's memory savings and other memory optimizations (such as recomputing cheap activations during the backward pass), DeepSeek-V3 fits within the memory budget without splitting individual operations across GPUs, eliminating the most communication-heavy parallelism strategy.

Training stability: Throughout the entire pre-training on 14.8T tokens, DeepSeek-V3 experienced zero irrecoverable loss spikes and required zero rollbacks. This is extraordinary for a model of this scale. For comparison, many frontier model training runs report multiple loss spikes requiring rollbacks to earlier checkpoints (wasting days of compute). The authors attribute this stability to: (1) careful hyperparameter tuning, (2) the auxiliary-loss-free balancing preventing routing collapse, and (3) FP8 training with sufficient high-precision safeguards.

Hyperparameters

Hyperparameter	Value
Optimizer	AdamW (β₁=0.9, β₂=0.95)
Learning rate	Peak 2.2×10^-4, cosine decay
Weight decay	0.1
Batch size ramp	3072 → 15360 sequences
Sequence length	4096 tokens (pre-training), extended to 32K then 128K
Gradient clipping	1.0

Context Length Extension

After the main pre-training phase, context length was extended in two stages: first to 32K tokens, then to 128K. This used YaRN (Yet another RoPE extension) to rescale the rotary position embeddings. Each stage trained for a short additional period — 119K GPU hours total, a small fraction of the pre-training cost.

Cost breakdown: Pre-training: 2,664K GPU-hours ($5.328M) | Context extension: 119K GPU-hours ($0.238M) | Post-training: 5K GPU-hours ($0.01M) | Total: 2,788K GPU-hours ($5.576M). Per trillion tokens: 180K GPU-hours = 3.7 days on 2,048 H800s. This is roughly 11x cheaper than Llama 3.1 405B's 30.8M GPU-hours.
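
The breakdown is easy to re-derive from the GPU-hour figures. The snippet below only reproduces the arithmetic in the text, using the same assumed rental price of $2 per H800 GPU-hour.

```python
PRICE_PER_GPU_HOUR = 2.0          # USD, the rental price assumed in the text
N_GPUS = 2048

gpu_hours = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}

total_hours = sum(gpu_hours.values())
total_cost = total_hours * PRICE_PER_GPU_HOUR
days_per_trillion = 180_000 / N_GPUS / 24      # 180K GPU-hours per trillion tokens
vs_llama = 30_800_000 / total_hours            # Llama 3.1 405B GPU-hours / V3 GPU-hours

for stage, hours in gpu_hours.items():
    print(f"{stage:18s} {hours:>9,} h  ${hours * PRICE_PER_GPU_HOUR / 1e6:.3f}M")
print(f"{'total':18s} {total_hours:>9,} h  ${total_cost / 1e6:.3f}M")
print(f"days per trillion tokens: {days_per_trillion:.1f}")
print(f"Llama 3.1 405B used {vs_llama:.1f}x the GPU-hours")
```

The dollar figures scale linearly with the assumed hourly price, while the GPU-hour ratio against Llama 3.1 405B does not depend on it.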

What parallelism strategy does DeepSeek-V3 intentionally avoid, and why?

Tensor parallelism, because FP8 training reduces memory enough that individual operations don't need to be split across GPUs, eliminating the heavy communication overhead that tensor parallelism requires
Pipeline parallelism, because DualPipe makes it unnecessary
Data parallelism, because MoE already distributes the workload

Chapter 7: Post-training

The pre-trained base model is a powerful text predictor, but it doesn't follow instructions or align with human preferences. Post-training transforms it into a helpful assistant through two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).

Supervised Fine-Tuning

DeepSeek-V3 was fine-tuned on 1.5 million instruction-response pairs spanning diverse domains:

Math and reasoning: Problems with verified solutions, many generated by an internal reasoning engine.
Code: Programming tasks with compiler-verified outputs.
Creative writing: High-quality human-written examples plus model-generated data with quality filtering.
General Q&A: Diverse instruction-following tasks.

During SFT, the MTP objective was maintained — the model continued predicting 2 tokens ahead, which helped the fine-tuned model retain the stronger representations learned during pre-training.

Reinforcement Learning

After SFT, RL further aligns the model using Group Relative Policy Optimization (GRPO). Unlike PPO, GRPO doesn't require a separate critic network. Instead, it samples a group of responses per prompt, scores each one, and uses the group's own statistics as the baseline: each response's advantage is its reward normalized by the group's mean and standard deviation.
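
The group-relative baseline at the heart of GRPO can be sketched in a few lines. This covers only the advantage computation, following the standard GRPO formulation of normalizing rewards within a group; the clipped policy-gradient update and KL penalty are omitted, and the rewards are made-up values.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)   # eps guards against identical rewards

# Eight sampled responses to one prompt, scored 1 (correct) or 0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
print(adv.round(3))
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no learned value function is needed to decide which samples to reinforce.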

Two types of reward signals were used:

Rule-based rewards: For math and code, correctness is verified automatically (compiler passes, answer matches). This gives a binary, unambiguous signal.
Model-based rewards: For open-ended tasks, a reward model scores helpfulness, harmlessness, and quality. The reward model was itself trained on human preference data.

Distillation from DeepSeek-R1: A remarkable innovation: DeepSeek-V3's reasoning capabilities were boosted by distilling from the DeepSeek-R1 series (long chain-of-thought models). During SFT, some training examples included R1-style reasoning traces — step-by-step problem-solving that the V3 model learned to internalize. The pipeline elegantly incorporates R1's verification and reflection patterns into V3 while maintaining control over output length (V3 doesn't need to produce the full CoT at inference). This contributed significant gains on AIME 2024 (math competition): from 23.6% (no distillation) to 39.2% (with R1 distillation).

Self-rewarding: DeepSeek-V3 can also serve as its own reward model. When used as a generative reward model (scoring responses via prompted evaluation rather than a trained scalar head), it achieves competitive accuracy on RewardBench (92.3%), nearly matching specialized reward models. This enables a potential self-improvement loop where the model evaluates and improves its own outputs.

How does distillation from DeepSeek-R1 improve DeepSeek-V3's reasoning without making it produce long chains of thought?

R1-style reasoning traces are included in SFT training data, so V3 internalizes the verification and reflection patterns into its representations, but output length is controlled during training so V3 doesn't need to explicitly produce the full CoT at inference
R1's weights are directly merged into V3 using model averaging
V3 is trained to compress R1's CoT into a single token

Chapter 8: Results

DeepSeek-V3 was evaluated against both open-source models (Llama 3.1 405B, Qwen 2.5 72B) and closed-source frontier models (GPT-4o, Claude 3.5 Sonnet). The results position it as the strongest open-source model and competitive with the best closed-source systems.

Knowledge and Reasoning

Benchmark	DeepSeek-V3	GPT-4o	Claude 3.5	Llama 405B
MMLU	88.5	87.2	88.3	88.6
MMLU-Pro	75.9	72.6	78.0	73.3
GPQA-Diamond	59.1	49.9	65.0	51.1
DROP (F1)	91.6	83.7	88.3	88.7

Math

Benchmark	DeepSeek-V3	GPT-4o	Claude 3.5	Llama 405B
MATH-500	90.2	74.6	78.3	73.8
AIME 2024	39.2	9.3	16.0	23.3
GSM8K	95.5	92.2	95.0	89.0

Code

Benchmark	DeepSeek-V3	GPT-4o	Claude 3.5	Llama 405B
HumanEval	82.6	90.2	93.7	89.0
LiveCodeBench	40.5	34.2	38.8	28.5
Codeforces (pct.)	51.6	23.6	20.3	25.3
SWE-bench Verified	42.0	38.8	50.8	33.4

The standout results: On MATH-500, DeepSeek-V3 scores 90.2% — surpassing GPT-4o (74.6%) and Claude 3.5 Sonnet (78.3%) by a wide margin. On AIME 2024 (competition math), it reaches 39.2% vs GPT-4o's 9.3%. On Codeforces (competitive programming), it achieves the 51.6th percentile vs 23.6th for GPT-4o. These are not small differences — they represent a qualitative leap in math and code reasoning for an open-source model.

Benchmark Comparison

Performance across six key benchmarks. DeepSeek-V3 (orange) vs GPT-4o (gray) vs Claude 3.5 Sonnet (teal). Higher is better.

Cost efficiency: DeepSeek-V3 achieves these results at 2.788M GPU-hours ($5.57M). Llama 3.1 405B used 30.8M GPU-hours. Even accounting for the fact that Llama uses dense architecture (no MoE routing overhead), DeepSeek-V3 is roughly 11x cheaper to train while delivering competitive or superior performance on most benchmarks.

On which category of benchmarks does DeepSeek-V3 show the most dramatic advantage over GPT-4o?

Math and competitive programming: 90.2% on MATH-500 (vs 74.6%), 39.2% on AIME 2024 (vs 9.3%), and the 51.6th percentile on Codeforces (vs 23.6th), a qualitative leap attributable in part to R1 distillation and increased math/code pre-training data
General knowledge (MMLU)
Long-context tasks

Chapter 9: Connections

DeepSeek-V3 sits at the intersection of several major research threads. Let's map where each innovation came from and where it leads.

MLA's Lineage

Multi-Query Attention (MQA, Shazeer 2019) first proposed sharing K,V across heads to reduce cache. Grouped-Query Attention (GQA, Ainslie et al. 2023) added groups of heads sharing K,V. MLA goes further: instead of sharing, it compresses all K,V into a joint latent. This is more expressive than sharing, because the up-projection can learn different K,V for each head from the same latent. MLA first appeared in DeepSeek-V2 and is retained unchanged in V3.

The MoE Thread

Shazeer et al. (2017) introduced the modern sparsely gated MoE layer, and GShard (Lepikhin et al. 2020) scaled it to Transformers. Switch Transformers (Fedus et al. 2021) simplified to top-1 routing. Mixtral (Jiang et al. 2024) showed that a sparse MoE can match much larger dense models. DeepSeekMoE (Dai et al. 2024) introduced fine-grained experts (256 small experts vs. 8 large ones) and shared experts. The auxiliary-loss-free balancing in V3 is the latest evolution, targeting the quality-vs-balance trade-off of auxiliary-loss-based MoE training.

Relation to DeepSeek-R1

DeepSeek-R1 is a separate model family focused on long chain-of-thought reasoning (trained with RL to produce step-by-step solutions). V3's post-training distills R1's reasoning patterns into a standard-length output model. This "best of both worlds" approach — R1-quality reasoning without R1-length outputs — is a major practical innovation.

Relation to Kimi K2

Moonshot AI's Kimi K2 (2025) builds on many of the same ideas: MoE architecture, auxiliary-loss-free balancing (citing DeepSeek-V3), and MTP. K2 extends the approach with Muon optimizer and DAPO for RL, representing the next generation of this design family.

Cheat Sheet

Aspect	DeepSeek-V3
Total parameters	671B (37B activated per token)
Architecture	Transformer + MLA + DeepSeekMoE
Attention	MLA: d_c=512, d'_c=1536, 128 heads
MoE	1 shared + 256 routed experts, top-8
Load balancing	Auxiliary-loss-free (dynamic bias)
Training objective	NTP + MTP (λ=0.3, depth=1)
Training precision	FP8 mixed (linear layers) + BF16/FP32
Pre-training tokens	14.8T
Context length	128K (extended from 4K via YaRN)
Training cost	2.788M H800 GPU-hours ($5.57M)
Hardware	2,048 H800 GPUs
Parallelism	Pipeline (DualPipe) + Expert (no tensor)
Training stability	Zero irrecoverable loss spikes, zero rollbacks
Post-training	SFT (1.5M pairs) + GRPO + R1 distillation
Key results	MATH-500: 90.2%, AIME: 39.2%, Codeforces: 51.6 pct

The broader lesson: DeepSeek-V3 demonstrates that frontier model performance doesn't require frontier budgets. Through careful architectural design (MLA, MoE), training innovation (FP8, DualPipe), and clever post-training (R1 distillation, GRPO), a $5.57M model can compete with systems costing orders of magnitude more. The future of large models may be less about raw scale and more about engineering efficiency.

What is the key difference between how MLA compresses the KV cache vs. how Grouped-Query Attention (GQA) reduces it?

GQA uses fewer parameters than MLA
GQA shares identical K,V vectors across groups of heads (less flexibility). MLA compresses all K,V into a low-rank latent and reconstructs different K,V per head via learned up-projections (more flexible, higher compression).
MLA doesn't use positional encoding while GQA does

DeepSeek-V3 Technical Report