DeepSeek-AI — 2024

DeepSeek-V3 Technical Report

671B parameters, 37B activated per token. Multi-head Latent Attention compresses the KV cache. DeepSeekMoE routes tokens without auxiliary losses. FP8 training from scratch. All for $5.57M and zero loss spikes.

Prerequisites: Transformer attention + Mixture of Experts basics + Language model training
10 chapters | 5+ simulations

Chapter 0: The Problem

You want to build a frontier language model — one that rivals GPT-4o and Claude 3.5 Sonnet on math, code, and reasoning. The obvious approach is to scale up a dense Transformer: more layers, more heads, more parameters. But you hit two walls immediately.

Wall 1: Training cost. GPT-4 reportedly cost over $100M to train. Llama-3.1 405B required 30.8M GPU hours. At $2/GPU-hour, that's $61.6M just for pre-training. These numbers make frontier models a game only a handful of companies can play.

Wall 2: Inference cost. A 405B dense model activates every parameter on every token. The KV cache alone — storing all the keys and values from previous tokens for attention — grows linearly with context length and model width. For a 405B model serving long contexts, you need expensive multi-GPU setups just to hold the model in memory, let alone run it fast.

The core tension: Capacity (total parameters) drives quality. But activation cost (parameters used per token) drives both training and inference expense. A 671B model that activates only 37B per token gets the capacity of a massive model at the cost of a medium one — if you can make the routing, attention, and training efficient enough.
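The capacity-versus-activation argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the common rule of thumb of about 6 training FLOPs per active parameter per token; that rule and the function name are illustrative assumptions, not figures from the report, and attention cost is ignored.

```python
def training_flops_per_token(active_params: float) -> float:
    # Assumed rule of thumb: ~6 FLOPs per active parameter per token
    # (forward plus backward pass). Attention FLOPs are ignored.
    return 6.0 * active_params

dense_flops = training_flops_per_token(405e9)  # dense: every parameter is active
moe_flops = training_flops_per_token(37e9)     # MoE: only activated parameters count
ratio = dense_flops / moe_flops

print(f"Dense 405B:            {dense_flops:.2e} FLOPs/token")
print(f"MoE 671B (37B active): {moe_flops:.2e} FLOPs/token")
print(f"Dense / MoE ratio:     {ratio:.1f}x")
```

Under this approximation the total parameter count never enters the per-token cost, which is why a 671B MoE model can be cheaper per token than a 405B dense one.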

DeepSeek-V3 attacks both walls simultaneously. It is a Mixture-of-Experts (MoE) model with 671B total parameters but only 37B activated per token. It introduces Multi-head Latent Attention (MLA) to compress the KV cache by an order of magnitude. And it pioneers FP8 mixed-precision training to halve the memory and compute per operation. The result: a model that matches GPT-4o on major benchmarks, trained for 2.788M H800 GPU hours — roughly $5.57M.

Zero loss spikes. No rollbacks. This is the report of how they did it.

The numbers at a glance: 671B total parameters | 37B activated per token | 14.8T pre-training tokens | 2,048 H800 GPUs | 2.788M GPU-hours total ($5.57M) | 128K context length | Pre-training: 2,664K GPU-hours | Context extension: 119K GPU-hours | Post-training: 5K GPU-hours | Zero irrecoverable loss spikes | Zero rollbacks.
[Interactive simulation: Dense vs. MoE Cost. Compares FLOPs per token for a dense model, which activates every parameter, against an MoE model, which activates only a fraction, as total parameter count varies. The gap grows with scale.]
Why does a 671B MoE model cost less to run per token than a 405B dense model?

Chapter 1: The Key Insight

DeepSeek-V3's design philosophy is a triple decomposition: separate capacity from activation cost (MoE), separate cache size from attention quality (MLA), and separate numerical precision from training stability (FP8). Each decomposition attacks a different cost bottleneck.

MoE: Capacity without Cost
671B parameters give the model massive knowledge capacity. But only 37B are active per token (8 routed experts + 1 shared expert per layer). Training cost scales with activated params, not total params.
MLA: Small Cache, Full Quality
Standard multi-head attention stores K and V for every head at every position, which makes the cache enormous. MLA compresses K and V into a single low-dimensional latent vector c. The cache stores c (512 dims) instead of the full K and V (16,384 dims each), and K and V are reconstructed on the fly during attention.
FP8: Half the Bits, Same Quality
Most training uses BF16 (16-bit). DeepSeek-V3 trains in FP8 (8-bit) for the heavy linear layers. This halves memory and nearly doubles throughput. But FP8 has tiny dynamic range — requires careful block-wise quantization and high-precision accumulators.
Result: $5.57M for a Frontier Model
2.788M H800 GPU hours total. Pre-training on 14.8T tokens took less than 2 months on 2,048 GPUs. Per trillion tokens: only 180K GPU-hours (3.7 days on the cluster).
Why these three together? MoE alone reduces per-token FLOPs but creates a communication bottleneck (routing tokens to remote experts). FP8 alone risks training instability at scale. MLA alone doesn't reduce training cost. Together, they compound: MoE cuts FLOPs, MLA cuts memory, FP8 cuts both — enabling a model that trains faster, serves cheaper, and performs better than any open-source competitor.

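Two additional innovations, covered in Chapters 3 and 4, make the system work in practice: auxiliary-loss-free load balancing for the MoE router and a multi-token prediction (MTP) training objective.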

Architecture summary: 61 Transformer layers | 128 attention heads | $d_{model} = 7168$ | MLA with $d_c = 512$ (KV compression), $d'_c = 1536$ (query compression), $d_h^R = 64$ (RoPE) | DeepSeekMoE: 1 shared expert + 256 routed experts per layer, top-8 routing | Sigmoid gating (not softmax) | Vocabulary: 129,280 tokens.
What are the three cost bottlenecks that DeepSeek-V3's architecture simultaneously attacks?

Chapter 2: Multi-head Latent Attention

Standard multi-head attention (MHA) stores a key vector and a value vector for every token at every layer and every head. For DeepSeek-V3 with $n_h = 128$ heads and per-head dimension $d_h = 128$, that is 128 × 128 = 16,384 dimensions of keys plus 16,384 dimensions of values per token per layer. For a 128K context, this cache is enormous.

MLA's insight: the keys and values across all heads are highly redundant. Instead of storing them directly, compress them into a much smaller latent vector and reconstruct the full keys and values on the fly.

The Compression

Given the hidden state $h_t$ at position t, MLA first compresses it into a latent vector:

$c_t^{KV} = W^{DKV} h_t \in \mathbb{R}^{d_c}$

where $d_c = 512$ is the KV compression dimension. Compare this to $d_h n_h$ = 128 × 128 = 16,384 for the keys (or the values) of standard MHA. That is a 32x compression relative to either one, and more once both are counted.

During inference, the KV cache stores only $c_t^{KV}$ (512 dims) plus a small RoPE key $k_t^R$ (64 dims) per token per layer. The full keys and values are reconstructed from the latent when needed:

$k_t^C = W^{UK} c_t^{KV} \in \mathbb{R}^{d_h n_h}$
$v_t^C = W^{UV} c_t^{KV} \in \mathbb{R}^{d_h n_h}$

The up-projection matrices $W^{UK}$ and $W^{UV}$ are learned parameters. They reconstruct the full-dimensional keys and values for all 128 heads from the 512-dim latent.

Why RoPE Needs Special Treatment

Rotary Position Embeddings (RoPE) rotate query and key vectors by an angle that depends on their positions. If the rotation were applied to keys reconstructed from the latent, a position-dependent rotation matrix would sit between the query projection and $W^{UK}$. Matrix multiplication does not commute, so $W^{UK}$ could no longer be folded into the query projection at inference, and the full keys would have to be recomputed for every cached token.

MLA solves this by maintaining a decoupled RoPE key: a separate, small key vector $k_t^R \in \mathbb{R}^{64}$ that carries RoPE. The final key is a concatenation:

$k_{t,i} = [k_{t,i}^C ; k_t^R]$

The content key $k_{t,i}^C$ is head i's slice of the up-projected latent, so it differs per head even though the latent itself is shared. The RoPE key $k_t^R$ is shared across all heads (only 64 dims). So the total KV cache per token per layer is $d_c + d_h^R$ = 512 + 64 = 576 dimensions, down from 32,768 in standard MHA.

Full MLA data flow: $h_t \in \mathbb{R}^{7168}$ → $W^{DKV}$ → $c_t^{KV} \in \mathbb{R}^{512}$ (this is cached) | $c_t^{KV}$ → $W^{UK}$ → $k_t^C \in \mathbb{R}^{16384}$ (reconstructed, not cached) | $c_t^{KV}$ → $W^{UV}$ → $v_t^C \in \mathbb{R}^{16384}$ (reconstructed, not cached) | $h_t$ → $W^{KR}$ → RoPE → $k_t^R \in \mathbb{R}^{64}$ (this is cached). Cache per token per layer: 512 + 64 = 576 values. Standard MHA cache: 2 × 16,384 = 32,768 values. Compression ratio: ~57x.
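To see what the per-token numbers mean at full context, the following sketch multiplies them out across layers and positions. The dimensions come from the text; the 2-byte (BF16) cache entries and the reading of 128K as 128 × 1024 tokens are assumptions made for illustration.

```python
N_HEADS, HEAD_DIM = 128, 128
D_C, D_ROPE = 512, 64
N_LAYERS = 61
CONTEXT = 128 * 1024      # assumption: "128K" read as 131,072 tokens
BYTES_PER_VALUE = 2       # assumption: cache entries stored in BF16

mha_per_token = 2 * N_HEADS * HEAD_DIM   # keys plus values for every head
mla_per_token = D_C + D_ROPE             # latent plus shared RoPE key

def cache_gib(values_per_token_per_layer: int) -> float:
    """Cache size in GiB for one sequence at full context length."""
    total_bytes = values_per_token_per_layer * N_LAYERS * CONTEXT * BYTES_PER_VALUE
    return total_bytes / 2**30

ratio = mha_per_token / mla_per_token
print(f"MHA: {mha_per_token} values/token/layer, {cache_gib(mha_per_token):.1f} GiB")
print(f"MLA: {mla_per_token} values/token/layer, {cache_gib(mla_per_token):.2f} GiB")
print(f"Compression ratio: {ratio:.1f}x")
```

The ratio is independent of the assumed byte width and context length; those only set the absolute sizes.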

Queries get the same low-rank treatment to reduce activation memory during training:

$c_t^Q = W^{DQ} h_t \in \mathbb{R}^{1536}$, then $q_t^C = W^{UQ} c_t^Q \in \mathbb{R}^{16384}$

A separate RoPE query is produced and concatenated, matching the key structure. Then standard scaled dot-product attention proceeds as usual.

[Interactive simulation: MLA Compression Visualizer. Shows MLA's cache size against standard MHA as the KV compression dimension $d_c$ varies (default 512); the compression ratio grows as $d_c$ shrinks.]
The key trick for inference: During generation, the model only needs to load $c_t^{KV}$ and the small RoPE key $k_t^R$ from the cache. The up-projections $W^{UK}$ and $W^{UV}$ can be absorbed into the query projection and the output projection respectively, so the full K and V tensors are never materialized. The content part of attention is computed directly on the latent. This makes MLA efficient during decoding not only in cache size but also in the work needed per cached token.
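The absorption trick rests on a simple identity: the score between a query and a reconstructed key equals the score between a re-projected query and the latent itself. The sketch below checks this for a single head with toy sizes and random matrices; all names and dimensions are illustrative, and only the content part of the key is covered (the RoPE part is handled separately).

```python
import random

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
D_HEAD, D_C = 8, 4   # toy sizes; the model uses 128 and 512
w_uk = [[random.gauss(0, 1) for _ in range(D_C)] for _ in range(D_HEAD)]
q = [random.gauss(0, 1) for _ in range(D_HEAD)]   # one head's content query
c_kv = [random.gauss(0, 1) for _ in range(D_C)]   # cached latent for one token

# Naive path: rebuild the key from the latent, then take the dot product.
k = matvec(w_uk, c_kv)
score_naive = dot(q, k)

# Absorbed path: move W_UK onto the query once, then score against the latent.
q_latent = matvec(transpose(w_uk), q)
score_absorbed = dot(q_latent, c_kv)

print(score_naive, score_absorbed)
```

The re-projected query is computed once per decoding step, whereas the naive path would rebuild a key for every cached position.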
In MLA, what is stored in the KV cache per token per layer?

Chapter 3: DeepSeekMoE

In a standard Transformer, every token passes through the same feed-forward network (FFN). In a Mixture-of-Experts model, the FFN is replaced by many parallel "expert" FFNs, and a router decides which experts each token visits. Most tokens only activate a few experts, so compute per token stays small even as you add more experts (and more total parameters).

DeepSeek-V3's MoE layer has a distinctive structure:

Component | Count | Purpose
Shared experts | 1 per layer | Always active. Captures common knowledge that every token needs.
Routed experts | 256 per layer | Specialized. Each token visits its top-8 ($K_r = 8$) by affinity score.
Router | 1 per layer | Computes token-expert affinity $s_{i,t} = \mathrm{sigmoid}(u_t^T e_i)$ for routing.

The output for token t is:

$h'_t = u_t + \mathrm{FFN}^{shared}(u_t) + \sum_{i \in \text{top-8}} g_{i,t} \cdot \mathrm{FFN}_i(u_t)$

where $g_{i,t}$ is the normalized gating weight: the affinity score $s_{i,t}$ divided by the sum of the top-8 scores. A crucial detail: DeepSeek-V3 uses sigmoid for affinities (not softmax over all experts). This means each expert's score is independent: changing one expert's affinity doesn't affect another's.
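A minimal sketch of this gating step, using toy sizes (4 experts, top-2) and made-up vectors: affinities are independent sigmoids, the top-k experts are selected, and their scores are renormalized to sum to one. The function and variable names are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def route(token, centroids, top_k):
    """Return {expert_index: gating_weight} for one token."""
    # Each affinity depends only on its own expert's centroid.
    scores = [sigmoid(sum(u * e for u, e in zip(token, c))) for c in centroids]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    # Normalize over the selected experts only.
    return {i: scores[i] / total for i in chosen}

token = [0.5, -1.0, 2.0]
centroids = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
gates = route(token, centroids, top_k=2)
print(gates)
```

Because the scores are sigmoids rather than a softmax, changing one centroid changes only that expert's affinity; the coupling between experts enters solely through the final normalization over the selected set.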

Auxiliary-Loss-Free Load Balancing

The biggest practical problem with MoE is load imbalance. If most tokens flock to a few popular experts, those experts become bottlenecks and the rest sit idle. The standard fix is an auxiliary loss that penalizes imbalance. But this loss fights the main training objective — it forces suboptimal routing to achieve balance, degrading model quality.

DeepSeek-V3's solution is elegant: dynamic bias terms. Each routed expert i has a bias $b_i$ that is adjusted by a simple rule rather than learned by gradient descent. When the router decides which experts a token visits, it uses $s_{i,t} + b_i$ for ranking. But when it computes the actual gating weight (the scalar that multiplies the expert's output), it uses only $s_{i,t}$: the bias never enters the expert's output or its gradient.

Routing decision: top-K of $\{s_{i,t} + b_i\}$
Gating weight: $g_{i,t} = s_{i,t} / \sum_{j \in \text{top-K}} s_{j,t}$

At the end of each training step, the system checks expert loads across the entire batch. Overloaded experts get $b_i$ decreased by $\gamma$. Underloaded experts get $b_i$ increased by $\gamma$. The bias nudges routing toward balance without ever adding a term to the loss function.
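The three pieces of the scheme (biased selection, unbiased gating, rule-based bias update) can be sketched as follows. Treating "overloaded" as "above the mean load" and the value of gamma used here are assumptions for illustration; the function names are hypothetical.

```python
def select_experts(scores, bias, top_k):
    """Pick experts by biased score s + b; the bias only affects selection."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return order[:top_k]

def gating_weights(scores, chosen):
    """Normalize the raw affinities of the chosen experts; the bias is not used."""
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias, loads, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(loads) / len(loads)   # assumption: compare against the mean
    new_bias = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            new_bias.append(b - gamma)
        elif load < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias

scores = [0.9, 0.8, 0.1]     # one token's affinities for 3 experts
bias = [-0.85, 0.0, 0.0]     # expert 0 has been overloaded for many steps
chosen = select_experts(scores, bias, top_k=2)
gates = gating_weights(scores, chosen)
new_bias = update_bias([0.0, 0.0, 0.0], loads=[10, 2, 0], gamma=0.001)
print(chosen, gates, new_bias)
```

In the example, expert 0 has the highest raw affinity but is skipped because of its negative bias, while the gating weights of the experts that are chosen still reflect their raw affinities.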

Why this works better: Auxiliary losses create a tension: the model wants to route tokens to the best expert, but the loss penalizes uneven routing. The bias trick eliminates this tension. The gating weights (and thus gradients flowing through experts) only depend on true affinity scores. The bias only affects which experts are selected, not how much they contribute. Balance is achieved as a side effect of routing adjustments, not as a competing optimization objective. Ablation: the auxiliary-loss-free strategy improves over auxiliary-loss-based routing on MMLU (79.1 vs 78.6), MATH (66.2 vs 63.9), and HumanEval (57.3 vs 56.7) at the 228B parameter scale.
[Interactive simulation: Expert Routing Simulator. 16 experts with top-2 routing; each routed batch shows the bias terms adjusting to balance load, decreasing for overloaded experts and increasing for underloaded ones.]
Node-limited routing: With 256 experts distributed across multiple nodes, letting a token reach any expert would require expensive cross-node communication. DeepSeek-V3 limits each token to at most M nodes (M = 4 in the report). Nodes are ranked by the sum of the highest $K_r / M$ affinity scores among the experts they host, and the top-$K_r$ experts are then chosen only from the M best nodes. This allows nearly full computation-communication overlap during training.
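A small sketch of node-limited selection, under the assumption that each node is scored by the sum of its best top_k / max_nodes expert affinities and that experts are laid out contiguously by node. The sizes are toy values (8 experts on 4 nodes), the balancing bias is left out, and the function name is hypothetical.

```python
def node_limited_top_k(scores, experts_per_node, max_nodes, top_k):
    """Select top_k experts while touching at most max_nodes nodes."""
    n_nodes = len(scores) // experts_per_node
    per_node = top_k // max_nodes   # experts counted when scoring a node

    # Score each node by the sum of its best `per_node` affinities.
    node_score = []
    for n in range(n_nodes):
        local = sorted(scores[n * experts_per_node:(n + 1) * experts_per_node],
                       reverse=True)
        node_score.append(sum(local[:per_node]))

    allowed = sorted(range(n_nodes), key=lambda n: node_score[n], reverse=True)[:max_nodes]

    # Ordinary top-k, restricted to experts living on the allowed nodes.
    candidates = [i for i in range(len(scores)) if i // experts_per_node in allowed]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]

scores = [0.9, 0.1, 0.2, 0.3, 0.8, 0.7, 0.85, 0.05]
limited = node_limited_top_k(scores, experts_per_node=2, max_nodes=2, top_k=4)
unrestricted = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:4]
print(limited, unrestricted)
```

Here expert 6 has the second-highest affinity overall but sits on a node that does not make the cut, so a weaker expert on an allowed node takes its place. That is the quality cost paid for bounded communication.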
Why does DeepSeek-V3 use bias terms for load balancing instead of an auxiliary loss?

Chapter 4: Multi-Token Prediction

Standard language model training uses a next-token prediction objective: at each position t, predict token t+1. This gives one gradient signal per position. DeepSeek-V3 extends this to multi-token prediction (MTP): at position t, predict tokens t+1 and t+2.

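Why? Two reasons: the extra targets densify the training signal at each position, and predicting further ahead pushes the model to plan its representations for upcoming tokens.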

How It Works

Unlike Gloeckle et al. (2024) who use parallel independent heads, DeepSeek-V3 uses sequential prediction with a causal chain. Here's the setup:

Depth 0: Main Model
The full Transformer processes input tokens $t_1, \dots, t_T$. Its output head predicts $t_2, \dots, t_{T+1}$ (standard next-token prediction). This produces hidden states $h_1, \dots, h_T$.
Depth 1: MTP Module
An MTP module (shared output head + one small Transformer layer) takes the embedding of the next token $t_2$ (the ground-truth token during training) and concatenates it with $h_1$ to predict $t_3$. The causal chain: the input for predicting $t_3$ builds on the representation that was used to predict $t_2$.
Training Loss
$L = L_{NTP} + \lambda \cdot L_{MTP}$. The MTP loss has weight $\lambda = 0.3$ (lowered to 0.1 for the final 4.8T tokens of pre-training, per the report). Both are cross-entropy losses. At inference, the MTP modules can be discarded at zero extra cost.
Sequential vs. parallel MTP: Gloeckle et al. predict future tokens independently (one head per depth, no interaction between depths). DeepSeek-V3 maintains a complete causal chain: the prediction at depth k depends on the representation from depth k-1. This means the MTP module learns to refine its understanding sequentially: "if the next token is X, then the one after is probably Y." Alternatively, the MTP module can be kept for speculative decoding: it drafts an extra token that the main model then verifies. The report cites an acceptance rate of 85 to 90% for the second predicted token and roughly 1.8x the tokens per second.

Concretely, the MTP module at depth k takes two inputs: (1) the hidden state $h_t^{k-1}$ from the previous depth, and (2) the embedding of the token one step further along the sequence (the ground-truth token during training; a predicted token only when drafting at inference). These are concatenated, projected through a linear layer, and fed through a single Transformer layer. The output is then passed through the shared output head (same as the main model's language model head) to produce logits for the token one position further ahead.
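The bookkeeping behind the objective is easy to get wrong, so here is a minimal sketch of how targets line up and how the two losses combine. It shows only index alignment and the weighted sum; the per-position loss values are made-up numbers, averaging over positions is an assumption, and the function names are hypothetical.

```python
def mtp_training_pairs(tokens):
    """For each position i: (input token, main-model target, depth-1 MTP target)."""
    return [(tokens[i], tokens[i + 1], tokens[i + 2]) for i in range(len(tokens) - 2)]

def combined_loss(ntp_losses, mtp_losses, lam=0.3):
    """L = L_NTP + lambda * L_MTP, each averaged over positions."""
    ntp = sum(ntp_losses) / len(ntp_losses)
    mtp = sum(mtp_losses) / len(mtp_losses)
    return ntp + lam * mtp

pairs = mtp_training_pairs(["The", "cat", "sat", "down"])
loss = combined_loss(ntp_losses=[1.0, 3.0], mtp_losses=[2.0, 4.0])
print(pairs)
print(loss)
```

Each position contributes two targets instead of one, which is the sense in which MTP gives a denser training signal for the same number of input tokens.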

Ablation results: At the 15.7B scale (1.33T tokens), MTP improves: HumanEval 50.0 → 53.0, MBPP 60.2 → 64.0, MATH 46.4 → 49.0, GSM8K 74.0 → 77.4. At the 228.7B scale (540B tokens), MTP improves: HumanEval 57.3 → 60.4, MATH 62.2 → 66.2. Improvements are consistent across scales and benchmarks. At inference, the MTP module is dropped — the improvements come for free.
[Interactive simulation: Multi-Token Prediction. Toggles between standard NTP, which predicts one token ahead, and MTP, which predicts two, showing what the model sees at position t and what it must predict.]

Why can the MTP modules be discarded at inference without losing the quality improvements they provided during training?

Chapter 5: FP8 Training

Most large model training uses BF16 (bfloat16): 16 bits per number, with 8 bits for exponent and 7 for mantissa. This gives good dynamic range but uses 2 bytes per parameter and per activation. DeepSeek-V3 is the first model at this scale to train with FP8 (8 bits): half the memory, nearly double the throughput.

But FP8 has a fundamental problem: with only 8 bits, you get either decent range (E5M2: 5 exponent, 2 mantissa bits, giving roughly the range of FP16 but almost no precision) or decent precision (E4M3: 4 exponent, 3 mantissa bits, giving one more bit of precision but a maximum value of only 448). Neither is great on its own.
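The range and precision trade-off follows directly from the bit layouts. The sketch below derives the largest finite value and the relative spacing for each format; it assumes IEEE-style encodings for E5M2, FP16, and BF16, and the common "FN" convention for E4M3, in which only the all-ones bit pattern is reserved for NaN. The function names are hypothetical.

```python
def max_finite(exp_bits: int, man_bits: int, ieee_like: bool = True) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN.
        return 2.0 ** (2 ** exp_bits - 2 - bias) * (2 - 2.0 ** -man_bits)
    # E4M3 "FN" convention: top exponent is usable, only mantissa all-ones is NaN.
    return 2.0 ** (2 ** exp_bits - 1 - bias) * (2 - 2 * 2.0 ** -man_bits)

def relative_step(man_bits: int) -> float:
    """Gap between adjacent values relative to their magnitude (normal range)."""
    return 2.0 ** -man_bits

formats = {
    "E4M3": (max_finite(4, 3, ieee_like=False), relative_step(3)),
    "E5M2": (max_finite(5, 2), relative_step(2)),
    "FP16": (max_finite(5, 10), relative_step(10)),
    "BF16": (max_finite(8, 7), relative_step(7)),
}
for name, (largest, step) in formats.items():
    print(f"{name}: max {largest:.4g}, relative step {step:.4g}")
```

E4M3 resolves values twice as finely as E5M2 but saturates at 448, which is why per-block scale factors matter so much for it.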

The Mixed-Precision Strategy

DeepSeek-V3's solution uses FP8 for the expensive parts and higher precision for the sensitive parts:

Operation | Precision | Why
Linear layer forward (GEMM) | FP8 (E4M3) | Bulk of compute. FP8 Tensor Core throughput is nominally 2x BF16.
Linear layer backward (GEMM) | FP8 (E4M3) | Gradient matmuls benefit from the same 2x.
Accumulation in GEMM | FP32 | Tiny errors compound over 7168-dim dot products. Must accumulate in high precision.
Attention, normalization, routing | BF16 / FP32 | These are numerically sensitive. FP8 here causes instability.
Optimizer states | BF16 | The report keeps the AdamW first and second moments in BF16 and observes no measurable degradation.
Master weights | FP32 | Updates are tiny; low precision would lose them.
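The accumulation row in the table is the least intuitive, so here is a toy illustration of why accumulator width matters. It models a narrow accumulator by rounding the running sum to a fixed number of significant bits after every addition, then compares that with promoting partial sums to a wide accumulator every 128 elements. The 14-bit width, the 128-element interval, and the constant inputs are assumptions chosen for illustration; real Tensor Core behavior is more involved, and Python floats stand in for FP32.

```python
import math

def round_to_bits(x: float, bits: int) -> float:
    """Round x to `bits` significant binary digits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def accumulate(products, bits=None):
    """Running sum; if `bits` is set, the sum is rounded after every add."""
    total = 0.0
    for p in products:
        total += p
        if bits is not None:
            total = round_to_bits(total, bits)
    return total

def accumulate_promoted(products, bits, interval=128):
    """Narrow accumulation inside each chunk, wide accumulation across chunks."""
    total = 0.0                     # stands in for the high-precision accumulator
    for start in range(0, len(products), interval):
        total += accumulate(products[start:start + interval], bits)
    return total

products = [0.01] * 7168            # one dot product over the hidden dimension
exact = 71.68
full = accumulate(products)
narrow = accumulate(products, bits=14)
promoted = accumulate_promoted(products, bits=14)
for name, value in [("full", full), ("narrow", narrow), ("promoted", promoted)]:
    print(f"{name:9s} {value:.4f}  error {abs(value - exact):.4f}")
```

Once the running sum grows large relative to each addend, a narrow accumulator rounds every addition to a coarse grid and the errors pile up; keeping the narrow sums short bounds that effect.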

Block-Wise Quantization

A single FP8 scale factor per tensor is too coarse: if one element is 1000x larger than the rest, the scale must accommodate it and the small values get crushed to zero. DeepSeek-V3 uses fine-grained quantization instead. Weights are scaled per 128×128 block. Activations are scaled per 1×128 tile, that is, per token per 128 channels. This preserves precision across the wildly varying magnitudes in real neural network tensors.
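The effect of an outlier on a shared scale factor can be reproduced with a small simulation. The sketch below rounds values to a simplified E4M3 grid (3 mantissa bits, smallest normal exponent -6, saturation at 448) and compares one scale for the whole vector against one scale per block. The block size of 4 is a toy stand-in for 128, and the helper names are hypothetical.

```python
import math

E4M3_MAX = 448.0
MIN_NORMAL_EXP = -6   # subnormals reuse this exponent

def round_e4m3(x: float) -> float:
    """Round x to the nearest value on a simplified E4M3 grid, saturating at 448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)                     # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)
    step = 2.0 ** (exp - 3)                    # 3 mantissa bits: 8 steps per binade
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)

def quant_dequant(values, block_size):
    """Quantize with one scale per block, then map back to real values."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        out.extend(round_e4m3(v / scale) * scale for v in block)
    return out

values = [0.001, 0.002, 0.003, 0.004, 1000.0, 500.0, 250.0, 125.0]
tensor_wise = quant_dequant(values, block_size=len(values))   # one shared scale
block_wise = quant_dequant(values, block_size=4)              # one scale per block
print("tensor-wise:", tensor_wise[:4])
print("block-wise: ", block_wise[:4])
```

With a single scale, the outlier of 1000 pushes the smallest values below the finest representable step and they collapse to zero. With per-block scales, the small block gets a scale that fits its own range.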

Why this was unprecedented: Earlier FP8 training work had not been validated at this scale. The report checks its FP8 recipe against BF16 baselines on smaller models, finds the relative loss error stays below 0.25%, and then applies the recipe to the full 671B model. The trick is threefold: (1) fine-grained quantization with 128-element groups preserves local precision, (2) FP32 accumulation prevents dot-product error from compounding over the 7168 hidden dimension, and (3) only the linear GEMMs use FP8 while attention/norm/routing stay in BF16 or FP32. This selective approach captures most of the available speedup (since linear layers dominate compute) while keeping the sensitive operations accurate.
Concrete savings: FP8 cuts the memory for cached activations roughly in half (8 bits vs 16 bits per element). Together with the other memory optimizations, this lets the 671B model train on 2,048 H800 GPUs without tensor parallelism, which would otherwise add expensive communication overhead. For the dominant linear operations, the theoretical throughput gain over BF16 is up to 2x per GPU.
[Interactive simulation: FP8 vs BF16 Precision. Shows the quantization error of FP8 E4M3 against BF16 for a chosen value and block maximum; block-wise scaling keeps the error small through a local scale factor, while tensor-wise scaling loses precision on small values.]
Why does DeepSeek-V3 use FP32 accumulation inside FP8 matrix multiplications?

Chapter 6: Pre-training

Pre-training is where the model consumes raw text and learns the statistical structure of language. For DeepSeek-V3, this stage consumed 2,664K H800 GPU hours — 95.6% of the total training budget. Everything about it was optimized for stability and throughput.

Data Pipeline

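DeepSeek-V3 was pre-trained on 14.8 trillion tokens of diverse, high-quality data. The composition was intentionally tuned: relative to DeepSeek-V2, the share of mathematical and programming samples was raised and multilingual coverage was extended beyond English and Chinese.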

Training Infrastructure: DualPipe

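With 2,048 H800 GPUs connected by InfiniBand and NVLink, the training framework uses pipeline parallelism scheduled by DualPipe, which overlaps computation with communication, combined with expert parallelism and data parallelism. Tensor parallelism is deliberately left out.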

Training stability: Throughout the entire pre-training on 14.8T tokens, DeepSeek-V3 experienced zero irrecoverable loss spikes and required zero rollbacks. This is extraordinary for a model of this scale. For comparison, many frontier model training runs report multiple loss spikes requiring rollbacks to earlier checkpoints (wasting days of compute). The authors attribute this stability to: (1) careful hyperparameter tuning, (2) the auxiliary-loss-free balancing preventing routing collapse, and (3) FP8 training with sufficient high-precision safeguards.

Hyperparameters

Hyperparameter | Value
Optimizer | AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$)
Learning rate | Peak $2.2 \times 10^{-4}$, cosine decay
Weight decay | 0.1
Batch size ramp | 3072 → 15360 sequences
Sequence length | 4096 tokens (pre-training), extended to 32K then 128K
Gradient clipping | 1.0
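The "peak, then cosine decay" entry compresses a multi-phase schedule. The sketch below follows the phases as described in the technical report, to the best of this rewrite's reading: constant at the peak until 10T tokens, cosine decay to one tenth of the peak over the next 4.3T, then two short constant phases. The initial linear warmup is omitted, and expressing the schedule as a function of tokens rather than steps is a simplification; the names are hypothetical.

```python
import math

PEAK, FLOOR, FINAL = 2.2e-4, 2.2e-5, 7.3e-6

def learning_rate(tokens_seen_t: float) -> float:
    """Learning rate after `tokens_seen_t` trillion tokens (warmup omitted)."""
    if tokens_seen_t < 10.0:
        return PEAK                                   # constant phase
    if tokens_seen_t < 14.3:
        progress = (tokens_seen_t - 10.0) / 4.3       # 0 -> 1 across the decay
        return FLOOR + 0.5 * (PEAK - FLOOR) * (1 + math.cos(math.pi * progress))
    if tokens_seen_t < 14.3 + 0.333:
        return FLOOR                                  # first part of the last 500B
    return FINAL                                      # remaining ~167B tokens

for t in (5.0, 10.0, 12.15, 14.5, 14.7):
    print(f"{t:6.2f}T tokens -> lr {learning_rate(t):.3e}")
```

The phase boundaries add up to the full budget: 10T + 4.3T + 0.5T = 14.8T tokens.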

Context Length Extension

After the main pre-training phase, context length was extended in two stages: first to 32K tokens, then to 128K. This used YaRN (Yet another RoPE extension) to rescale the rotary position embeddings. Each stage trained for a short additional period — 119K GPU hours total, a small fraction of the pre-training cost.

Cost breakdown: Pre-training: 2,664K GPU-hours ($5.328M) | Context extension: 119K GPU-hours ($0.238M) | Post-training: 5K GPU-hours ($0.01M) | Total: 2,788K GPU-hours ($5.576M). Per trillion tokens: 180K GPU-hours = 3.7 days on 2,048 H800s. This is roughly 11x cheaper than Llama 3.1 405B's 30.8M GPU-hours.
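The figures in the cost breakdown can be re-derived from the stage-level GPU-hours. The sketch below uses only numbers quoted in the text; the $2 per GPU-hour rental price is the same assumption the cost estimate itself relies on.

```python
GPU_HOURS_K = {                 # thousands of H800 GPU-hours per stage
    "pre-training": 2664,
    "context extension": 119,
    "post-training": 5,
}
PRICE_PER_GPU_HOUR = 2.0        # USD, assumed rental price
N_GPUS = 2048
PRETRAIN_TOKENS_T = 14.8        # trillions of tokens
LLAMA_405B_GPU_HOURS_K = 30800

total_k = sum(GPU_HOURS_K.values())
total_cost_m = total_k * 1e3 * PRICE_PER_GPU_HOUR / 1e6
pretrain_share = GPU_HOURS_K["pre-training"] / total_k
per_trillion_k = GPU_HOURS_K["pre-training"] / PRETRAIN_TOKENS_T
days_per_trillion = per_trillion_k * 1e3 / N_GPUS / 24
llama_ratio = LLAMA_405B_GPU_HOURS_K / total_k

print(f"Total: {total_k}K GPU-hours, ${total_cost_m:.3f}M")
print(f"Pre-training share: {pretrain_share:.1%}")
print(f"Per trillion tokens: {per_trillion_k:.0f}K GPU-hours, {days_per_trillion:.1f} days")
print(f"Llama 3.1 405B / DeepSeek-V3 GPU-hours: {llama_ratio:.1f}x")
```

Note that the 11x comparison is in GPU-hours on different hardware, so it is a rough indicator rather than a like-for-like cost ratio.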
What parallelism strategy does DeepSeek-V3 intentionally avoid, and why?

Chapter 7: Post-training

The pre-trained base model is a powerful text predictor, but it doesn't follow instructions or align with human preferences. Post-training transforms it into a helpful assistant through two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).

Supervised Fine-Tuning

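DeepSeek-V3 was fine-tuned on 1.5 million instruction-response pairs spanning diverse domains: reasoning data (math, code, logic) generated with help from DeepSeek-R1, and non-reasoning data such as creative writing, role-play, and simple question answering.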

During SFT, the MTP objective was maintained — the model continued predicting 2 tokens ahead, which helped the fine-tuned model retain the stronger representations learned during pre-training.

Reinforcement Learning

After SFT, RL further aligns the model using Group Relative Policy Optimization (GRPO). Unlike PPO, GRPO doesn't require a separate critic network. Instead, it samples a group of responses per prompt, scores each one, and uses the group's own statistics as the baseline: a response's advantage is its reward relative to the group mean, normalized by the group's standard deviation.
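The group-relative baseline is the part of GRPO that replaces the critic, and it fits in a few lines. This is a minimal sketch of the advantage computation only, not of the full policy update; the use of the population standard deviation, the small epsilon for numerical safety, and the function name are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)      # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average get a positive advantage and the rest get a negative one, so no learned value function is needed to decide which samples to reinforce.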

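Two types of reward signals were used: rule-based rewards for tasks whose answers can be checked automatically (for example math results or code run against tests), and model-based rewards from a learned reward model for open-ended responses.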

Distillation from DeepSeek-R1: A remarkable innovation: DeepSeek-V3's reasoning capabilities were boosted by distilling from the DeepSeek-R1 series (long chain-of-thought models). During SFT, some training examples included R1-style reasoning traces — step-by-step problem-solving that the V3 model learned to internalize. The pipeline elegantly incorporates R1's verification and reflection patterns into V3 while maintaining control over output length (V3 doesn't need to produce the full CoT at inference). This contributed significant gains on AIME 2024 (math competition): from 23.6% (no distillation) to 39.2% (with R1 distillation).
Self-rewarding: DeepSeek-V3 can also serve as its own reward model. When used as a generative reward model (scoring responses via prompted evaluation rather than a trained scalar head), it achieves competitive accuracy on RewardBench (92.3%), nearly matching specialized reward models. This enables a potential self-improvement loop where the model evaluates and improves its own outputs.
How does distillation from DeepSeek-R1 improve DeepSeek-V3's reasoning without making it produce long chains of thought?

Chapter 8: Results

DeepSeek-V3 was evaluated against both open-source models (Llama 3.1 405B, Qwen 2.5 72B) and closed-source frontier models (GPT-4o, Claude 3.5 Sonnet). The results position it as the strongest open-source model and competitive with the best closed-source systems.

Knowledge and Reasoning

Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 | Llama 405B
MMLU | 88.5 | 87.2 | 88.3 | 88.6
MMLU-Pro | 75.9 | 72.6 | 78.0 | 73.3
GPQA-Diamond | 59.1 | 49.9 | 65.0 | 51.1
DROP (F1) | 91.6 | 83.7 | 88.3 | 88.7

Math

Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 | Llama 405B
MATH-500 | 90.2 | 74.6 | 78.3 | 73.8
AIME 2024 | 39.2 | 9.3 | 16.0 | 23.3
GSM8K | 95.5 | 92.2 | 95.0 | 89.0

Code

Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 | Llama 405B
HumanEval | 82.6 | 90.2 | 93.7 | 89.0
LiveCodeBench | 40.5 | 34.2 | 38.8 | 28.5
Codeforces (pct.) | 51.6 | 23.6 | 20.3 | 25.3
SWE-bench Verified | 42.0 | 38.8 | 50.8 | 33.4
The standout results: On MATH-500, DeepSeek-V3 scores 90.2% — surpassing GPT-4o (74.6%) and Claude 3.5 Sonnet (78.3%) by a wide margin. On AIME 2024 (competition math), it reaches 39.2% vs GPT-4o's 9.3%. On Codeforces (competitive programming), it achieves the 51.6th percentile vs 23.6th for GPT-4o. These are not small differences — they represent a qualitative leap in math and code reasoning for an open-source model.
[Interactive chart: Benchmark Comparison. Performance across six key benchmarks for DeepSeek-V3, GPT-4o, and Claude 3.5 Sonnet. Higher is better.]

Cost efficiency: DeepSeek-V3 achieves these results at 2.788M GPU-hours ($5.57M). Llama 3.1 405B used 30.8M GPU-hours. Even accounting for the fact that Llama uses dense architecture (no MoE routing overhead), DeepSeek-V3 is roughly 11x cheaper to train while delivering competitive or superior performance on most benchmarks.
On which category of benchmarks does DeepSeek-V3 show the most dramatic advantage over GPT-4o?

Chapter 9: Connections

DeepSeek-V3 sits at the intersection of several major research threads. Let's map where each innovation came from and where it leads.

MLA's Lineage

Multi-Query Attention (MQA, Shazeer 2019) first proposed sharing K,V across heads to reduce cache. Grouped-Query Attention (GQA, Ainslie et al. 2023) added groups of heads sharing K,V. MLA goes further: instead of sharing, it compresses all K,V into a joint latent. This is strictly more powerful — the up-projection can learn different K,V for each head from the same latent. MLA first appeared in DeepSeek-V2 and is retained unchanged in V3.

The MoE Thread

Shazeer et al. (2017) introduced the sparsely-gated Mixture-of-Experts layer, and GShard (Lepikhin et al. 2020) scaled MoE to large Transformers. Switch Transformers (Fedus et al. 2021) simplified to top-1 routing. Mixtral (Jiang et al. 2024) showed that an open sparse MoE model could compete with much larger dense models. DeepSeekMoE (Dai et al. 2024) introduced fine-grained experts (many small experts instead of a few large ones; V3 uses 256) and shared experts. The auxiliary-loss-free balancing in V3 is the latest evolution, targeting the quality-vs-balance trade-off that auxiliary-loss-based MoE training has to manage.

Relation to DeepSeek-R1

DeepSeek-R1 is a separate model family focused on long chain-of-thought reasoning (trained with RL to produce step-by-step solutions). V3's post-training distills R1's reasoning patterns into a standard-length output model. This "best of both worlds" approach — R1-quality reasoning without R1-length outputs — is a major practical innovation.

Relation to Kimi K2

Moonshot AI's Kimi K2 (2025) builds on many of the same ideas: MoE architecture, auxiliary-loss-free balancing (citing DeepSeek-V3), and MTP. K2 extends the approach with Muon optimizer and DAPO for RL, representing the next generation of this design family.

Cheat Sheet

Aspect | DeepSeek-V3
Total parameters | 671B (37B activated per token)
Architecture | Transformer + MLA + DeepSeekMoE
Attention | MLA: $d_c = 512$, $d'_c = 1536$, 128 heads
MoE | 1 shared + 256 routed experts, top-8
Load balancing | Auxiliary-loss-free (dynamic bias)
Training objective | NTP + MTP ($\lambda = 0.3$, depth = 1)
Training precision | FP8 mixed (linear layers) + BF16/FP32
Pre-training tokens | 14.8T
Context length | 128K (extended from 4K via YaRN)
Training cost | 2.788M H800 GPU-hours ($5.57M)
Hardware | 2,048 H800 GPUs
Parallelism | Pipeline (DualPipe) + Expert (no tensor)
Training stability | Zero loss spikes, zero rollbacks
Post-training | SFT (1.5M pairs) + GRPO + R1 distillation
Key results | MATH-500: 90.2%, AIME: 39.2%, Codeforces: 51.6 pct
The broader lesson: DeepSeek-V3 demonstrates that frontier model performance doesn't require frontier budgets. Through careful architectural design (MLA, MoE), training innovation (FP8, DualPipe), and clever post-training (R1 distillation, GRPO), a $5.57M model can compete with systems costing orders of magnitude more. The future of large models may be less about raw scale and more about engineering efficiency.
What is the key difference between how MLA compresses the KV cache vs. how Grouped-Query Attention (GQA) reduces it?