A 1-trillion-parameter MoE model with 32B activated per token, trained on 15.5T tokens with zero loss spikes. MuonClip optimizer + agentic RL push open-source models to state-of-the-art on tool use, SWE, and coding.
You want an AI assistant that can browse the web, write code, use APIs, manage files, and coordinate multi-step plans — all without a human babysitting every tool call. This is agentic intelligence: models that perceive, plan, reason, and act within complex environments.
Open-source models have gotten impressively good at reasoning and code generation. But when you ask them to actually do things — call functions, interact with sandboxed environments, recover from errors in multi-turn tool use — they fall apart. Closed-source models like Claude and GPT-4 dominate agentic benchmarks. The open-source gap is wide.
Kimi K2 attacks all three problems simultaneously. It is a 1.04 trillion-parameter Mixture-of-Experts model with 32 billion activated parameters per token, pre-trained on 15.5 trillion tokens with zero loss spikes, then post-trained through a multi-stage pipeline that includes large-scale agentic data synthesis and joint reinforcement learning across real and synthetic environments.
K2's strategy rests on three pillars, each addressing one of the bottlenecks from Chapter 0.
Kimi K2 is a transformer with 61 layers, but it is not a standard dense transformer. Most of its parameters live inside a Mixture-of-Experts (MoE) feedforward network. Let's trace how a single token flows through.
In a dense transformer, every token passes through the same feedforward (FFN) layer — all parameters are activated. In an MoE transformer, the FFN is replaced by a set of experts (each a small FFN), and a router that selects which experts process each token. Only the selected experts activate. This means you can have far more total parameters without proportionally increasing compute.
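The routing idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not K2's implementation: the softmax over the selected scores, the ReLU experts, the tiny dimensions, and the names `make_expert` and `moe_layer` are all hypothetical choices made for readability.

```python
import numpy as np

def make_expert(rng, d_model, d_hidden):
    """Build a tiny two-matrix FFN expert (ReLU is an assumption for this sketch)."""
    w_in = rng.normal(scale=d_model ** -0.5, size=(d_hidden, d_model))
    w_out = rng.normal(scale=d_hidden ** -0.5, size=(d_model, d_hidden))
    return lambda x: w_out @ np.maximum(w_in @ x, 0.0)

def moe_layer(x, router_w, experts, shared_expert, top_k=8):
    """Route one token vector x through the top_k highest-scoring experts."""
    scores = router_w @ x                      # one affinity score per expert
    chosen = np.argsort(scores)[-top_k:]       # indices of the top_k experts
    # Assumption: normalize the selected scores with a softmax to get gate weights.
    gates = np.exp(scores[chosen] - scores[chosen].max())
    gates = gates / gates.sum()
    out = shared_expert(x)                     # the shared expert sees every token
    for gate, idx in zip(gates, chosen):
        out = out + gate * experts[idx](x)     # only the chosen experts run
    return out, chosen, gates

# Toy sizes; only the expert counts (384 total, 8 active, 1 shared) follow the text.
rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 16, 8, 384
experts = [make_expert(rng, d_model, d_hidden) for _ in range(n_experts)]
shared = make_expert(rng, d_model, d_hidden)
router_w = rng.normal(size=(n_experts, d_model))
x = rng.normal(size=d_model)
out, chosen, gates = moe_layer(x, router_w, experts, shared, top_k=8)
```

Only 8 of the 384 expert functions are evaluated for this token, which is why total parameter count can grow without a proportional increase in per-token compute.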
| Parameter | Kimi K2 | DeepSeek-V3 | Change |
|---|---|---|---|
| Total parameters | 1.04T | 671B | +54% |
| Activated per token | 32.6B | 37B | −13% |
| Total experts | 384 | 256 | +50% |
| Active per token | 8 | 8 | = |
| Shared experts | 1 | 1 | = |
| Attention heads | 64 | 128 | −50% |
| Layers | 61 | 61 | = |
| Hidden dim | 7168 | 7168 | = |
| Expert hidden dim | 2048 | 2048 | = |
K2's sparsity scaling law (tested at small scale with Muon) shows that, with the number of activated parameters (and hence FLOPs) held fixed, increasing the total number of experts lowers loss. Sparsity here is the ratio of total experts to activated experts. Moving from sparsity 8 to sparsity 48 cuts the FLOPs needed to reach the same loss by 1.69×. K2 uses sparsity 48 (384 total / 8 active), balancing performance against infrastructure complexity.
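A rough back-of-the-envelope check of the table's numbers, as a sketch rather than an official parameter count. Two inputs are assumptions not stated in the table: each expert is a gated FFN with three weight matrices, and one of the 61 layers is dense rather than MoE.

```python
# Values taken from the comparison table above.
hidden_dim = 7168
expert_hidden_dim = 2048
total_experts, active_experts, shared_experts = 384, 8, 1
layers = 61

# Assumptions for this estimate (not given in the table).
matrices_per_expert = 3   # gate, up, and down projections of a gated FFN
dense_layers = 1          # layers that use a plain FFN instead of MoE

sparsity = total_experts // active_experts
params_per_expert = matrices_per_expert * hidden_dim * expert_hidden_dim
moe_layers = layers - dense_layers

# Parameters held in routed experts across all MoE layers.
routed_expert_params = params_per_expert * total_experts * moe_layers
# Expert parameters actually used per token (routed + shared).
active_expert_params = params_per_expert * (active_experts + shared_experts) * moe_layers

print(f"sparsity: {sparsity}")
print(f"routed expert params: {routed_expert_params / 1e12:.3f}T")
print(f"active expert params: {active_expert_params / 1e9:.1f}B")
```

Under these assumptions the routed experts alone account for roughly 1.01T parameters, and about 23.8B of the 32.6B activated parameters sit in experts; the remainder would come from attention, embeddings, and the dense layer.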
DeepSeek-V3 uses 128 attention heads. But doubling the head count from 64 to 128 yields only a 0.5–1.2% improvement in validation loss while adding 83% more inference FLOPs at a 128K context length. For agentic applications that require long contexts, this tradeoff is not worth it. K2 uses 64 heads.
Token flow through the MoE layer: the router selects 8 of the 384 experts for each token, and the selected experts' weighted outputs are combined into the final result.
This is the technical heart of the paper. K2 is trained with Muon — an optimizer that learns more per token than AdamW. But Muon has a dangerous failure mode at scale: attention logit explosion. MuonClip fixes this.
Muon is an optimizer that applies a Newton-Schulz iteration to the gradient momentum, effectively taking a step in the direction of the orthogonalized gradient. The update rule:
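One way to write it, reconstructed from the symbol definitions that follow and assuming the standard Muon form for a weight matrix $W \in \mathbb{R}^{n \times m}$ (the momentum buffer $M_t$ and the orthogonalized update $O_t$ are labels introduced here):

```latex
\begin{aligned}
M_t &= \mu M_{t-1} + G_t \\
O_t &= \text{Newton-Schulz}(M_t) \cdot 0.2 \cdot \sqrt{\max(n, m)} \\
W_t &= W_{t-1} - \eta \left( O_t + \lambda W_{t-1} \right)
\end{aligned}
```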
Where $G_t$ is the gradient, μ is momentum, η is the learning rate, and λ is weight decay. The Newton-Schulz step approximately orthogonalizes the momentum matrix, pushing its singular values toward 1, and the RMS scaling factor (0.2 · √max(n,m)) brings the update magnitude in line with Adam's. This gives Muon its superior token efficiency.
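A minimal NumPy sketch of one such step may help make this concrete. It assumes the quintic Newton-Schulz coefficients and five iterations used in the public Muon reference implementation, and a momentum of 0.95; none of these values is stated in the text above, and the function names are illustrative.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315       # assumed quintic coefficients
    X = G / (np.linalg.norm(G) + eps)       # Frobenius norm bounds the spectral norm by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                          # iterate on the wide orientation (smaller X @ X.T)
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=2e-4, mu=0.95, weight_decay=0.1):
    """One Muon update with RMS matching; returns the new weights and momentum."""
    M = mu * M + G                                       # momentum accumulation
    O = newton_schulz(M) * 0.2 * np.sqrt(max(W.shape))   # scaled orthogonalized update
    W = W - lr * (O + weight_decay * W)                  # decoupled weight decay
    return W, M

rng = np.random.default_rng(0)
W0 = rng.normal(size=(64, 32))
G0 = rng.normal(size=(64, 32))
W1, M1 = muon_step(W0, G0, np.zeros_like(W0))
```

Because an n×m matrix with near-unit singular values has an element-wise RMS of about 1/√max(n,m), multiplying by 0.2 · √max(n,m) gives an update RMS of roughly 0.2 regardless of the matrix shape.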
As models scale up, Muon's aggressive updates can cause the dot products between query and key vectors to grow without bound. In a mid-scale MoE model (9B activated, 53B total parameters) trained with vanilla Muon, max attention logits exceeded 1000 within 15K steps. Logits this large can cause numerical overflow in softmax, leading to loss spikes and potential divergence.
Logit soft-cap clips the attention logits directly (e.g., tanh(logits/cap) × cap). But the QK dot products can still grow wildly before capping — the weights themselves are unbounded.
QK-Norm normalizes query and key vectors before computing attention. But K2 uses Multi-head Latent Attention (MLA), where key matrices are not fully materialized during inference — QK-Norm cannot be directly applied.
QK-Clip works on the weights, not the logits. After each optimizer step, it checks the max attention logit for each head. If any head exceeds a threshold τ, it rescales that head's query and key projection weights to pull the logits back below τ.
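For each attention head h, the max logit $S^h_{\max}$ is the largest scaled query-key dot product that head produces over the current batch.
If $S^h_{\max} > \tau$, compute the scaling factor $\gamma_h = \tau / S^h_{\max}$.
Then rescale the weights so that the head's logits shrink by $\gamma_h$, with the factor split evenly between the query and key projections ($\sqrt{\gamma_h}$ each). For MLA, only the head-specific components are rescaled.
The shared rotary key component $k^R$ is left untouched to avoid cross-head interference.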
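The per-head clipping logic can be sketched for plain multi-head attention as follows. This is a simplified illustration, not K2's MLA implementation: it assumes separate per-head column blocks in `W_q` and `W_k`, ignores the causal mask and rotary components, and uses hypothetical function names.

```python
import numpy as np

def max_logit_per_head(W_q, W_k, X, n_heads):
    """Largest scaled query-key dot product for each head on a batch X of shape (seq, d_model)."""
    d_head = W_q.shape[1] // n_heads
    result = np.empty(n_heads)
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        Q, K = X @ W_q[:, cols], X @ W_k[:, cols]
        result[h] = np.max(Q @ K.T) / np.sqrt(d_head)
    return result

def qk_clip(W_q, W_k, X, n_heads, tau=100.0):
    """Rescale only the heads whose max logit exceeds tau; other heads are left unchanged."""
    W_q, W_k = W_q.copy(), W_k.copy()
    d_head = W_q.shape[1] // n_heads
    s_max = max_logit_per_head(W_q, W_k, X, n_heads)
    gammas = np.ones(n_heads)
    for h in range(n_heads):
        if s_max[h] > tau:
            gammas[h] = tau / s_max[h]
            cols = slice(h * d_head, (h + 1) * d_head)
            # Split the correction evenly so the logit scales by exactly gamma.
            W_q[:, cols] *= np.sqrt(gammas[h])
            W_k[:, cols] *= np.sqrt(gammas[h])
    return W_q, W_k, gammas

rng = np.random.default_rng(0)
n_heads, d_model = 4, 32
X = rng.normal(size=(16, d_model))
W_q = rng.normal(size=(d_model, d_model)) * 3.0   # deliberately large weights
W_k = rng.normal(size=(d_model, d_model)) * 3.0
W_q[:, -8:] *= 0.01                               # make the last head well-behaved
W_k[:, -8:] *= 0.01
W_q_new, W_k_new, gammas = qk_clip(W_q, W_k, X, n_heads, tau=100.0)
```

Because the logit is bilinear in the query and key weights, scaling each by √γ scales every logit of that head by γ, so the head's max logit lands at τ while heads below the threshold are not touched.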
K2 was trained with τ = 100. Initially, the max logits are capped at 100 by QK-Clip. Over training, the logits naturally decay to a stable range below 100 without adjusting τ. The training loss curve across all 15.5T tokens shows zero loss spikes — completely smooth.
Max attention logits over training steps: without QK-Clip, logits explode past 1000; with QK-Clip at τ=100, logits stay bounded and gradually stabilize.
K2 was pre-trained on 15.5 trillion tokens spanning four domains: Web Text, Code, Mathematics, and Knowledge. The training recipe is carefully staged to maximize both quality and stability.
| Phase | Tokens | Context | Learning Rate |
|---|---|---|---|
| Main training (constant LR) | 10T | 4096 | 2e-4 |
| Main training (cosine decay) | 5.5T | 4096 | 2e-4 → 2e-5 |
| Annealing | 400B | 4096 | 2e-5 → 7e-6 |
| Long-context activation | 60B | 32K | 7e-6 |
| Extended context (YaRN) | — | 128K | — |
Optimizer: MuonClip with τ = 100. Weight decay: 0.1. Global batch size: 67M tokens. WSD (Warmup-Stable-Decay) learning rate schedule with 500-step warmup.
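The schedule in the table can be written as a piecewise function of tokens seen. This is a sketch under assumptions: linear warmup from zero, a linear shape for the annealing phase, and phases laid end to end in table order are guesses made here, since the table gives only endpoints.

```python
import math

BATCH_TOKENS = 67e6           # global batch size in tokens
WARMUP_STEPS = 500
PEAK_LR, DECAYED_LR, FINAL_LR = 2e-4, 2e-5, 7e-6
CONSTANT_END = 10e12          # end of the constant-LR phase
COSINE_END = 15.5e12          # end of the cosine-decay phase
ANNEAL_END = COSINE_END + 400e9

def lr_at(tokens_seen: float) -> float:
    """Learning rate after `tokens_seen` training tokens."""
    warmup_tokens = WARMUP_STEPS * BATCH_TOKENS
    if tokens_seen < warmup_tokens:
        return PEAK_LR * tokens_seen / warmup_tokens          # assumed linear warmup
    if tokens_seen <= CONSTANT_END:
        return PEAK_LR
    if tokens_seen <= COSINE_END:
        progress = (tokens_seen - CONSTANT_END) / (COSINE_END - CONSTANT_END)
        return DECAYED_LR + 0.5 * (PEAK_LR - DECAYED_LR) * (1 + math.cos(math.pi * progress))
    if tokens_seen <= ANNEAL_END:
        progress = (tokens_seen - COSINE_END) / (ANNEAL_END - COSINE_END)
        return DECAYED_LR + (FINAL_LR - DECAYED_LR) * progress  # assumed linear anneal
    return FINAL_LR                                             # long-context activation

for tokens in (0.0, 5e12, 12.75e12, 15.5e12, 15.9e12):
    print(f"{tokens:.3e} tokens -> lr {lr_at(tokens):.2e}")
```

At 67M tokens per step, the 500-step warmup covers about 33.5B tokens, a negligible fraction of the 10T-token constant phase.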
When high-quality data is limited, you need to extract maximum learning signal per token. K2 uses a rephrasing pipeline — rewriting documents in varied styles and perspectives while preserving factual content.
The pipeline works in chunks: split a long document into segments, rewrite each auto-regressively (with the full document as context), then concatenate. This preserves global coherence while creating diverse surface forms of the same knowledge.
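The chunked loop described above might look like the sketch below. The `rewrite` callable stands in for a model call and is purely hypothetical, as are the paragraph-based splitting rule and the `max_chars` limit; the source does not specify how segments are chosen.

```python
from typing import Callable, List

def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Greedily pack paragraphs into chunks of at most roughly max_chars characters."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks

def rephrase_document(
    text: str,
    rewrite: Callable[[str, List[str], str], str],
    max_chars: int = 2000,
) -> str:
    """Rewrite a document chunk by chunk, then concatenate the results."""
    rewritten: List[str] = []
    for chunk in split_into_chunks(text, max_chars):
        # Each call sees the full document and the chunks rewritten so far,
        # which is what keeps the output globally coherent.
        rewritten.append(rewrite(text, list(rewritten), chunk))
    return "\n\n".join(rewritten)

# Stub rewriter for demonstration only; a real pipeline would call a language model.
seen_context_sizes = []
def upper_case_rewriter(document, previous, chunk):
    seen_context_sizes.append(len(previous))
    return chunk.upper()

result = rephrase_document("a\n\nb\n\nc", upper_case_rewriter, max_chars=3)
```

Passing the already-rewritten chunks into each call is what makes the process auto-regressive rather than a set of independent rewrites.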
Training cluster: NVIDIA H800 GPUs, 2 TB RAM per node, 8 GPUs per node connected by NVLink/NVSwitch, and 8×400 Gbps RoCE links between nodes. Parallelism: 16-way Pipeline Parallelism, 16-way Expert Parallelism, and ZeRO-1 Data Parallelism. The configuration is designed to work on any multiple of 32 nodes, which is consistent with a 16 × 16 = 256-GPU model-parallel group spanning 32 eight-GPU nodes.
After pre-training, K2 has broad knowledge but cannot follow instructions well. Supervised Fine-Tuning (SFT) teaches it to be a useful assistant. K2's SFT is notable for two things: using Muon (not AdamW) for fine-tuning, and a massive agentic data synthesis pipeline.
Two guiding principles: maximize prompt diversity and ensure high response quality. Domain-specific pipelines generate candidate responses using K1.5 and specialized expert models, then LLM judges and human annotators filter for quality.
This is where K2 gets its agentic edge. A three-stage pipeline generates diverse, high-quality tool-use trajectories at scale.
SFT teaches the model to imitate. RL teaches it to improve. K2's reinforcement learning spans multiple domains with a unified objective.
For each problem x, sample K responses from the current policy. Optimize:
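A reconstruction of the objective consistent with the definitions that follow, assuming a squared-error form over the $K$ sampled responses $y_1, \dots, y_K$ (the symbols $\pi_\theta$ for the current policy, $\pi_{\text{old}}$ for the previous policy, and $r(x, y_i)$ for the reward are labels introduced here):

```latex
L_{\text{RL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \frac{1}{K} \sum_{i=1}^{K} \left( r(x, y_i) - \bar{r}(x) - \tau \log \frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)} \right)^{2} \right]
```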
Where r̄(x) is the mean reward across sampled responses (baseline subtraction), and τ controls the KL penalty against the previous policy. This is a variant of GRPO — no critic network needed, just sample multiple responses and use their relative rewards.
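For a single prompt, the loss can be computed from sampled rewards and sequence log-probabilities as in the sketch below. It assumes the objective is a squared error between the baseline-subtracted reward and τ times the log-probability ratio; the function name and the default τ are illustrative.

```python
import numpy as np

def rl_loss(rewards, logp_new, logp_old, tau=1.0):
    """Mean squared residual over K sampled responses to one prompt.

    rewards:  reward of each sampled response, shape (K,)
    logp_new: log-probability of each response under the current policy
    logp_old: log-probability of each response under the previous policy
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()                                # mean reward across samples
    log_ratio = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    residual = rewards - baseline - tau * log_ratio
    return float(np.mean(residual ** 2))

rewards = [1.0, 0.0, 0.0, 1.0]
logp_old = np.array([-5.0, -6.0, -4.0, -7.0])
loss_before_update = rl_loss(rewards, logp_old, logp_old, tau=1.0)
```

The loss reaches zero when each response's log-ratio equals its advantage divided by τ, so above-average responses are pushed up and below-average ones down, with τ limiting how far the policy moves from the previous one.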
K2 uses a Gym-like framework that supports diverse reward types.
For tasks without verifiable answers (creative writing, open-ended QA), K2 acts as its own judge. The model generates multiple responses, then performs pairwise comparisons against rubrics that encode core values, prescriptive rules (to prevent reward hacking), and human-annotated task-specific criteria.
| Technique | Purpose |
|---|---|
| Budget control | Per-sample max token budget based on task type. Exceeding the budget triggers a penalty. Prevents response bloat. |
| PTX loss | Auxiliary loss on curated high-quality samples mixed into RL training. Prevents catastrophic forgetting of valuable knowledge. |
| Temperature decay | High temperature early (explore diverse strategies) → low temperature late (exploit best strategies). Prevents premature convergence. |
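The three techniques in the table can be sketched as small helper functions. Everything numeric here is a placeholder: the penalty size, the PTX mixing coefficient, the linear decay shape, and the temperature endpoints are assumptions, since the text gives no values.

```python
def budgeted_reward(reward: float, n_tokens: int, max_tokens: int, penalty: float = 1.0) -> float:
    """Budget control: subtract a penalty when a response exceeds its token budget."""
    return reward - penalty if n_tokens > max_tokens else reward

def combined_loss(rl_loss: float, ptx_loss: float, ptx_coef: float = 0.1) -> float:
    """PTX loss: add an auxiliary loss on curated samples to the RL loss."""
    return rl_loss + ptx_coef * ptx_loss

def temperature_at(step: int, total_steps: int, t_start: float = 1.0, t_end: float = 0.6) -> float:
    """Temperature decay: move from exploration (high) to exploitation (low)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return t_start + (t_end - t_start) * fraction

# A response of 300 tokens against a 200-token budget is penalized.
within = budgeted_reward(1.0, n_tokens=100, max_tokens=200)
over = budgeted_reward(1.0, n_tokens=300, max_tokens=200)
```

Setting the budget per task type, as the table describes, would amount to looking up `max_tokens` from the task category before calling `budgeted_reward`.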
K2 is evaluated in non-thinking mode (no extended chain-of-thought, max 8192 output tokens) against both open-source and proprietary baselines. All baselines also run in non-thinking mode for fair comparison.
| Benchmark | Kimi K2 | DeepSeek-V3 | Qwen3-235B | Claude Sonnet 4 | GPT-4.1 |
|---|---|---|---|---|---|
| SWE-Bench Verified | 65.8 | 38.8 | 34.4 | 72.7 | 54.6 |
| Tau2-Bench (avg) | 66.1 | 46.9 | 20.9 | 70.6 | 69.6 |
| ACEBench (En) | 76.5 | 72.7 | 70.5 | 76.2 | 80.1 |
| LiveCodeBench v6 | 53.7 | 46.9 | 37.0 | 48.5 | 44.7 |
| OJBench | 27.1 | 24.0 | 11.3 | 15.3 | 19.5 |
| AIME 2025 | 49.5 | 46.7 | 24.7 | 33.1 | 37.0 |
| GPQA-Diamond | 75.1 | 68.4 | 62.9 | 70.0 | 66.3 |
| MMLU | 89.5 | 89.4 | 87.0 | 91.5 | 90.4 |
K2 vs. key baselines across agentic, coding, and reasoning benchmarks. Higher is better.
Let's dig into what "agentic" actually means in practice, and examine K2's performance on the benchmarks that test it.
SWE-Bench presents the model with real GitHub issues and asks it to produce a patch that resolves each one. The "Verified" subset has human-checked test cases. K2 achieves 65.8% in single-attempt agentic mode (using bash/editor tools) and 71.6% with multiple attempts. For context, DeepSeek-V3 scores 38.8% on this benchmark, which is considered a hard test of real-world software engineering.
SWE-Bench Multilingual uses the same format but spans multiple programming languages. K2 leads the open-source models here; the main proprietary comparison point is Claude Sonnet 4 at 51.0%, and Claude Opus was too expensive to evaluate. This tests whether agentic capabilities generalize beyond Python.
Tau2-Bench evaluates multi-turn tool-calling capabilities across domains (retail, airline, telecom). The model must interact with simulated environments, call APIs, and handle multi-step reasoning. K2 achieves 66.1 on the micro-average, with particularly strong retail (70.6) and airline (56.5) scores.
ACEBench evaluates comprehensive tool-use: understanding tool specifications, planning multi-step tool chains, handling errors. K2 scores 76.5 on the English version, competitive with GPT-4.1 (80.1) and above Claude Sonnet 4 (76.2).
Kimi K2 builds on a long chain of innovations. Let's map where each piece comes from and where it leads.
K2's architecture is directly derived from DeepSeek-V3: same MLA attention mechanism, same MoE structure, same 61 layers. The key differences: more experts (384 vs 256), fewer attention heads (64 vs 128), and no expert grouping. K2 also uses Muon instead of AdamW, which is where MuonClip comes in.
Muon was introduced for its superior token efficiency. Moonlight (K2's predecessor work) demonstrated Muon at scale with consistent RMS matching and weight decay. K2 extends this with QK-Clip to handle the instability that emerges at trillion-parameter scale.
K2's RL objective is a variant of GRPO (Group Relative Policy Optimization): sample K responses, use their relative rewards as the signal, no separate value network. This lineage runs from PPO → GRPO → K2's objective with budget control, PTX loss, and temperature decay.
The sparsity scaling law validates and extends findings from Switch Transformer and GShard: more experts with the same activated parameters consistently improve performance. K2 pushes this to sparsity 48, well beyond what prior work explored.
| Aspect | Kimi K2 |
|---|---|
| Architecture | MoE transformer, MLA attention |
| Total params | 1.04T (384 experts) |
| Active params | 32.6B (8 experts + 1 shared) |
| Optimizer | MuonClip (τ = 100) |
| Pre-training tokens | 15.5T |
| Loss spikes | Zero |
| Post-training | SFT → RL (verifiable + self-critique) |
| Agentic data | 23K+ tools, synthetic + real trajectories |
| Context window | 128K (YaRN extension) |
| Key result (agentic) | 65.8 SWE-Bench, 66.1 Tau2-Bench |
| Key result (coding) | 53.7 LiveCodeBench v6, 27.1 OJBench |
| Key result (math) | 49.5 AIME 2025, 75.1 GPQA-Diamond |
| Arena ranking | #1 open-source, #5 overall (July 2025) |