Kimi Team — Moonshot AI, 2025

Kimi K2: Open Agentic Intelligence

A 1-trillion-parameter MoE model with 32B activated per token, trained on 15.5T tokens with zero loss spikes. MuonClip optimizer + agentic RL push open-source models to state-of-the-art on tool use, SWE, and coding.

Prerequisites: Transformers (attention) + Mixture-of-Experts basics + Reinforcement learning intuition
10 Chapters · 5+ Simulations

Chapter 0: The Problem

You want an AI assistant that can browse the web, write code, use APIs, manage files, and coordinate multi-step plans — all without a human babysitting every tool call. This is agentic intelligence: models that perceive, plan, reason, and act within complex environments.

Open-source models have gotten impressively good at reasoning and code generation. But when you ask them to actually do things — call functions, interact with sandboxed environments, recover from errors in multi-turn tool use — they fall apart. Closed-source models like Claude and GPT-4 dominate agentic benchmarks. The open-source gap is wide.

The three bottlenecks: (1) Training instability at scale. The best optimizers for token efficiency (Muon) cause attention logits to explode past 1000, triggering loss spikes and divergence in trillion-parameter models. (2) Data scarcity for agentic behavior. Real tool-use trajectories are expensive to collect. You can't just scrape the internet for "correct multi-step API call sequences." (3) Post-training misalignment. Standard SFT teaches the model to mimic demonstrations but not to recover from mistakes, explore alternatives, or improve through trial and error.

Kimi K2 attacks all three problems simultaneously. It is a 1.04 trillion-parameter Mixture-of-Experts model with 32 billion activated parameters per token, pre-trained on 15.5 trillion tokens with zero loss spikes, then post-trained through a multi-stage pipeline that includes large-scale agentic data synthesis and joint reinforcement learning across real and synthetic environments.

The numbers at a glance: 1.04T total parameters. 32B activated per token. 384 experts, 8 active per token (sparsity 48). 15.5T pre-training tokens. Zero loss spikes. 66.1 Tau2-Bench. 65.8 SWE-Bench Verified. 53.7 LiveCodeBench v6. 49.5 AIME 2025. Top-1 open-source model on LMSYS Arena (July 2025).
What is the core gap that Kimi K2 aims to close?

Chapter 1: The Key Insight

K2's strategy rests on three pillars, each addressing one of the bottlenecks from Chapter 0.

Pillar 1: MoE for Scale
Use a sparse Mixture-of-Experts architecture. 1T total parameters gives massive capacity, but only 32B activate per token — keeping inference cost manageable. Higher sparsity (384 experts, 8 active) consistently lowers loss for the same FLOPs.
Pillar 2: MuonClip for Stability
Muon optimizer gives far better token efficiency than AdamW, but causes attention logits to explode at scale. QK-Clip rescales query/key projection weights post-update to cap logit magnitude — zero loss spikes across 15.5T tokens.
Pillar 3: Agentic RL for Tool Use
SFT alone teaches imitation. RL lets the model learn through interaction — calling tools, observing results, correcting mistakes. A large-scale synthesis pipeline generates thousands of diverse tool-use trajectories with verifiable rewards.
Why not just use AdamW? Muon substantially outperforms AdamW in token efficiency — the model learns more per token seen. When high-quality data is scarce (15.5T tokens is large but finite), squeezing more learning signal per token directly translates to a better model. The problem was that Muon's aggressive updates caused training instability. MuonClip solves this without sacrificing Muon's efficiency advantage.
Why MoE over dense? Under a fixed compute budget (FLOPs), MoE models consistently outperform dense models. K2's sparsity scaling law shows that going from sparsity 8 to sparsity 48 reduces the FLOPs needed to reach the same validation loss by 1.69×. The tradeoff: more total parameters means more memory and infrastructure complexity. But you don't pay the full compute cost since only 8 of 384 experts fire per token.
Why agentic RL instead of just SFT? SFT can only teach the model to reproduce demonstrations it has seen. But agentic tasks require adaptation: the environment gives unexpected results, a tool call fails, the user changes requirements mid-conversation. RL trains the model to optimize outcomes through trial and error. The model generates trajectories, receives rewards (did the task succeed?), and improves. This is fundamentally different from copying expert behavior.
Why does K2 use the Muon optimizer instead of AdamW, despite Muon's instability at scale?

Chapter 2: The MoE Architecture

Kimi K2 is a transformer with 61 layers, but it is not a standard dense transformer. Most of its parameters live inside a Mixture-of-Experts (MoE) feedforward network. Let's trace how a single token flows through.

What is MoE?

In a dense transformer, every token passes through the same feedforward (FFN) layer — all parameters are activated. In an MoE transformer, the FFN is replaced by a set of experts (each a small FFN), and a router that selects which experts process each token. Only the selected experts activate. This means you can have far more total parameters without proportionally increasing compute.

K2's Numbers

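| Parameter | Kimi K2 | DeepSeek-V3 | Change |
|---|---|---|---|
| Total parameters | 1.04T | 671B | +54% |
| Activated per token | 32.6B | 37B | −13% |
| Total experts | 384 | 256 | +50% |
| Active per token | 8 | 8 | = |
| Shared experts | 1 | 1 | = |
| Attention heads | 64 | 128 | −50% |
| Layers | 61 | 61 | = |
| Hidden dim | 7168 | 7168 | = |
| Expert hidden dim | 2048 | 2048 | = |
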
The routing mechanism: For each token, a gating network computes a score for each of the 384 experts. The top-8 experts are selected. Each selected expert processes the token independently, and their outputs are combined using the gating scores as weights. One additional "shared expert" always activates — it serves as a baseline that captures common patterns. So each token sees 8 routed experts + 1 shared expert = 9 expert FFNs total.
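
The routing step described above can be sketched in a few lines of NumPy. This is a toy illustration, not K2's implementation: the hidden widths are shrunk, `make_expert` and `moe_forward` are hypothetical helper names, the experts use ReLU for brevity, and normalizing the gate weights with a softmax over the selected scores is an assumption (the text only says the gating scores are used as weights).

```python
import numpy as np

def make_expert(rng, d_model, d_hidden):
    """Toy two-layer FFN expert (ReLU is an illustrative choice)."""
    w_in = rng.normal(scale=0.1, size=(d_model, d_hidden))
    w_out = rng.normal(scale=0.1, size=(d_hidden, d_model))
    return lambda x: np.maximum(x @ w_in, 0.0) @ w_out

def moe_forward(x, router_w, experts, shared_expert, top_k=8):
    """Route one token to its top-k experts plus the always-on shared expert."""
    scores = x @ router_w                      # one gating score per expert
    top = np.argsort(scores)[-top_k:]          # indices of the k highest scores
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()                       # assumption: softmax over the selected scores
    out = shared_expert(x)                     # the shared expert always fires
    for gate, idx in zip(gates, top):
        out = out + gate * experts[idx](x)     # weighted sum of routed expert outputs
    return out, top

rng = np.random.default_rng(0)
d_model, d_hidden, num_experts = 16, 8, 384    # toy widths, real expert count
experts = [make_expert(rng, d_model, d_hidden) for _ in range(num_experts)]
shared = make_expert(rng, d_model, d_hidden)
router_w = rng.normal(size=(d_model, num_experts))

token = rng.normal(size=d_model)
output, chosen = moe_forward(token, router_w, experts, shared)
print(sorted(chosen.tolist()), output.shape)
```

Only 8 of the 384 expert functions are ever called for this token, which is why compute tracks the activated parameters rather than the total.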

Why 384 Experts?

K2's sparsity scaling law (tested at small scale with Muon) shows that under fixed activated parameters (fixed FLOPs), more total experts = lower loss. Going from 8 to 48 sparsity cuts the FLOPs needed for the same loss by 1.69×. K2 uses sparsity 48 (384 total / 8 active), balancing performance with infrastructure complexity.

Why Only 64 Attention Heads?

DeepSeek-V3 used 128 heads. But doubling heads yields only 0.5–1.2% validation loss improvement while adding 83% more inference FLOPs at 128K context length. For agentic applications that require long contexts, this tradeoff is not worth it. K2 uses 64 heads.

Attention mechanism: K2 uses Multi-head Latent Attention (MLA), same as DeepSeek-V3. MLA compresses key-value pairs into a low-rank latent representation, dramatically reducing KV-cache memory during inference. This is essential for long-context agentic tasks where the model might process 128K tokens.
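
To make the KV-cache saving concrete, here is a minimal sketch of the low-rank idea behind MLA. All sizes are toy values chosen for illustration, the weight names are hypothetical, and the rotary key component that real MLA also carries is omitted; only the compressed latent is cached, and keys and values are re-expanded from it when needed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_heads, d_head = 64, 8, 4, 16   # toy sizes, not K2's

w_down = rng.normal(size=(d_model, d_latent))            # compress hidden state
w_up_k = rng.normal(size=(d_latent, n_heads * d_head))   # expand latent to keys
w_up_v = rng.normal(size=(d_latent, n_heads * d_head))   # expand latent to values

def compress(hidden):
    """Project hidden states (seq, d_model) to the latent that gets cached."""
    return hidden @ w_down

def expand(latent):
    """Rebuild per-head keys and values from the cached latent."""
    seq_len = latent.shape[0]
    keys = (latent @ w_up_k).reshape(seq_len, n_heads, d_head)
    values = (latent @ w_up_v).reshape(seq_len, n_heads, d_head)
    return keys, values

hidden = rng.normal(size=(10, d_model))   # 10 tokens of context
kv_cache = compress(hidden)               # this is all that is stored
keys, values = expand(kv_cache)

full_cache_floats = keys.size + values.size
latent_cache_floats = kv_cache.size
print(full_cache_floats, latent_cache_floats)
```

With these toy sizes the cached latent is 16 times smaller than storing full keys and values; the real ratio depends on the model's actual dimensions.
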

[Interactive simulation: MoE Token Routing. The router selects 8 of 384 experts (shown as a sample of 24); active experts light up and their weighted outputs combine into the final result.]

K2 has 1.04 trillion total parameters but only 32.6 billion activate per token. What architectural feature enables this?

Chapter 3: MuonClip Optimizer

This is the technical heart of the paper. K2 is trained with Muon — an optimizer that learns more per token than AdamW. But Muon has a dangerous failure mode at scale: attention logit explosion. MuonClip fixes this.

What is Muon?

Muon is an optimizer that applies a Newton-Schulz iteration to the gradient momentum, effectively taking a step in the direction of the orthogonalized gradient. The update rule:

$$M_t = \mu M_{t-1} + G_t$$
$$O_t = \text{Newton-Schulz}(M_t) \cdot \sqrt{\max(n, m)} \cdot 0.2$$
$$W_t = W_{t-1} - \eta \, (O_t + \lambda W_{t-1})$$

Where Gt is the gradient, μ is momentum, η is the learning rate, and λ is weight decay. The Newton-Schulz step essentially normalizes the gradient direction, and the RMS scaling factor (0.2 · √max(n,m)) matches Adam's update magnitude. This gives Muon its superior token efficiency.
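
A minimal NumPy sketch of one Muon step may help. The quintic Newton-Schulz coefficients and the five iterations come from the public Muon reference implementation, not from this article; the momentum value 0.95 is an assumption; and weight decay is applied as a shrink toward zero. The function names are illustrative.

```python
import numpy as np

def newton_schulz(M, steps=5, eps=1e-7):
    """Approximately orthogonalize M (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315          # assumed: public Muon coefficients
    X = M / (np.linalg.norm(M) + eps)          # Frobenius-normalize first
    transposed = X.shape[0] > X.shape[1]
    if transposed:                             # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=2e-4, mu=0.95, weight_decay=0.1):
    """One update: momentum, orthogonalize, RMS-match, then decayed step."""
    M = mu * M + G                                   # momentum accumulation
    n, m = W.shape
    O = newton_schulz(M) * np.sqrt(max(n, m)) * 0.2  # RMS scaling from the text
    W = W - lr * (O + weight_decay * W)              # step with weight decay
    return W, M

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
G = rng.normal(size=(8, 16))
W_new, M_new = muon_step(W, G, np.zeros_like(W))
singular_values = np.linalg.svd(newton_schulz(G), compute_uv=False)
print(singular_values.round(2))
```

The printed singular values cluster near 1 regardless of how skewed the gradient's spectrum was, which is the "normalizes the gradient direction" behavior described above.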

The Instability Problem

As models scale up, Muon's aggressive updates cause the dot products between query and key vectors to grow without bound. In a mid-scale MoE model (9B activated, 53B total parameters) trained with vanilla Muon, max attention logits exceeded 1000 within 15K steps. Logits this large cause numerical instability in softmax, leading to loss spikes and potential divergence.

Why doesn't this happen with AdamW? AdamW normalizes each gradient element by its running second moment (the "adaptive" part). This implicitly constrains how fast any single weight can grow. Muon's orthogonalization step does not have this per-element damping — it treats the gradient as a matrix and normalizes globally. This lets individual query/key weights grow faster than softmax can handle.

Existing Mitigations (and why they fail)

Logit soft-cap clips the attention logits directly (e.g., tanh(logits/cap) × cap). But the QK dot products can still grow wildly before capping — the weights themselves are unbounded.

QK-Norm normalizes query and key vectors before computing attention. But K2 uses Multi-head Latent Attention (MLA), where key matrices are not fully materialized during inference — QK-Norm cannot be directly applied.

QK-Clip: The Solution

QK-Clip works on the weights, not the logits. After each optimizer step, it checks the max attention logit for each head. If any head exceeds a threshold τ, it rescales that head's query and key projection weights to pull the logits back below τ.

For each attention head h, the max logit is:

$$S^h_{\max} = \frac{1}{\sqrt{d}} \max_{i,j} \; Q^h_i \, {K^h_j}^{\top}$$

If Shmax > τ, compute the scaling factor:

$$\gamma_h = \tau / S^h_{\max}$$

Then rescale the weights (for MLA, only the head-specific components):

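$$W^h_{qc} \leftarrow W^h_{qc} \cdot \sqrt{\gamma_h}$$
$$W^h_{kc} \leftarrow W^h_{kc} \cdot \sqrt{\gamma_h}$$
$$W^h_{qr} \leftarrow W^h_{qr} \cdot \gamma_h$$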

The shared rotary key component kR is left untouched to avoid cross-head interference.

Why per-head, not global? In practice, only a small subset of heads exhibit exploding logits. A global clip would intervene on all heads, unnecessarily disturbing well-behaved ones. Per-head clipping is surgical: it only touches the heads that need it.
The key engineering detail: QK-Clip does NOT alter the forward/backward computation of the current step. The max logit Shmax is computed during the forward pass anyway (it's a byproduct of attention). The weight rescaling happens after the optimizer step, before the next forward pass. This means the current step's gradients are unaffected — only future logits are constrained. It's a post-hoc correction, not a training modification.
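
The per-head rescaling can be sketched as follows. This is a simplified illustration for plain multi-head attention, where each head has one query and one key projection; it does not model MLA's split into compressed and rotary components. The function names and the toy data are hypothetical.

```python
import numpy as np

def max_logit(X, Wq, Wk):
    """Largest scaled attention logit for one head over all token pairs."""
    d = Wq.shape[1]
    Q, K = X @ Wq, X @ Wk
    return float((Q @ K.T).max() / np.sqrt(d))

def qk_clip(X, Wq_heads, Wk_heads, tau=100.0):
    """Rescale, in place, only the heads whose max logit exceeds tau."""
    clipped = []
    for h in range(len(Wq_heads)):
        s_max = max_logit(X, Wq_heads[h], Wk_heads[h])
        if s_max > tau:
            gamma = tau / s_max
            # sqrt(gamma) on each side scales the logits by gamma overall
            Wq_heads[h] = Wq_heads[h] * np.sqrt(gamma)
            Wk_heads[h] = Wk_heads[h] * np.sqrt(gamma)
            clipped.append(h)
    return clipped

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 32))                       # 16 tokens, toy width 32
Wq = [3.0 * rng.normal(size=(32, 8)), 0.1 * rng.normal(size=(32, 8))]
Wk = [3.0 * rng.normal(size=(32, 8)), 0.1 * rng.normal(size=(32, 8))]
wq1_before = Wq[1].copy()

clipped = qk_clip(X, Wq, Wk, tau=100.0)             # run after the optimizer step
print(clipped, max_logit(X, Wq[0], Wk[0]))
```

Head 0 is built with inflated weights so it gets clipped back to the threshold, while the well-behaved head 1 is left untouched, mirroring the per-head argument above.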

Results: Zero Loss Spikes

K2 was trained with τ = 100. Initially, the max logits are capped at 100 by QK-Clip. Over training, the logits naturally decay to a stable range below 100 without adjusting τ. The training loss curve across all 15.5T tokens shows zero loss spikes — completely smooth.

[Interactive simulation: MuonClip, Logit Capping in Action. Attention logits over training steps: without QK-Clip (orange), logits explode past 1000; with QK-Clip at an adjustable threshold τ (default 100, teal), logits are bounded and gradually stabilize.]
Why can't QK-Norm (normalizing query/key vectors) be used to fix attention instability in K2?

Chapter 4: Pre-training

K2 was pre-trained on 15.5 trillion tokens spanning four domains: Web Text, Code, Mathematics, and Knowledge. The training recipe is carefully staged to maximize both quality and stability.

Training Recipe

| Phase | Tokens | Context | Learning Rate |
|---|---|---|---|
| Main training (constant LR) | 10T | 4096 | 2e-4 |
| Main training (cosine decay) | 5.5T | 4096 | 2e-4 → 2e-5 |
| Annealing | 400B | 4096 | 2e-5 → 7e-6 |
| Long-context activation | 60B | 32K | 7e-6 |
| Extended context (YaRN) | | 128K | |

Optimizer: MuonClip with τ = 100. Weight decay: 0.1. Global batch size: 67M tokens. WSD (Warmup-Stable-Decay) learning rate schedule with 500-step warmup.
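
The staged schedule can be written as a function of tokens seen. This sketch takes the phase boundaries and learning rates from the table above; the linear shape of the warmup and of the annealing decay are assumptions, since the text gives only the endpoints, and the constant and function names are illustrative.

```python
import math

PEAK_LR, MID_LR, FINAL_LR = 2e-4, 2e-5, 7e-6
BATCH_TOKENS = 67e6
WARMUP_TOKENS = 500 * BATCH_TOKENS      # 500 warmup steps
CONSTANT_END = 10e12                    # end of the constant-LR phase
COSINE_END = 15.5e12                    # end of the cosine-decay phase
ANNEAL_END = COSINE_END + 400e9         # end of the annealing phase

def lr_at(tokens):
    """Learning rate after `tokens` training tokens."""
    if tokens < WARMUP_TOKENS:
        return PEAK_LR * tokens / WARMUP_TOKENS        # assumed linear warmup
    if tokens <= CONSTANT_END:
        return PEAK_LR
    if tokens <= COSINE_END:
        p = (tokens - CONSTANT_END) / (COSINE_END - CONSTANT_END)
        return MID_LR + 0.5 * (PEAK_LR - MID_LR) * (1 + math.cos(math.pi * p))
    if tokens <= ANNEAL_END:
        p = (tokens - COSINE_END) / (ANNEAL_END - COSINE_END)
        return MID_LR + (FINAL_LR - MID_LR) * p        # assumed linear anneal
    return FINAL_LR                                    # long-context phase

for t in (0, 5e12, 12.75e12, 15.5e12, 16e12):
    print(f"{t:.3e} tokens -> lr {lr_at(t):.2e}")
```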

Synthetic Data: Rephrasing for Token Utility

When high-quality data is limited, you need to extract maximum learning signal per token. K2 uses a rephrasing pipeline — rewriting documents in varied styles and perspectives while preserving factual content.

The pipeline works in chunks: split a long document into segments, rewrite each auto-regressively (with the full document as context), then concatenate. This preserves global coherence while creating diverse surface forms of the same knowledge.
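
The chunk-wise loop can be sketched as below. The `rewrite_fn` callable stands in for the LLM rewriter and is purely hypothetical, as are the function names and the word-based chunking; passing both the original document and the already-rewritten prefix as context is an assumption about how "auto-regressive" rewriting is wired up.

```python
def split_into_chunks(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def rephrase_document(document, rewrite_fn, chunk_size=256):
    """Rewrite a document chunk by chunk, then stitch the pieces together.

    rewrite_fn(chunk, document, rewritten_prefix) is a placeholder for a
    model call; it sees the source document and everything rewritten so far.
    """
    rewritten = []
    for chunk in split_into_chunks(document, chunk_size):
        prefix = " ".join(rewritten)
        rewritten.append(rewrite_fn(chunk, document, prefix))
    return " ".join(rewritten)

# Dummy rewriter for demonstration: upper-cases the chunk and logs its context.
seen_prefixes = []
def dummy_rewrite(chunk, document, prefix):
    seen_prefixes.append(prefix)
    return chunk.upper()

result = rephrase_document("a b c d e", dummy_rewrite, chunk_size=2)
print(result)
print(seen_prefixes)
```

Each call sees a longer rewritten prefix than the last, which is how later chunks can stay consistent with earlier ones.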

Concrete numbers: On SimpleQA, three strategies were compared: (1) repeat raw wiki-text for 10 epochs: 23.76% accuracy. (2) Rephrase once, repeat 10 epochs: 27.39%. (3) Rephrase 10 times, train 1 epoch each: 28.94%. Rephrasing once and repeating still helps (the model sees different surface forms), but maximum diversity (10 rephrasings) gives the best result. In practice, each corpus is rephrased at most twice.
Why not just repeat data? Multi-epoch repetition leads to overfitting — the model memorizes surface patterns rather than learning concepts. Rephrasing preserves the semantic content while changing the surface form, forcing the model to generalize. It's the data equivalent of data augmentation in vision.

Infrastructure

Training cluster: NVIDIA H800 GPUs, 2 TB RAM per node, 8 GPUs per node connected by NVLink/NVSwitch, 8×400 Gbps RoCE between nodes. Parallelism: 16-way Pipeline Parallelism, 16-way Expert Parallelism, ZeRO-1 Data Parallelism. The configuration is designed to work on any multiple of 32 nodes.

Memory management: The model parameters (BF16) + gradient accumulation (FP32) require ~6 TB across a 256-GPU model-parallel group. To fit activations: (1) Selective recomputation of LayerNorm, SwiGLU, MLA up-projections, MoE down-projections. (2) FP8 storage for MoE activation inputs. (3) Remaining activations offloaded to CPU RAM with pipelined copy engines. Each GPU holds ~30 GB for all state, with the rest used for activations.
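
The ~6 TB figure follows from simple arithmetic, sketched here with decimal units. The split into 2 bytes (BF16 parameter) plus 4 bytes (FP32 gradient accumulator) per parameter comes from the text; reading the gap up to ~30 GB per GPU as other training state is an assumption.

```python
TOTAL_PARAMS = 1.04e12          # total parameters
BYTES_BF16, BYTES_FP32 = 2, 4   # parameter copy + gradient accumulation buffer
GROUP_SIZE = 256                # GPUs in one model-parallel group

total_bytes = TOTAL_PARAMS * (BYTES_BF16 + BYTES_FP32)
total_tb = total_bytes / 1e12
per_gpu_gb = total_bytes / GROUP_SIZE / 1e9

print(f"{total_tb:.2f} TB total, {per_gpu_gb:.1f} GB per GPU")
```

That is about 6.24 TB in total and roughly 24 GB per GPU for parameters and gradients alone, consistent with the ~30 GB per-GPU figure once the remaining state is included.
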
Why does K2 use a rephrasing pipeline instead of simply training for more epochs on the same data?

Chapter 5: Post-training — SFT

After pre-training, K2 has broad knowledge but cannot follow instructions well. Supervised Fine-Tuning (SFT) teaches it to be a useful assistant. K2's SFT is notable for two things: using Muon (not AdamW) for fine-tuning, and a massive agentic data synthesis pipeline.

SFT Data Principles

Two guiding principles: maximize prompt diversity and ensure high response quality. Domain-specific pipelines generate candidate responses using Kimi K1.5 and specialized expert models, then LLM judges and human annotators filter for quality.

Agentic Data Synthesis Pipeline

This is where K2 gets its agentic edge. The pipeline has three stages:

Stage 1: Tool Spec Generation
Build a repository of 23,000+ tools: 3,000 real MCP tools fetched from GitHub + 20,000 synthetic tools generated through hierarchical domain evolution (financial trading → options pricing → specific tool). Each tool has clear interfaces, descriptions, and semantics.
Stage 2: Agent + Task Generation
For each tool-set, generate an agent (system prompt + tool combination) and corresponding tasks with explicit rubrics (success criteria, expected tool-use patterns, evaluation checkpoints). Thousands of distinct agents with varied capabilities.
Stage 3: Trajectory Generation
Simulate multi-turn interactions: LLM-generated user personas engage with agents, a tool simulator executes calls and maintains state with controlled stochasticity (successes, partial failures, edge cases). An LLM judge filters trajectories against rubrics.
The tool simulator is a world model: It doesn't just return canned responses. It maintains persistent state after each tool execution, so a "create_file" call in turn 1 means a "read_file" call in turn 3 can access that file. Controlled randomness produces realistic edge cases: rate limits, partial failures, unexpected formats. This teaches the model to handle errors, not just succeed on the happy path.
Real execution complements simulation: For coding and SWE tasks, K2 also trains on trajectories from real execution sandboxes (Kubernetes-powered, 10,000+ concurrent instances). Real sandboxes execute actual code and provide ground-truth feedback through test suite pass rates. The hybrid approach balances simulation's scalability with real execution's authenticity.
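
The persistent-state idea is easy to see in a toy simulator. This hand-written class is only an illustration: the tool simulator described above is far more general, and the class name, tool names, error strings, and `failure_rate` parameter here are all hypothetical.

```python
import random

class ToolSimulator:
    """Toy stateful tool environment with seeded, controlled failures."""

    def __init__(self, failure_rate=0.0, seed=0):
        self.files = {}                      # state that persists across calls
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)       # seeded for reproducible edge cases

    def call(self, tool, **args):
        # Controlled stochasticity: occasionally inject a transient failure.
        if self.rng.random() < self.failure_rate:
            return {"ok": False, "error": "rate_limited"}
        if tool == "create_file":
            self.files[args["path"]] = args["content"]
            return {"ok": True}
        if tool == "read_file":
            if args["path"] not in self.files:
                return {"ok": False, "error": "file_not_found"}
            return {"ok": True, "content": self.files[args["path"]]}
        return {"ok": False, "error": "unknown_tool"}

sim = ToolSimulator(failure_rate=0.0)
turn1 = sim.call("create_file", path="notes.txt", content="hello")
turn2 = sim.call("read_file", path="missing.txt")
turn3 = sim.call("read_file", path="notes.txt")
print(turn1, turn2, turn3)
```

Because the file written in the first turn is still there in the third, a trajectory can only succeed if the agent's calls are consistent with what it did earlier; raising `failure_rate` adds the error-recovery cases.
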

[Interactive simulation: Agentic Data Pipeline. The three-stage pipeline generates diverse, high-quality tool-use trajectories at scale.]

Why does K2's tool simulator maintain persistent state across tool calls?

Chapter 6: Agentic RL

SFT teaches the model to imitate. RL teaches it to improve. K2's reinforcement learning spans multiple domains with a unified objective.

The RL Objective

For each problem x, sample K responses from the current policy. Optimize:

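$$L_{\mathrm{RL}}(\theta) = \mathbb{E}_{x \sim D}\left[\frac{1}{K}\sum_{i=1}^{K}\left(r(x, y_i) - \bar{r}(x) - \tau \log\frac{\pi_\theta(y_i \mid x)}{\pi_{\mathrm{old}}(y_i \mid x)}\right)^2\right]$$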

Where r̄(x) is the mean reward across sampled responses (baseline subtraction), and τ controls the KL penalty against the previous policy. This is a variant of GRPO — no critic network needed, just sample multiple responses and use their relative rewards.
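
For a single prompt, the objective reduces to a few array operations. This sketch assumes sequence-level log-probabilities are already available for each sampled response; the function name and the example value of τ are illustrative, not taken from the paper.

```python
import numpy as np

def rl_loss(rewards, logp_new, logp_old, tau=0.1):
    """Squared-error objective for K sampled responses to one prompt."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()                               # mean-reward baseline
    log_ratio = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    residual = rewards - baseline - tau * log_ratio         # advantage minus KL-style term
    return float(np.mean(residual ** 2))

# Policy unchanged: the loss equals the variance of the rewards.
loss_same = rl_loss([1, 0, 0, 1], [-2.0] * 4, [-2.0] * 4)

# Log-ratio exactly matches advantage / tau: the loss drops to zero.
loss_matched = rl_loss([1, 0], [-1.0, -3.0], [-2.0, -2.0], tau=0.5)
print(loss_same, loss_matched)
```

The second case shows what the objective is pulling toward: responses with above-average reward should gain log-probability relative to the old policy, and below-average ones should lose it, with τ setting how far the policy may move.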

Verifiable Rewards Gym

K2 uses a Gym-like framework that supports diverse reward types.

Self-Critique Rubric Reward

For tasks without verifiable answers (creative writing, open-ended QA), K2 acts as its own judge. The model generates multiple responses, then performs pairwise comparisons against rubrics that encode core values, prescriptive rules (to prevent reward hacking), and human-annotated task-specific criteria.

Closed-loop critic refinement: The self-critique model is not static. During RL, it is continuously refined using verifiable-reward signals. When K2 generates on-policy rollouts for math/code tasks (where we know the ground truth), these results are used to recalibrate the critic. This transfers objective performance signals from verifiable tasks into the critic's judgment on subjective tasks. The critic evolves alongside the policy.

RL Training Techniques

| Technique | Purpose |
|---|---|
| Budget control | Per-sample max token budget based on task type. Exceeding the budget triggers a penalty. Prevents response bloat. |
| PTX loss | Auxiliary loss on curated high-quality samples mixed into RL training. Prevents catastrophic forgetting of valuable knowledge. |
| Temperature decay | High temperature early (explore diverse strategies) → low temperature late (exploit best strategies). Prevents premature convergence. |

Agentic rollout infrastructure: Multi-turn agentic tasks require interacting with environments (VMs, code interpreters, browsers) during rollout. Challenges: (1) GPU idle time while waiting for environment feedback → solved by large concurrent rollout batches. (2) Long-tail trajectories blocking others → solved by partial rollout (pause long tasks, resume next iteration). (3) Environment diversity → heavy environments deployed as dedicated scalable services.
What is the key advantage of the self-critique rubric reward over a fixed reward model?

Chapter 7: Results

K2 is evaluated in non-thinking mode (no extended chain-of-thought, max 8192 output tokens) against both open-source and proprietary baselines. All baselines also run in non-thinking mode for fair comparison.

Headline Numbers

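| Benchmark | Kimi K2 | DeepSeek-V3 | Qwen3-235B | Claude Sonnet 4 | GPT-4.1 |
|---|---|---|---|---|---|
| SWE-Bench Verified | 65.8 | 38.8 | 34.4 | 72.7 | 54.6 |
| Tau2-Bench (avg) | 66.1 | 46.9 | 20.9 | 70.6 | 69.6 |
| ACEBench (En) | 76.5 | 72.7 | 70.5 | 76.2 | 80.1 |
| LiveCodeBench v6 | 53.7 | 46.9 | 37.0 | 48.5 | 44.7 |
| OJBench | 27.1 | 24.0 | 11.3 | 15.3 | 19.5 |
| AIME 2025 | 49.5 | 46.7 | 24.7 | 33.1 | 37.0 |
| GPQA-Diamond | 75.1 | 68.4 | 62.9 | 70.0 | 66.3 |
| MMLU | 89.5 | 89.4 | 87.0 | 91.5 | 90.4 |
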
The story: K2 is the strongest open-source non-thinking model across the board. On agentic tasks (SWE-Bench, Tau2), it dramatically outperforms DeepSeek-V3 and Qwen3 while closing the gap with Claude. On coding (LiveCodeBench, OJBench) it leads all models including proprietary. On math/STEM (AIME, GPQA) it also leads. This is not a model that trades off one capability for another — it improves everywhere.

[Interactive chart: Benchmark Comparison. K2 vs. key baselines across agentic, coding, and reasoning benchmarks. Higher is better.]

LMSYS Arena: On the crowdsourced LMSYS Arena leaderboard (July 17, 2025), K2 ranks #1 among open-source models and #5 overall based on 3,000+ blind user votes. This measures real-world preference across diverse, open-ended tasks — not just benchmark performance.
On which category of benchmarks does K2 show the largest improvement over previous open-source models?

Chapter 8: Agentic Capabilities

Let's dig into what "agentic" actually means in practice, and examine K2's performance on the benchmarks that test it.

SWE-Bench Verified (65.8%)

SWE-Bench presents the model with real GitHub issues and asks it to produce a patch that resolves the issue. The "Verified" subset has human-checked test cases. K2 achieves 65.8% in single-attempt agentic mode (using bash/editor tools) and 71.6% with multiple attempts. For context, DeepSeek-V3 scores 38.8%, and this benchmark is considered a hard test of real-world software engineering.

SWE-Bench Multilingual (47.3%)

Same format but across multiple programming languages. K2 scores 47.3%, trailing Claude Sonnet 4 (51.0%); Claude Opus was too expensive to evaluate. This tests whether agentic capabilities generalize beyond Python.

Tau2-Bench (66.1%)

Tau2-Bench evaluates multi-turn tool-calling capabilities across domains (retail, airline, telecom). The model must interact with simulated environments, call APIs, and handle multi-step reasoning. K2 achieves 66.1 on the micro-average, with particularly strong retail (70.6) and airline (56.5) scores.

ACEBench (76.5%)

ACEBench evaluates comprehensive tool-use: understanding tool specifications, planning multi-step tool chains, handling errors. K2 scores 76.5 on the English version, competitive with GPT-4.1 (80.1) and above Claude Sonnet 4 (76.2).

What makes K2 agentic? The difference between "can answer questions about code" and "can actually fix a GitHub issue" is enormous. Fixing an issue requires: (1) reading the issue description, (2) navigating the codebase, (3) identifying the relevant files, (4) understanding the bug, (5) writing a patch, (6) testing it, (7) iterating if tests fail. This is a 10-20 step interaction with a real environment. K2's agentic RL training — with real sandboxes and synthetic tool-use trajectories — is what bridges this gap.
What degrades without agentic RL? The paper doesn't report ablations removing agentic RL specifically, so the evidence is indirect: the comparison with the base model and with competing models that lack agentic post-training. DeepSeek-V3 has a comparable pre-training recipe but no agentic RL; it scores 38.8% vs K2's 65.8% on SWE-Bench. Qwen3-235B, also without agentic RL, scores 34.4%. These gaps suggest the agentic training pipeline is the primary differentiator, though other differences between the models may also contribute.
What distinguishes SWE-Bench from standard coding benchmarks like LiveCodeBench?

Chapter 9: Connections

Kimi K2 builds on a long chain of innovations. Let's map where each piece comes from and where it leads.

Relation to DeepSeek-V3

K2's architecture is directly derived from DeepSeek-V3: same MLA attention mechanism, same MoE structure, same 61 layers. The key differences: more experts (384 vs 256), fewer attention heads (64 vs 128), and no expert grouping. K2 also uses Muon instead of AdamW, which is where MuonClip comes in.

Relation to Muon / Moonlight

Muon was introduced for its superior token efficiency. Moonlight (K2's predecessor work) demonstrated Muon at scale with consistent RMS matching and weight decay. K2 extends this with QK-Clip to handle the instability that emerges at trillion-parameter scale.

Relation to GRPO / DAPO

K2's RL objective is a variant of GRPO (Group Relative Policy Optimization): sample K responses, use their relative rewards as the signal, no separate value network. This lineage runs from PPO → GRPO → K2's objective with budget control, PTX loss, and temperature decay.

Relation to MoE Literature

The sparsity scaling law validates and extends findings from Switch Transformer and GShard: more experts with the same activated parameters consistently improve performance. K2 pushes this to sparsity 48, well beyond what prior work explored.

Cheat Sheet

| Aspect | Kimi K2 |
|---|---|
| Architecture | MoE transformer, MLA attention |
| Total params | 1.04T (384 experts) |
| Active params | 32.6B (8 experts + 1 shared) |
| Optimizer | MuonClip (τ = 100) |
| Pre-training tokens | 15.5T |
| Loss spikes | Zero |
| Post-training | SFT → RL (verifiable + self-critique) |
| Agentic data | 23K+ tools, synthetic + real trajectories |
| Context window | 128K (YaRN extension) |
| Key result (agentic) | 65.8 SWE-Bench, 66.1 Tau2-Bench |
| Key result (coding) | 53.7 LiveCodeBench v6, 27.1 OJBench |
| Key result (math) | 49.5 AIME 2025, 75.1 GPQA-Diamond |
| Arena ranking | #1 open-source, #5 overall (July 2025) |

The broader lesson: Scaling model capacity alone is not enough. K2 shows that the training pipeline matters as much as the architecture: a stable optimizer (MuonClip) to extract maximum learning per token, a synthesis pipeline to generate agentic training data at scale, and RL to move beyond imitation to genuine interactive learning. The three pillars reinforce each other — without any one, the result would be significantly weaker.
Which prior work is K2's architecture most directly derived from?