Kimi K2 — Veanors

Chapter 0: The Problem

You want an AI assistant that can browse the web, write code, use APIs, manage files, and coordinate multi-step plans — all without a human babysitting every tool call. This is agentic intelligence: models that perceive, plan, reason, and act within complex environments.

Open-source models have gotten impressively good at reasoning and code generation. But when you ask them to actually do things — call functions, interact with sandboxed environments, recover from errors in multi-turn tool use — they fall apart. Closed-source models like Claude and GPT-4 dominate agentic benchmarks. The open-source gap is wide.

The three bottlenecks: (1) Training instability at scale. The best optimizers for token efficiency (Muon) cause attention logits to explode past 1000, triggering loss spikes and divergence in trillion-parameter models. (2) Data scarcity for agentic behavior. Real tool-use trajectories are expensive to collect. You can't just scrape the internet for "correct multi-step API call sequences." (3) Post-training misalignment. Standard SFT teaches the model to mimic demonstrations but not to recover from mistakes, explore alternatives, or improve through trial and error.

Kimi K2 attacks all three problems simultaneously. It is a 1.04 trillion-parameter Mixture-of-Experts model with 32 billion activated parameters per token, pre-trained on 15.5 trillion tokens with zero loss spikes, then post-trained through a multi-stage pipeline that includes large-scale agentic data synthesis and joint reinforcement learning across real and synthetic environments.

The numbers at a glance: 1.04T total parameters. 32B activated per token. 384 experts, 8 active per token (sparsity 48). 15.5T pre-training tokens. Zero loss spikes. 66.1 Tau2-Bench. 65.8 SWE-Bench Verified. 53.7 LiveCodeBench v6. 49.5 AIME 2025. Top-1 open-source model on LMSYS Arena (July 2025).

What is the core gap that Kimi K2 aims to close?

Open-source models lag far behind closed-source models on agentic tasks (tool use, SWE, multi-step interaction), despite strong reasoning performance
Open-source models cannot be fine-tuned
Transformer architectures are too slow for real-time inference

Chapter 1: The Key Insight

K2's strategy rests on three pillars, each addressing one of the bottlenecks from Chapter 0.

Pillar 1: MoE for Scale

Use a sparse Mixture-of-Experts architecture. 1T total parameters gives massive capacity, but only 32B activate per token — keeping inference cost manageable. Higher sparsity (384 experts, 8 active) consistently lowers loss for the same FLOPs.

↓

Pillar 2: MuonClip for Stability

Muon optimizer gives far better token efficiency than AdamW, but causes attention logits to explode at scale. QK-Clip rescales query/key projection weights post-update to cap logit magnitude — zero loss spikes across 15.5T tokens.

↓

Pillar 3: Agentic RL for Tool Use

SFT alone teaches imitation. RL lets the model learn through interaction — calling tools, observing results, correcting mistakes. A large-scale synthesis pipeline generates thousands of diverse tool-use trajectories with verifiable rewards.

Why not just use AdamW? Muon substantially outperforms AdamW in token efficiency — the model learns more per token seen. When high-quality data is scarce (15.5T tokens is large but finite), squeezing more learning signal per token directly translates to a better model. The problem was that Muon's aggressive updates caused training instability. MuonClip solves this without sacrificing Muon's efficiency advantage.

Why MoE over dense? Under a fixed compute budget (FLOPs), MoE models consistently outperform dense models. K2's sparsity scaling law shows that going from sparsity 8 to sparsity 48 reduces the FLOPs needed to reach the same validation loss by 1.69×. The tradeoff: more total parameters means more memory and infrastructure complexity. But you don't pay the full compute cost since only 8 of 384 experts fire per token.

Why agentic RL instead of just SFT? SFT can only teach the model to reproduce demonstrations it has seen. But agentic tasks require adaptation: the environment gives unexpected results, a tool call fails, the user changes requirements mid-conversation. RL trains the model to optimize outcomes through trial and error. The model generates trajectories, receives rewards (did the task succeed?), and improves. This is fundamentally different from copying expert behavior.

Why does K2 use the Muon optimizer instead of AdamW, despite Muon's instability at scale?

Muon has substantially better token efficiency (more learning per token), and the instability is fixed by QK-Clip without sacrificing that advantage
Muon uses less GPU memory than AdamW
Muon converges in fewer training steps regardless of model size

Chapter 2: The MoE Architecture

Kimi K2 is a transformer with 61 layers, but it is not a standard dense transformer. Most of its parameters live inside a Mixture-of-Experts (MoE) feedforward network. Let's trace how a single token flows through.

What is MoE?

In a dense transformer, every token passes through the same feedforward (FFN) layer — all parameters are activated. In an MoE transformer, the FFN is replaced by a set of experts (each a small FFN), and a router that selects which experts process each token. Only the selected experts activate. This means you can have far more total parameters without proportionally increasing compute.

K2's Numbers

Parameter	Kimi K2	DeepSeek-V3	Change
Total parameters	1.04T	671B	+54%
Activated per token	32.6B	37B	−13%
Total experts	384	256	+50%
Active per token	8	8	=
Shared experts	1	1	=
Attention heads	64	128	−50%
Layers	61	61	=
Hidden dim	7168	7168	=
Expert hidden dim	2048	2048	=

The routing mechanism: For each token, a gating network computes a score for each of the 384 experts. The top-8 experts are selected. Each selected expert processes the token independently, and their outputs are combined using the gating scores as weights. One additional "shared expert" always activates — it serves as a baseline that captures common patterns. So each token sees 8 routed experts + 1 shared expert = 9 expert FFNs total.
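The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not K2's implementation: the experts are stand-in functions that merely scale their input, the router scores are made up, and normalizing the gate weights with a softmax over the selected experts is an assumption (the text only says the gating scores are used as weights).

```python
import math

def route_token(scores, k=8):
    """Pick the top-k experts and return {expert_index: weight}.

    Assumption: weights are a softmax over the selected scores only.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, scores, routed_experts, shared_expert, k=8):
    """Combine the weighted outputs of k routed experts plus the shared expert."""
    weights = route_token(scores, k)
    out = list(shared_expert(x))  # the shared expert always runs
    for i, w in weights.items():
        y = routed_experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, weights

# Toy setup: 384 "experts" that just scale the input by a constant.
def make_expert(scale):
    return lambda x: [scale * v for v in x]

NUM_EXPERTS = 384
experts = [make_expert(1.0 + i / NUM_EXPERTS) for i in range(NUM_EXPERTS)]
shared = make_expert(0.5)
gate_scores = [math.sin(i) for i in range(NUM_EXPERTS)]  # stand-in router logits

output, chosen = moe_layer([1.0, 2.0], gate_scores, experts, shared)
print(sorted(chosen))  # indices of the 8 routed experts that fired
```

Only 8 of the 384 routed expert functions are ever called for this token, which is the whole point: capacity grows with the expert count while per-token compute stays fixed.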

Why 384 Experts?

K2's sparsity scaling law (tested at small scale with Muon) shows that under fixed activated parameters (fixed FLOPs), more total experts = lower loss. Going from 8 to 48 sparsity cuts the FLOPs needed for the same loss by 1.69×. K2 uses sparsity 48 (384 total / 8 active), balancing performance with infrastructure complexity.
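As a rough sanity check on these numbers, the sparsity ratio and the expert parameter count can be recomputed from the dimensions in the table above. Two inputs are assumptions rather than facts stated here: that each expert is a SwiGLU FFN with three projection matrices, and that 60 of the 61 layers are MoE layers.

```python
HIDDEN_DIM = 7168         # from the table above
EXPERT_HIDDEN_DIM = 2048  # from the table above
TOTAL_EXPERTS = 384
ACTIVE_EXPERTS = 8
SHARED_EXPERTS = 1
MOE_LAYERS = 60           # assumption: 61 layers, one of them dense
MATRICES_PER_EXPERT = 3   # assumption: SwiGLU gate, up, and down projections

sparsity = TOTAL_EXPERTS / ACTIVE_EXPERTS
params_per_expert = MATRICES_PER_EXPERT * HIDDEN_DIM * EXPERT_HIDDEN_DIM
routed_total = params_per_expert * TOTAL_EXPERTS * MOE_LAYERS
active_expert_params = params_per_expert * (ACTIVE_EXPERTS + SHARED_EXPERTS) * MOE_LAYERS

print(f"sparsity: {sparsity:.0f}")
print(f"params per expert: {params_per_expert / 1e6:.1f}M")
print(f"routed expert params: {routed_total / 1e12:.2f}T")
print(f"active expert params per token: {active_expert_params / 1e9:.1f}B")
```

Under these assumptions the routed experts account for roughly 1.01T of the 1.04T total, and the 9 expert FFNs that fire per token account for roughly 23.8B of the 32.6B activated parameters. The remainder would be attention, embeddings, and the dense layer.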

Why Only 64 Attention Heads?

DeepSeek-V3 used 128 heads. But doubling heads yields only 0.5–1.2% validation loss improvement while adding 83% more inference FLOPs at 128K context length. For agentic applications that require long contexts, this tradeoff is not worth it. K2 uses 64 heads.

Attention mechanism: K2 uses Multi-head Latent Attention (MLA), same as DeepSeek-V3. MLA compresses key-value pairs into a low-rank latent representation, dramatically reducing KV-cache memory during inference. This is essential for long-context agentic tasks where the model might process 128K tokens.

MoE Token Routing

[Interactive figure: MoE token routing. The router selects 8 of 384 experts (a sample of 24 is shown); the active experts' weighted outputs combine into the final result.]

K2 has 1.04 trillion total parameters but only 32.6 billion activate per token. What architectural feature enables this?

Quantization to 4-bit weights
Mixture-of-Experts: a router selects only 8 of 384 experts per token, so most parameters remain inactive for any given token
Pruning unused attention heads at inference

Chapter 3: MuonClip Optimizer

This is the technical heart of the paper. K2 is trained with Muon — an optimizer that learns more per token than AdamW. But Muon has a dangerous failure mode at scale: attention logit explosion. MuonClip fixes this.

What is Muon?

Muon is an optimizer that applies a Newton-Schulz iteration to the gradient momentum, effectively taking a step in the direction of the orthogonalized gradient. The update rule:

M_t = μ M_t−1 + G_t

O_t = Newton-Schulz(M_t) · √max(n, m) · 0.2

W_t = W_t−1 − η (O_t + λ W_t−1)

Where G_t is the gradient, μ is momentum, η is the learning rate, and λ is weight decay. The Newton-Schulz step essentially normalizes the gradient direction, and the RMS scaling factor (0.2 · √max(n,m)) matches Adam's update magnitude. This gives Muon its superior token efficiency.
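A minimal sketch of one Muon step on a tiny matrix, in plain Python. Two things here are simplifications rather than the actual method: the orthogonalization uses the textbook cubic Newton-Schulz iteration (Muon implementations typically use a tuned higher-order polynomial that converges in fewer steps), and the hyperparameter defaults are illustrative. Weight decay is applied as a shrinkage term scaled by the learning rate.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(m, steps=12):
    """Approximate the nearest (semi-)orthogonal matrix to m.

    Simplified cubic iteration X <- 1.5 X - 0.5 X X^T X, chosen for clarity.
    """
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]  # singular values are now <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

def muon_step(w, grad, momentum, lr=0.02, mu=0.95, weight_decay=0.1):
    """One Muon update; returns (new_weights, new_momentum)."""
    n, m = len(w), len(w[0])
    momentum = [[mu * mv + g for mv, g in zip(mr, gr)] for mr, gr in zip(momentum, grad)]
    scale = 0.2 * math.sqrt(max(n, m))  # RMS-matching factor from the text
    ortho = newton_schulz(momentum)
    new_w = [
        [wv - lr * (scale * o + weight_decay * wv) for wv, o in zip(wr, orow)]
        for wr, orow in zip(w, ortho)
    ]
    return new_w, momentum

weights = [[0.5, -0.2], [0.1, 0.3]]
grad = [[3.0, 1.0], [-1.0, 2.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
new_weights, new_momentum = muon_step(weights, grad, zero)
```

Because the orthogonalized matrix has all singular values near 1, the size of the step no longer depends on the raw gradient magnitude, only on its direction structure.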

The Instability Problem

As training scales up, Muon's aggressive updates cause the dot products between query and key vectors to grow without bound. The effect is already visible well below trillion-parameter scale: in a mid-scale MoE model (9B activated, 53B total parameters) trained with vanilla Muon, max attention logits exceeded 1000 within 15K steps. Logits this large destabilize the softmax numerically, leading to loss spikes and potential divergence.

Why doesn't this happen with AdamW? AdamW divides each gradient element by the square root of its running second moment (the "adaptive" part). This implicitly constrains how fast any single weight can grow. Muon's orthogonalization step has no such per-element damping: it treats the momentum as a matrix and normalizes it as a whole. This lets individual query/key weights grow faster than softmax can handle.

Existing Mitigations (and why they fail)

Logit soft-cap clips the attention logits directly (e.g., tanh(logits/cap) × cap). But the QK dot products can still grow wildly before capping — the weights themselves are unbounded.
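A small illustration of the soft-cap formula quoted above, with a made-up cap of 50. It shows the limitation in miniature: the capped value is bounded, but the raw logit of 1000 still has to be computed first, so nothing restrains the weights that produced it.

```python
import math

def soft_cap(logit, cap):
    """Squash a logit smoothly into (-cap, cap) using tanh."""
    return cap * math.tanh(logit / cap)

raw_logits = [1.0, 40.0, 1000.0]
capped = [soft_cap(x, cap=50.0) for x in raw_logits]
print(capped)  # small logits pass almost unchanged, huge ones saturate near the cap
```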

QK-Norm normalizes query and key vectors before computing attention. But K2 uses Multi-head Latent Attention (MLA), where key matrices are not fully materialized during inference — QK-Norm cannot be directly applied.

QK-Clip: The Solution

QK-Clip works on the weights, not the logits. After each optimizer step, it checks the max attention logit for each head. If any head exceeds a threshold τ, it rescales that head's query and key projection weights to pull the logits back below τ.

For each attention head h, the max logit is:

S^h_max = (1/√d) · max_i,j Q^h_i K^h⊤_j

If S^h_max > τ, compute the scaling factor:

γ_h = τ / S^h_max

Then rescale the weights (for MLA, only the head-specific components):

W^h_qc ← W^h_qc · √γ_h

W^h_kc ← W^h_kc · √γ_h

W^h_qr ← W^h_qr · γ_h

The shared rotary key component k^R is left untouched to avoid cross-head interference.
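A short check of why the exponents differ between components. The notation is an assumption for illustration: q^C and k^C are the head-specific query and key parts produced by W_qc and W_kc, q^R is the head-specific rotary query from W_qr, and k^R is the shared rotary key. Treating each vector as linear in its projection weights, the rescaled logit is:

```latex
\begin{aligned}
s_{ij}^{h} &= \tfrac{1}{\sqrt{d}}\left( q_i^{C,h} \cdot k_j^{C,h} + q_i^{R,h} \cdot k_j^{R} \right) \\
s_{ij}^{h,\mathrm{new}} &= \tfrac{1}{\sqrt{d}}\left( \bigl(\sqrt{\gamma_h}\, q_i^{C,h}\bigr) \cdot \bigl(\sqrt{\gamma_h}\, k_j^{C,h}\bigr) + \bigl(\gamma_h\, q_i^{R,h}\bigr) \cdot k_j^{R} \right) = \gamma_h\, s_{ij}^{h}
\end{aligned}
```

Both terms pick up exactly one factor of γ_h, so the head's max logit becomes γ_h · S^h_max = τ. The rotary query takes the full γ_h because its partner k^R is shared across heads and cannot absorb half of the scaling.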

Why per-head, not global? In practice, only a small subset of heads exhibit exploding logits. A global clip would intervene on all heads, unnecessarily disturbing well-behaved ones. Per-head clipping is surgical: it only touches the heads that need it.

The key engineering detail: QK-Clip does NOT alter the forward/backward computation of the current step. The max logit S^h_max is computed during the forward pass anyway (it's a byproduct of attention). The weight rescaling happens after the optimizer step, before the next forward pass. This means the current step's gradients are unaffected — only future logits are constrained. It's a post-hoc correction, not a training modification.
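The per-head procedure can be sketched as follows. This is a simplified illustration under stated assumptions: it uses plain multi-head attention with one query and one key matrix per head (no MLA component split), and it recomputes the max logit from a small batch instead of reusing the value recorded during the forward pass. The helper names are hypothetical.

```python
import math

def project(x, w):
    """Multiply a row vector x by a matrix w (rows index input dims)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def max_logit(tokens, w_q, w_k):
    """Largest scaled query-key dot product over all token pairs for one head."""
    d_head = len(w_q[0])
    queries = [project(t, w_q) for t in tokens]
    keys = [project(t, w_k) for t in tokens]
    return max(
        sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head)
        for q in queries for k in keys
    )

def qk_clip(heads, tokens, tau):
    """Rescale, in place, only the heads whose max logit exceeds tau."""
    clipped = []
    for h, (w_q, w_k) in enumerate(heads):
        s_max = max_logit(tokens, w_q, w_k)
        if s_max > tau:
            root = math.sqrt(tau / s_max)  # sqrt(gamma_h) on each side
            for w in (w_q, w_k):
                for row in w:
                    row[:] = [v * root for v in row]
            clipped.append(h)
    return clipped

tokens = [[1.0, 2.0], [3.0, -1.0]]
heads = [
    ([[0.1, 0.0], [0.0, 0.1]], [[0.1, 0.0], [0.0, 0.1]]),      # well-behaved head
    ([[20.0, 0.0], [0.0, 20.0]], [[20.0, 0.0], [0.0, 20.0]]),  # exploding head
]
clipped_heads = qk_clip(heads, tokens, tau=100.0)
print(clipped_heads, max_logit(tokens, *heads[1]))
```

Only the second head is touched, and its max logit lands on the threshold, while the first head's weights are left exactly as they were.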

Results: Zero Loss Spikes

K2 was trained with τ = 100. Initially, the max logits are capped at 100 by QK-Clip. Over training, the logits naturally decay to a stable range below 100 without adjusting τ. The training loss curve across all 15.5T tokens shows zero loss spikes — completely smooth.

MuonClip: Logit Capping in Action

[Interactive figure: max attention logit over training steps. Without QK-Clip, logits explode past 1000; with QK-Clip at τ = 100, logits stay bounded and gradually stabilize.]

Why can't QK-Norm (normalizing query/key vectors) be used to fix attention instability in K2?

K2 uses Multi-head Latent Attention (MLA) where key matrices are not fully materialized during inference, making QK-Norm inapplicable
QK-Norm is too computationally expensive for 1T parameter models
QK-Norm only works with the AdamW optimizer

Chapter 4: Pre-training

K2 was pre-trained on 15.5 trillion tokens spanning four domains: Web Text, Code, Mathematics, and Knowledge. The training recipe is carefully staged to maximize both quality and stability.

Training Recipe

Phase	Tokens	Context	Learning Rate
Main training (constant LR)	10T	4096	2e-4
Main training (cosine decay)	5.5T	4096	2e-4 → 2e-5
Annealing	400B	4096	2e-5 → 7e-6
Long-context activation	60B	32K	7e-6
Extended context (YaRN)	—	128K	—

Optimizer: MuonClip with τ = 100. Weight decay: 0.1. Global batch size: 67M tokens. WSD (Warmup-Stable-Decay) learning rate schedule with 500-step warmup.
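The main-run schedule (the first two rows of the table) can be written as a function of tokens consumed. A sketch under assumptions: the warmup is taken to be linear from zero and counted inside the 10T constant phase, neither of which is stated here; the annealing and long-context phases are left out.

```python
import math

BATCH_TOKENS = 67e6     # global batch size in tokens
WARMUP_STEPS = 500
PEAK_LR = 2e-4
FINAL_LR = 2e-5
STABLE_TOKENS = 10e12   # constant-LR phase
DECAY_TOKENS = 5.5e12   # cosine-decay phase

def learning_rate(tokens_seen):
    """Learning rate for the 15.5T-token main run as a function of tokens consumed."""
    warmup_tokens = WARMUP_STEPS * BATCH_TOKENS
    if tokens_seen < warmup_tokens:
        return PEAK_LR * tokens_seen / warmup_tokens  # assumption: linear warmup
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))

total_steps = (STABLE_TOKENS + DECAY_TOKENS) / BATCH_TOKENS
for t in (0.0, 5e12, 12.75e12, 15.5e12):
    print(f"{t / 1e12:5.2f}T tokens -> lr {learning_rate(t):.2e}")
```

At 67M tokens per step the main run works out to roughly 231K optimizer steps, so the 500-step warmup covers only a tiny fraction of training.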

Synthetic Data: Rephrasing for Token Utility

When high-quality data is limited, you need to extract maximum learning signal per token. K2 uses a rephrasing pipeline — rewriting documents in varied styles and perspectives while preserving factual content.

The pipeline works in chunks: split a long document into segments, rewrite each auto-regressively (with the full document as context), then concatenate. This preserves global coherence while creating diverse surface forms of the same knowledge.
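The chunked procedure can be sketched as below. The `rewrite` callable is a hypothetical stand-in for an LLM call, and its signature, the word-based splitting, and the chunk size are all assumptions made for illustration. Each chunk is rewritten with the full document and the previously rewritten text available as context.

```python
def split_into_chunks(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def rephrase_document(text, rewrite, max_words=50):
    """Rewrite a document chunk by chunk and concatenate the results.

    `rewrite(chunk, document, previous)` is a hypothetical callable standing in
    for a model call; it returns the rewritten chunk as a string.
    """
    rewritten = []
    for chunk in split_into_chunks(text, max_words):
        previous = " ".join(rewritten)  # what has been rewritten so far
        rewritten.append(rewrite(chunk, text, previous))
    return " ".join(rewritten)

def toy_rewrite(chunk, document, previous):
    """Placeholder 'style change': upper-case the chunk."""
    return chunk.upper()

sample = "Muon orthogonalizes the momentum matrix before each update. " * 20
result = rephrase_document(sample, toy_rewrite, max_words=30)
print(result[:80])
```

Keeping each rewrite call short sidesteps output-length limits, while passing the surrounding context is what keeps the concatenated result coherent.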

Concrete numbers: On SimpleQA, three strategies were compared: (1) repeat raw wiki-text for 10 epochs: 23.76% accuracy. (2) Rephrase once, repeat 10 epochs: 27.39%. (3) Rephrase 10 times, train 1 epoch each: 28.94%. Rephrasing once and repeating still helps (the model sees different surface forms), but maximum diversity (10 rephrasings) gives the best result. In practice, each corpus is rephrased at most twice.

Why not just repeat data? Multi-epoch repetition leads to overfitting: the model memorizes surface patterns rather than learning concepts. Rephrasing preserves the semantic content while changing the surface form, forcing the model to generalize. It is the text analogue of data augmentation in vision.

Infrastructure

Training cluster: NVIDIA H800 GPUs, 2 TB RAM per node, 8 GPUs per node connected by NVLink/NVSwitch, 8×400 Gbps RoCE between nodes. Parallelism: 16-way Pipeline Parallelism, 16-way Expert Parallelism, ZeRO-1 Data Parallelism. The configuration is designed to work on any multiple of 32 nodes.

Memory management: The model parameters (BF16) + gradient accumulation (FP32) require ~6 TB across a 256-GPU model-parallel group. To fit activations: (1) Selective recomputation of LayerNorm, SwiGLU, MLA up-projections, MoE down-projections. (2) FP8 storage for MoE activation inputs. (3) Remaining activations offloaded to CPU RAM with pipelined copy engines. Each GPU holds ~30 GB for all state, with the rest used for activations.
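The ~6 TB figure follows from simple arithmetic on the parameter count. A back-of-the-envelope sketch, using decimal units (1 TB = 1e12 bytes) and an even split across the group, both of which are simplifying assumptions:

```python
TOTAL_PARAMS = 1.04e12
BYTES_BF16 = 2       # parameters stored in BF16
BYTES_FP32 = 4       # gradient accumulation buffer in FP32
GPUS_IN_GROUP = 256  # size of the model-parallel group

param_bytes = TOTAL_PARAMS * BYTES_BF16
grad_bytes = TOTAL_PARAMS * BYTES_FP32
total_tb = (param_bytes + grad_bytes) / 1e12
per_gpu_gb = (param_bytes + grad_bytes) / GPUS_IN_GROUP / 1e9

print(f"parameters + gradients: {total_tb:.2f} TB")
print(f"per GPU in a 256-GPU group: {per_gpu_gb:.1f} GB")
```

This gives about 6.2 TB in total and about 24 GB per GPU for parameters and gradients alone. The gap up to the quoted ~30 GB of state per GPU is presumably the sharded optimizer state, though that breakdown is not given here.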

Why does K2 use a rephrasing pipeline instead of simply training for more epochs on the same data?

Rephrasing is faster than repeated training
Rephrasing reduces the dataset size for efficiency
Rephrasing preserves semantic content while varying surface form, forcing the model to generalize rather than memorize, yielding higher accuracy (28.9% vs 23.8% on SimpleQA)

Chapter 5: Post-training — SFT

After pre-training, K2 has broad knowledge but cannot follow instructions well. Supervised Fine-Tuning (SFT) teaches it to be a useful assistant. K2's SFT is notable for two things: using Muon (not AdamW) for fine-tuning, and a massive agentic data synthesis pipeline.

SFT Data Principles

Two guiding principles: maximize prompt diversity and ensure high response quality. Domain-specific pipelines generate candidate responses using K1.5 and specialized expert models, then LLM judges and human annotators filter for quality.

Agentic Data Synthesis Pipeline

This is where K2 gets its agentic edge. The pipeline has three stages:

Stage 1: Tool Spec Generation

Build a repository of 23,000+ tools: 3,000 real MCP tools fetched from GitHub + 20,000 synthetic tools generated through hierarchical domain evolution (financial trading → options pricing → specific tool). Each tool has clear interfaces, descriptions, and semantics.

↓

Stage 2: Agent + Task Generation

For each tool-set, generate an agent (system prompt + tool combination) and corresponding tasks with explicit rubrics (success criteria, expected tool-use patterns, evaluation checkpoints). Thousands of distinct agents with varied capabilities.

↓

Stage 3: Trajectory Generation

Simulate multi-turn interactions: LLM-generated user personas engage with agents, a tool simulator executes calls and maintains state with controlled stochasticity (successes, partial failures, edge cases). An LLM judge filters trajectories against rubrics.

The tool simulator is a world model: It doesn't just return canned responses. It maintains persistent state after each tool execution, so a "create_file" call in turn 1 means a "read_file" call in turn 3 can access that file. Controlled randomness produces realistic edge cases: rate limits, partial failures, unexpected formats. This teaches the model to handle errors, not just succeed on the happy path.
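A toy version of a stateful simulator makes the idea concrete. Everything here is hypothetical: the class, the two tool names, the error strings, and the single failure mode are illustrative choices, and K2's simulator is itself model-driven rather than hand-coded like this.

```python
import random

class ToolSimulator:
    """Toy stateful tool simulator; tool names and error strings are illustrative."""

    def __init__(self, failure_rate=0.0, seed=0):
        self.files = {}                  # persistent state shared across calls
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)   # seeded so failures are reproducible

    def call(self, tool, **args):
        # Controlled stochasticity: occasionally fail regardless of the request.
        if self.rng.random() < self.failure_rate:
            return {"ok": False, "error": "rate_limited"}
        if tool == "create_file":
            self.files[args["path"]] = args["content"]
            return {"ok": True}
        if tool == "read_file":
            if args["path"] not in self.files:
                return {"ok": False, "error": "file_not_found"}
            return {"ok": True, "content": self.files[args["path"]]}
        return {"ok": False, "error": "unknown_tool"}

sim = ToolSimulator()
sim.call("create_file", path="notes.txt", content="hello")
later = sim.call("read_file", path="notes.txt")     # sees the earlier write
missing = sim.call("read_file", path="other.txt")   # realistic error path
print(later, missing)
```

Raising `failure_rate` above zero injects rate-limit errors into otherwise valid calls, which is the kind of trajectory that teaches retry and recovery behavior.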

Real execution complements simulation: For coding and SWE tasks, K2 also trains on trajectories from real execution sandboxes (Kubernetes-powered, 10,000+ concurrent instances). Real sandboxes execute actual code and provide ground-truth feedback through test suite pass rates. The hybrid approach balances simulation's scalability with real execution's authenticity.

Agentic Data Pipeline

[Interactive figure: the three-stage pipeline that generates diverse, high-quality tool-use trajectories at scale.]

Why does K2's tool simulator maintain persistent state across tool calls?

So that multi-step interactions are realistic: a file created in step 1 exists in step 3, teaching the model to track and reason about environment state changes over time
To reduce memory usage by caching tool outputs
To make the simulation run faster

Chapter 6: Agentic RL

SFT teaches the model to imitate. RL teaches it to improve. K2's reinforcement learning spans multiple domains with a unified objective.

The RL Objective

For each problem x, sample K responses from the current policy. Optimize:

L_RL(θ) = E_x~D [ (1/K) ∑_i ( r(x, y_i) − r̄(x) − τ log(π_θ(y_i|x) / π_old(y_i|x)) )² ]

Where r̄(x) is the mean reward across sampled responses (baseline subtraction), and τ controls the KL penalty against the previous policy. This is a variant of GRPO — no critic network needed, just sample multiple responses and use their relative rewards.
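For a single prompt, the objective above reduces to a few lines of arithmetic. A sketch assuming sequence-level log-probabilities are already available for each sampled response; the reward values, log-probabilities, and τ = 0.1 are made-up numbers.

```python
def rl_loss(rewards, logp_new, logp_old, tau):
    """Squared-error objective for one prompt with K sampled responses."""
    k = len(rewards)
    baseline = sum(rewards) / k  # mean reward over the K samples
    total = 0.0
    for r, ln, lo in zip(rewards, logp_new, logp_old):
        log_ratio = ln - lo  # log(pi_theta / pi_old) for the whole response
        total += (r - baseline - tau * log_ratio) ** 2
    return total / k

rewards = [1.0, 0.0, 0.0, 1.0]
logp_old = [-12.0, -15.0, -9.0, -11.0]

# Before any update the policy equals the old policy, so the loss is the reward variance.
loss_on_policy = rl_loss(rewards, logp_old, logp_old, tau=0.1)

# The loss reaches zero when each log-ratio equals (reward - baseline) / tau.
logp_target = [-7.0, -20.0, -14.0, -6.0]
loss_at_target = rl_loss(rewards, logp_target, logp_old, tau=0.1)
print(loss_on_policy, loss_at_target)
```

The second call shows what minimizing the loss asks for: responses with above-average reward become more likely and below-average ones less likely, with τ setting how far the policy may move from the previous one.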

Verifiable Rewards Gym

K2 uses a Gym-like framework supporting diverse reward types:

Math/STEM: Exact answer matching with moderate-difficulty filtering (neither trivially easy nor impossibly hard)
Coding: Test suite pass rates in real sandboxes (10K+ concurrent instances)
Instruction following: Hybrid verification — code interpreters for deterministic constraints (length, style) + LLM judges for nuanced understanding + hack-check for deceptive compliance
Safety: Adversarial prompt evolution: attack model generates jailbreaks, target model responds, judge evaluates
Faithfulness: Sentence-level faithfulness judge detects unsupported factual claims
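The two simplest reward types in the list, plus the moderate-difficulty filter, can be sketched as plain functions. The answer normalization and the 0.1 / 0.9 thresholds are illustrative assumptions, not values from the source.

```python
def math_reward(predicted, reference):
    """1.0 if the normalized answers match exactly, else 0.0."""
    def normalize(s):
        return s.strip().lower().replace(" ", "")
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0

def coding_reward(test_results):
    """Fraction of tests passed; test_results is a list of booleans."""
    return sum(test_results) / len(test_results) if test_results else 0.0

def keep_for_training(sample_rewards, low=0.1, high=0.9):
    """Moderate-difficulty filter: drop prompts that are nearly always or never solved.

    The thresholds are illustrative, not from the source.
    """
    mean = sum(sample_rewards) / len(sample_rewards)
    return low <= mean <= high

print(math_reward(" 42 ", "42"), coding_reward([True, True, False, True]))
print(keep_for_training([1.0, 0.0, 1.0, 0.0]), keep_for_training([1.0, 1.0, 1.0, 1.0]))
```

Prompts that every sample solves, or none does, give all responses the same reward, so after baseline subtraction they contribute no learning signal. That is the practical reason for filtering on difficulty.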

Self-Critique Rubric Reward

For tasks without verifiable answers (creative writing, open-ended QA), K2 acts as its own judge. The model generates multiple responses, then performs pairwise comparisons against rubrics that encode core values, prescriptive rules (to prevent reward hacking), and human-annotated task-specific criteria.

Closed-loop critic refinement: The self-critique model is not static. During RL, it is continuously refined using verifiable-reward signals. When K2 generates on-policy rollouts for math/code tasks (where we know the ground truth), these results are used to recalibrate the critic. This transfers objective performance signals from verifiable tasks into the critic's judgment on subjective tasks. The critic evolves alongside the policy.

RL Training Techniques

Technique	Purpose
Budget control	Per-sample max token budget based on task type. Exceeding the budget triggers a penalty. Prevents response bloat.
PTX loss	Auxiliary loss on curated high-quality samples mixed into RL training. Prevents catastrophic forgetting of valuable knowledge.
Temperature decay	High temperature early (explore diverse strategies) → low temperature late (exploit best strategies). Prevents premature convergence.

Agentic rollout infrastructure: Multi-turn agentic tasks require interacting with environments (VMs, code interpreters, browsers) during rollout. Challenges: (1) GPU idle time while waiting for environment feedback → solved by large concurrent rollout batches. (2) Long-tail trajectories blocking others → solved by partial rollout (pause long tasks, resume next iteration). (3) Environment diversity → heavy environments deployed as dedicated scalable services.

What is the key advantage of the self-critique rubric reward over a fixed reward model?

It is cheaper to compute
It generates more training data
The critic evolves alongside the policy through closed-loop refinement on verifiable tasks, grounding subjective judgments in objective performance signals

Chapter 7: Results

K2 is evaluated in non-thinking mode (no extended chain-of-thought, max 8192 output tokens) against both open-source and proprietary baselines. All baselines also run in non-thinking mode for fair comparison.

Headline Numbers

Benchmark	Kimi K2	DeepSeek-V3	Qwen3-235B	Claude Sonnet 4	GPT-4.1
SWE-Bench Verified	65.8	38.8	34.4	72.7	54.6
Tau2-Bench (avg)	66.1	46.9	20.9	70.6	69.6
ACEBench (En)	76.5	72.7	70.5	76.2	80.1
LiveCodeBench v6	53.7	46.9	37.0	48.5	44.7
OJBench	27.1	24.0	11.3	15.3	19.5
AIME 2025	49.5	46.7	24.7	33.1	37.0
GPQA-Diamond	75.1	68.4	62.9	70.0	66.3
MMLU	89.5	89.4	87.0	91.5	90.4

The story: K2 is the strongest open-source non-thinking model across the board. On agentic tasks (SWE-Bench, Tau2), it dramatically outperforms DeepSeek-V3 and Qwen3 while closing the gap with Claude. On coding (LiveCodeBench, OJBench) it leads all models including proprietary. On math/STEM (AIME, GPQA) it also leads. This is not a model that trades off one capability for another — it improves everywhere.

Benchmark Comparison

K2 vs. key baselines across agentic, coding, and reasoning benchmarks. Higher is better.

LMSYS Arena: On the crowdsourced LMSYS Arena leaderboard (July 17, 2025), K2 ranks #1 among open-source models and #5 overall based on 3,000+ blind user votes. This measures real-world preference across diverse, open-ended tasks — not just benchmark performance.

On which category of benchmarks does K2 show the largest improvement over previous open-source models?

Agentic tasks (tool use, SWE): K2 scores 65.8 on SWE-Bench Verified vs 38.8 for DeepSeek-V3, a 70% relative improvement
General knowledge (MMLU)
Long-context retrieval

Chapter 8: Agentic Capabilities

Let's dig into what "agentic" actually means in practice, and examine K2's performance on the benchmarks that test it.

SWE-Bench Verified (65.8%)

SWE-Bench presents the model with real GitHub issues and asks it to produce a patch that resolves the issue. The "Verified" subset has human-checked test cases. K2 achieves 65.8% in single-attempt agentic mode (using bash/editor tools) and 71.6% with multiple attempts. For context, DeepSeek-V3 scores 38.8%, and this benchmark is considered a hard test of real-world software engineering.
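To put the gap in proportion, the difference between the two single-attempt scores can be expressed both in absolute points and as a relative improvement. A trivial sketch using only the numbers quoted above:

```python
def relative_improvement(new, old):
    """Fractional gain of `new` over `old`."""
    return (new - old) / old

k2_score, deepseek_score = 65.8, 38.8
absolute_gain = k2_score - deepseek_score
swe_gain = relative_improvement(k2_score, deepseek_score)
print(f"+{absolute_gain:.1f} points, {swe_gain:.0%} relative")
```

The relative figure rounds to about 70%, which is worth distinguishing from the 27-point absolute gap when comparing across benchmarks with different score ranges.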

SWE-Bench Multilingual (47.3%)

Same format but across multiple programming languages. K2 leads the open-source models but trails Claude Sonnet 4 (51.0%); Claude Opus was too expensive to evaluate. This tests whether agentic capabilities generalize beyond Python.

Tau2-Bench (66.1%)

Tau2-Bench evaluates multi-turn tool-calling capabilities across domains (retail, airline, telecom). The model must interact with simulated environments, call APIs, and handle multi-step reasoning. K2 achieves 66.1 on the micro-average, with 70.6 on retail and 56.5 on airline.

ACEBench (76.5%)

ACEBench evaluates comprehensive tool-use: understanding tool specifications, planning multi-step tool chains, handling errors. K2 scores 76.5 on the English version, competitive with GPT-4.1 (80.1) and above Claude Sonnet 4 (76.2).

What makes K2 agentic? The difference between "can answer questions about code" and "can actually fix a GitHub issue" is enormous. Fixing an issue requires: (1) reading the issue description, (2) navigating the codebase, (3) identifying the relevant files, (4) understanding the bug, (5) writing a patch, (6) testing it, (7) iterating if tests fail. This is a 10-20 step interaction with a real environment. K2's agentic RL training — with real sandboxes and synthetic tool-use trajectories — is what bridges this gap.

What degrades without agentic RL? The paper doesn't report ablations removing agentic RL specifically, so the evidence is indirect: comparisons with competing models. DeepSeek-V3, whose architecture K2 closely follows, scores 38.8% vs K2's 65.8% on SWE-Bench, and Qwen3-235B scores 34.4%. These gaps are consistent with the agentic training pipeline being a major differentiator, though without an ablation they cannot be attributed to it alone.

What distinguishes SWE-Bench from standard coding benchmarks like LiveCodeBench?

SWE-Bench uses harder programming languages
SWE-Bench requires multi-step agentic interaction (navigate code, identify bugs, write patches, test) in real repositories, not just solving isolated programming problems
SWE-Bench tests models on longer input contexts

Chapter 9: Connections

Kimi K2 builds on a long chain of innovations. Let's map where each piece comes from and where it leads.

Relation to DeepSeek-V3

K2's architecture is directly derived from DeepSeek-V3: same MLA attention mechanism, same MoE structure, same 61 layers. The key differences: more experts (384 vs 256), fewer attention heads (64 vs 128), and no expert grouping. K2 also uses Muon instead of AdamW, which is where MuonClip comes in.

Relation to Muon / Moonlight

Muon was introduced for its superior token efficiency. Moonlight (K2's predecessor work) demonstrated Muon at scale with consistent RMS matching and weight decay. K2 extends this with QK-Clip to handle the instability that emerges at trillion-parameter scale.

Relation to GRPO / DAPO

K2's RL objective is a variant of GRPO (Group Relative Policy Optimization): sample K responses, use their relative rewards as the signal, no separate value network. This lineage runs from PPO → GRPO → K2's objective with budget control, PTX loss, and temperature decay.

Relation to MoE Literature

The sparsity scaling law validates and extends findings from Switch Transformer and GShard: more experts with the same activated parameters consistently improve performance. K2 pushes this to sparsity 48, well beyond what prior work explored.

Cheat Sheet

Aspect	Kimi K2
Architecture	MoE transformer, MLA attention
Total params	1.04T (384 experts)
Active params	32.6B (8 experts + 1 shared)
Optimizer	MuonClip (τ = 100)
Pre-training tokens	15.5T
Loss spikes	Zero
Post-training	SFT → RL (verifiable + self-critique)
Agentic data	23K+ tools, synthetic + real trajectories
Context window	128K (YaRN extension)
Key result (agentic)	65.8 SWE-Bench, 66.1 Tau2-Bench
Key result (coding)	53.7 LiveCodeBench v6, 27.1 OJBench
Key result (math)	49.5 AIME 2025, 75.1 GPQA-Diamond
Arena ranking	#1 open-source, #5 overall (July 2025)

The broader lesson: Scaling model capacity alone is not enough. K2 shows that the training pipeline matters as much as the architecture: a stable optimizer (MuonClip) to extract maximum learning per token, a synthesis pipeline to generate agentic training data at scale, and RL to move beyond imitation to genuine interactive learning. The three pillars reinforce each other — without any one, the result would be significantly weaker.

Which prior work is K2's architecture most directly derived from?

DeepSeek-V3: same MLA attention, same MoE structure, same 61 layers, with modifications to expert count and attention heads
GPT-4
Llama 3

Kimi K2: Open Agentic Intelligence