Kimi K2.7-Code: the 1T parameter open-source coding model that cuts 30% of reasoning tokens and beats Opus in tool use
🔎 Two major open-source coding models in 72 hours — China isn't slowing down
On June 12, 2026, Moonshot AI releases Kimi K2.7-Code. Just two days earlier, Qwen3 Coder Next arrived with its promise of running on a 64GB Mac. The pace has become infernal: two frontier-class open-weight models for code, released almost simultaneously, both coming from Chinese labs.
The stakes go beyond a simple product announcement. Kimi K2.7-Code pushes a numerical argument that hurts proprietary models: 30% fewer reasoning tokens for a superior result compared to its predecessor, and a price per token up to 12 times lower than GPT-5.5 or Claude Opus 4.8.
The real question is no longer "open-source vs. closed on quality?" but "does the saved budget make up for the remaining gap?" — and that gap shrinks with every release.
The essentials
- Architecture: MoE with 1 trillion total parameters, 32B activated per token, 384 experts, 256K token context window.
- Performance: Score of 62.0 on Kimi Code Bench v2, representing a +21.8% increase over K2.6 (50.9). 81.1% on MCPMark Verified, ahead of several frontier closed models in tool use.
- Efficiency: ~30% reduction in reasoning tokens thanks to a completely revamped reward model and data pipeline.
- Pricing: $0.95/$4.00 per million input/output tokens via the Kimi API, $0.75/$3.50 via OpenRouter (June 2026, verify on openrouter.ai).
- License: Modified MIT, weights available on HuggingFace.
Recommended tools
| Outil | Main usage | Price (June 2026, check the website) | Ideal for |
|---|---|---|---|
| OpenRouter | K2.7-Code API access | $0.75/$3.50 per M tokens | Developers who want to test without setup |
| HuggingFace | Weight download | Free (self-hosting) | Teams with GPU infra |
| API Kimi | Direct model access | $0.95/$4.00 per M tokens | Production integration in China |
Architecture: 1 trillion parameters, but only 32B activated per token
Kimi K2.7-Code is based on a massive Mixture of Experts (MoE) architecture: 1 trillion parameters in total spread across 384 experts, but only 32 billion are activated for each generated token. This is the same logic as with large MoE models: total capacity is huge, while inference cost remains controlled.
The model is built directly on Kimi K2.6, Moonshot AI's generalist model that scores 88.1 on agentic benchmarks (according to llm-stats.com, June 2026). But K2.7-Code is not just a simple light fine-tune. Moonshot AI completely reworked the data pipeline and the reward model around real, long-horizon coding tasks.
The context window goes up to 256K tokens, enough to ingest entire codebases or extended debugging sessions. Thinking mode is native: the model reasons before coding, but it does so with 30% fewer tokens than K2.6 on the same tasks. That means less noise, more direct answers, and above all, an API bill that goes down accordingly.
For the monthly comparison of the best LLMs, K2.7-Code falls into a specific category: open-weight coding specialists, neither generalists like GPT-5.5, nor purely local like small models.
Coding benchmarks: +21.8% on Kimi Code Bench v2
The figures are clear and sourced. On the Kimi Code Bench v2, Moonshot AI's internal benchmark evaluating real-world programming tasks, K2.7-Code reaches 62.0 compared to 50.9 for K2.6. That is a +21.8% improvement in a single iteration (Codersera, June 2026).
For context: this benchmark measures the ability to complete end-to-end tasks — not just generating a snippet, but understanding context, navigating a repo, and producing functional code. A jump of more than 11 points on this scale is unusual for a model of this generation.
The model was evaluated with 248,000+ behavioral tests generated by fuzzing, according to the HuggingFace model card published on June 12, 2026. This testing method is more robust than static benchmarks because it covers edge cases and unforeseen scenarios.
That said, K2.7-Code still trails GPT-5.5 (98.2 on agentic benchmarks according to llm-stats.com) and Claude Opus 4.8 in terms of raw score on complex coding tasks. The gap exists. But it is narrowing, and the price-to-performance ratio changes the game.
Tool use : 81.1% on MCPMark Verified, ahead of closed models
This is perhaps the most surprising figure of this release. On MCPMark Verified, the benchmark of reference for measuring a model's ability to use external tools via the MCP (Model Context Protocol) protocol, K2.7-Code reaches 81.1%.
This score places it ahead of several frontier closed models in agentic tool use. Concretely, this means that K2.7-Code is capable of calling APIs, navigating file systems, interacting with build and deployment tools, all reliably.
For the best LLMs for AI agents, this score is decisive. A coding agent doesn't just generate text: it reads files, executes commands, checks results, iterates. Tool use capability is the main limiting factor of current coding agents, more than the raw quality of the generated code.
According to EmpirioLabs (June 2026), this result on MCPMark confirms that Moonshot AI's work on the tool use-focused reward model has paid off. The model was specifically trained on real tool interaction traces, not just on static code.
Pricing: up to 12 times cheaper than frontier closed models
The prices speak for themselves. Via the Kimi API, K2.7-Code costs $0.95 per million input tokens and $4.00 for output. Via OpenRouter, it's even cheaper: $0.75/$3.50 (June 2026, check on openrouter.ai).
The Decoder (June 2026) calculates that this represents up to 12 times cheaper than GPT-5.5 or Claude Opus 4.8 on a per-token basis. Handy AI Substack positions K2.7-Code as "the budget answer to Fable 5 ($10/$50)", roughly 4 times cheaper on output tokens than frontier closed models.
But the real calculation isn't "same result for cheaper". It's: "with the same budget, how many additional iterations can you do?" If a coding agent needs 5 cycles of thinking-action-verification to solve a complex bug, and each cycle costs 12 times less, you can either drastically reduce your costs or multiply your attempts to achieve a success rate equivalent to closed models.
This is exactly the question raised by The Decoder: do the additional runs for the same budget make up for the quality gap? For many production use cases, the answer tends towards yes.
For teams comparing the meilleurs LLM gratuits or low-cost options, K2.7-Code becomes a serious option even against Freemium models.
K2.7-Code vs Qwen3 Coder Next : two visions of open-source coding
The near-simultaneous release creates an inevitable comparison. Qwen3 Coder Next arrived two days earlier, with a different positioning: a model optimized to run locally on consumer hardware, notably a 64 GB Mac.
K2.7-Code, on the other hand, is not targeting local. With 1T total parameters and 32B active, it requires serious GPU infrastructure for self-hosting. Its playground is the API and the cloud.
According to AIMadeTools (June 2026), the comparison is clear-cut: in pure coding, K2.7-Code leads. In tool use, K2.7-Code leads clearly (81.1% MCPMark). In heavy and generalist reasoning, Qwen 3.7 leads (92.4% GPQA). Two philosophies: the agentic specialist vs the generalist that also codes.
To choose between the best LLMs for coding, it all depends on the workflow. If you are building an agent that needs to call tools, read repos, and iterate — K2.7-Code has the advantage. If you want a local model that runs on your machine without an external GPU — Qwen3 Coder Next is more suitable.
The landscape of the best open-source LLMs is enriched by two complementary options rather than competing ones.
Comparison with closed models: GPT-5.5, Claude Opus 4.8, Gemini 3.5 Flash
K2.7-Code deliberately positions itself against closed models on the cost front. But what about raw quality?
GPT-5.5 dominates agentic benchmarks with 98.2 (llm-stats.com, June 2026). Claude Opus 4.8 remains the gold standard for complex code in "max effort" mode in Claude Code. Gemini 3.5 Flash even beats Opus 4.7 and GPT-5.5 on certain agent benchmarks with 289 tokens/second.
Against these heavyweights, K2.7-Code makes no claim to win on raw score. It wins on value for money, and on a specific point: tool use, where its MCPMark score of 81.1% places it at the top of the pack across all categories.
Totalum (June 2026) offers a pragmatic reading: in production integration for app builders, the winning recipe is no longer "one expensive frontier model" but "a pipeline that combines K2.7-Code for repetitive tool use tasks and a frontier model for critical decisions". This is the routing/cascading approach that is becoming widespread.
For the best AI tools for code like Cursor or Cline, integrating K2.7-Code as a secondary model — for autocompletion, unit tests, refactoring — is an obvious use case.
Expanded Competition: MiniMax M3, DeepSeek V4-Pro, the mid-2026 coding landscape
K2.7-Code is not isolated. Flowtivity (June 2026) notes the near-simultaneous release of MiniMax M3, another Chinese open-source coding model with a different architecture and focus. Kilo.ai (June 2026) provides a comprehensive overview of 2026 open-source coding models that includes GLM-5.1, MiniMax M3, Kimi K2.6, DeepSeek V4-Pro, V4-Flash, and Qwen3-Coder-Next for agentic work.
For agents that search, code, and create over the long term, solutions like ByteDance's DeerFlow rely precisely on this new generation of open-weight coding models.
The strong signal: China is now producing open-weight coding models at a rate of one per week. The open-source LLM war has changed in nature — it is no longer about catching up with closed models, but about surrounding them through specialization and price.
Self-hosting and integration: what you need to know
Self-hosting K2.7-Code is possible — the weights are on HuggingFace under a modified MIT license. But it's clearly a model designed for the API rather than for local use. With 32B parameters activated per token, you need at least a machine with 2-3 high-end GPUs (A100 80GB or equivalent) for comfortable batch 1 inference.
For the local LLM installation guide via Ollama or LM Studio, K2.7-Code is not the best candidate. If local is your constraint, Qwen3.6-27B (mentioned by Kilo.ai as the best model for local development) or the best LLMs to run locally are more suitable.
On the other hand, for open-source AI agents with Ollama, the hybrid architecture becomes interesting: a lightweight local model for simple tasks, and K2.7-Code via API for heavy agentic tasks requiring tool use.
API integration is standard: OpenAI compatibility, availability on OpenRouter, endpoints documented by Kimi. Thinking mode is enabled by default — no special configuration necessary.
Modified MIT license: what's new?
The license is a point of attention. Moonshot AI uses a "modified MIT", which means the weights are open and freely usable, with certain restrictions compared to the standard MIT. Sources differ on the exact details of these restrictions.
What is clear: it is more open than closed models (GPT-5.5, Claude Opus 4.8), but potentially more restrictive than the pure MIT applied by some competitors. For production use, reading the license on the HuggingFace page is essential before deploying.
Kimi Claw 24/7 Bench : the agentic persistence test
A specific benchmark deserves attention: the Kimi Claw 24/7 Bench. It evaluates a model's ability to maintain persistent agentic tasks over several days — a real-world scenario for development agents that must resume a context after an interruption.
The HuggingFace card mentions comparisons with Claude Opus 4.8 in "max effort" setting in Claude Code on this benchmark. The exact details of the scores are not public in the sources consulted, but the very existence of this benchmark in Moonshot AI's communication indicates the direction: coding models are no longer evaluated on unitary tasks, but on their ability to function as autonomous workers over the long term.
This is consistent with the trend of meilleurs LLM pour la recherche and agents that must maintain reasoning over extended periods.
❌ Common mistakes
Mistake 1: Comparing K2.7-Code to GPT-5.5 on raw score alone
What's wrong: looking at the agentic leaderboard (98.2 vs an unranked score) and concluding that K2.7-Code is useless. The solution: evaluate based on price/performance for your specific use case. An agent that performs 50 tool use iterations per task does not consume the same budget as a single prompt.
Mistake 2: Trying to run K2.7-Code locally on a Mac
What's wrong: 32B parameters activated per token is not a local model. Even with aggressive quantization, the experience will be degraded. The solution: use the Kimi API or OpenRouter, and reserve local execution for models explicitly sized for that.
Mistake 3: Ignoring thinking mode and treating K2.7-Code as a classic completion model
What's wrong: the model is designed to reason before coding. Short-circuiting it by disabling thinking significantly reduces its performance, especially in tool use. The solution: leave thinking mode active and budget accordingly — the 30% savings are already calculated relative to K2.6.
Mistake 4: Assuming "open-weight" means "without restrictions"
What's wrong: the modified MIT license may contain clauses limiting commercial use or redistribution in certain contexts. The solution: read the full license on HuggingFace before any production deployment.
❓ Frequently Asked Questions
Is Kimi K2.7-Code really open-source?
The weights are open-weight under a modified MIT license, available on HuggingFace. The training code is not published. This is the current standard for Chinese "open-source" models — opening the weights, not the complete pipeline.
Can K2.7-Code be used with Cursor or Copilot?
Via API, yes — the model is OpenAI-compatible. You need to configure a custom endpoint in your IDE. This is relevant for refactoring or test generation tasks where the cost per token matters more than the absolute score.
What is the real advantage of 30% fewer reasoning tokens?
Fewer reasoning tokens = faster responses, reduced costs, and less noise in the context. For an agent chaining dozens of iterations, this saving multiplies and becomes significant over a work session.
K2.7-Code or Qwen3 Coder Next for a solo developer?
If you are working locally without an external GPU: Qwen3 Coder Next. If you can pay for the API and need reliable tool use: K2.7-Code. Both models target different workflows.
Is the MCPMark score of 81.1% comparable to other benchmarks?
MCPMark Verified is a benchmark specific to tool use via the MCP protocol. It measures the reliability of tool calls, not the quality of the generated code. A good MCPMark score means the agent does not "miss" its tool calls — which is critical for automated agentic workflows.
✅ Conclusion
Kimi K2.7-Code does not beat GPT-5.5 on the raw score, but it changes the economic equation of agentic coding: 1T parameters, tool use at 81.1% on MCPMark, and a price up to 12x lower than closed models. For teams building coding agents in production, it is the first model to test via OpenRouter before budgeting for a frontier model.