Qwen3-Coder-Next: 80B MoE with 3B active, the open-source coding agent that rivals Claude Sonnet
🔎 Why an 80-billion parameter model runs on a Mac
Alibaba just released Qwen3-Coder-Next, an 80-billion parameter coding model that only activates 3 billion for each generated token. The result: 70.6% on SWE-Bench Verified, a score that places this open-source model on par with Claude Sonnet 4.5 on real bug-fixing tasks.
The stroke of genius is the ultra-sparse MoE (Mixture of Experts) architecture. Instead of passing every token through the entire 80 billion parameters, the model dynamically selects the 3 billion most relevant ones. This divides the computational load by a factor of over 25, all while maintaining the reasoning depth of a massive model.
The impact is immediate: this model runs on consumer hardware. No need for a $10,000/month GPU cluster. A Mac Studio M2 Ultra with 64GB of RAM, or a Linux workstation with a RTX 4090, is enough to run it locally. It's a paradigm shift for AI-assisted development.
The essentials
- Architecture: 80B MoE, 3B active parameters per token, 256K token context, Apache 2.0 license
- Performance: 70.6% on SWE-Bench Verified, best score among all locally executable models
- Training: Reinforcement Learning from environment feedback (code execution, unit tests)
- API Price: $0.11/M input tokens, $0.80/M output tokens on OpenRouter (May 2026, check on openrouter.ai)
- Family: Three sizes available — 30B-A3B, 80B-A3B (Next) and 480B-A35B
Recommended tools
| Tool | Main usage | Price (May 2026) | Ideal for |
|---|---|---|---|
| Qwen3-Coder-Next | Local agentic coding | Free (open-weight) | Devs with 32-64 GB RAM |
| Qwen3-Coder CLI | Terminal coding agent | Free (Apache 2.0) | Automated workflows |
| OpenRouter — Qwen3-Coder-Next | API coding agent | $0.11/M in, $0.80/M out | Production integration |
MoE Architecture: how 3 billion rival 200
The short answer: the model never works at full capacity, but it always chooses the right experts.
A dense model of 3 billion parameters like Qwen3.6-27B or Qwen3.5-27B (which score 74 and 63 in agentic, respectively) has limited representational capacity. It cannot store everything in weight memory. Qwen3-Coder-Next solves this problem by distributing the 80 billion parameters across specialized "experts" — each covering a specific area of code (sorting algorithms, React APIs, SQL queries, etc.).
At each token, a routing mechanism selects the most relevant experts. Only the weights of these experts are loaded into active memory. The remaining 77 billion parameters stay inactive. It's like having a library of 80,000 books but only opening 3 at a time — the exact ones you need.
The 256K token context allows for loading entire repositories. An average Python project of 50 files easily fits within this window. The model can therefore reason about cross-dependencies and multi-file architectures without losing the big picture.
This MoE approach explains why Qwen3-Coder-Next clearly outperforms dense models of similar size in terms of required VRAM. To compare the best LLMs for coding, you now have to distinguish between dense models and MoE models — VRAM metrics no longer mean the same thing.
SWE-Bench Verified : 70.6% broken down
The 70.6% score on SWE-Bench Verified is not a synthetic benchmark. It measures a model's ability to solve real bugs from popular open-source repositories (Django, Scikit-learn, Flask). The model receives a bug description, accesses the source code, proposes a patch, and the patch is automatically tested.
70.6% means that out of 100 real bugs, Qwen3-Coder-Next autonomously solves more than 70. To put this in perspective, the best agentic LLMs like GPT-5.5 (98.2) and Gemini 3 Pro Deep Think (95.4) remain far ahead — but these are proprietary models running on GPU clusters.
In the "locally runnable" category, Qwen3-Coder-Next is king. No other open-source model exceeds this score on consumer hardware. Claude Sonnet 4.6, which scores 81.4 in agentic, is not runnable locally. The comparison is therefore clear: for self-hosting, Qwen3-Coder-Next is the best available choice.
The score climbs even higher with agent scaffolding — that is, when the model is wrapped in an agent that can iterate, read compilation errors, and rerun tests. This is precisely the workflow for which Qwen3-Coder-Next was optimized.
RL Training with Environment Feedback
The fundamental difference between Qwen3-Coder-Next and classic coding models is the training method. Alibaba didn't just do supervised fine-tuning on (instruction, code) pairs. They used Reinforcement Learning where the reward comes from the actual execution of the code.
The model generates a patch. The patch is applied. The repository's unit tests execute. If the tests pass, the model receives a positive reward. If the tests fail, it receives a negative signal with the exact traceback. This feedback loop teaches the model not just to write code that looks correct, but code that works.
This is a major shift from models trained solely on code completions. These tend to produce syntactically valid but semantically incorrect code. RL with environment feedback corrects this bias.
This approach aligns with trends seen in other open-source agent projects like DeerFlow de ByteDance : l'agent open-source qui recherche, code et cree sur le long terme, which also combines execution and iteration for complex tasks. The pattern repeats itself: the best coding agents are those that can test their own output.
Local Execution: What You Really Need
Minimum Hardware Configuration
Qwen3-Coder-Next in 4-bit quantization (GGUF) requires about 40 to 45 GB of VRAM/unified RAM for the model alone. With the 256K context and KV cache, expect 50-55 GB. This runs on a Mac Studio M2 Ultra (64 GB), a Mac Pro M2 Ultra (192 GB), or a PC with two RTX 3090/4090s (48 GB combined VRAM).
On 32 GB, it is possible in 2-bit quantization (EXL2 or GGUF Q2_K), but the quality degrades noticeably. The model loses about 3 to 5 points on SWE-Bench compared to the 4-bit version. For the meilleurs LLM à run en local, the 64 GB threshold is therefore the recommended baseline to run Qwen3-Coder-Next properly.
For a step-by-step guide, the method remains identical to the one described in our guide d'installation LLM local: Ollama or LM Studio as a backend, then connection via the Qwen3-Coder CLI or via an IDE.
Deployment via Ollama and the Official CLI
The QwenLM/Qwen3-Coder GitHub repo provides a dedicated CLI that handles the complete agent cycle: reading the repository, generating patches, running tests, and iterating. The typical workflow looks like this: you point the CLI to a local Git repository, describe the bug or feature, and the agent iterates until it is resolved.
For the meilleurs modèles Ollama, Qwen3-Coder-Next now joins the top tier alongside Llama 4 and Qwen3. If you configure des agents IA open-source avec Ollama en local, this model is now the natural candidate for agentic coding tasks.
The Complete Qwen3-Coder Family
Alibaba did not release a single model but an entire lineup. Besides the 80B-A3B (Next), there is the Qwen3-Coder-30B-A3B for more modest machines (24 GB VRAM is enough in 4-bit), and the Qwen3-Coder-480B-A35B for server deployments with heavy scaffolding. The latter, with 35B active parameters, aims for the performance of Claude Opus 4.6 (84.7) but requires serious multi-GPU infrastructure.
API Costs: the price-to-performance ratio crushes the competition
On OpenRouter, Qwen3-Coder-Next costs $0.11/million tokens in input and $0.80/million in output (May 2026, check on openrouter.ai). For a typical agentic bug-fixing workflow — which consumes about 50K input tokens (the repo + context) and 10K output tokens (the patch + reasoning) — that comes out to about $0.013 per task.
Compare this with an equivalent workflow via Claude Sonnet 4.6 on the Anthropic API: about $0.15 per task, or more than 10 times more expensive. For a team solving 200 bugs/week via agents, the difference amounts to thousands of dollars per month.
The model remains, of course, free locally — you only pay for electricity. This is where the price-to-performance ratio becomes absurd: a model that rivals a $20/month Claude Pro subscription, running for free on your own machine. If you're looking for meilleurs LLM gratuits, Qwen3-Coder-Next redefines the category.
Positioning in the AI ecosystem of May 2026
The LLM landscape is fragmented into several layers. At the top, proprietary agentic models (GPT-5.5 at 98.2, Gemini 3 Pro Deep Think at 95.4, Claude Opus 4.7 at 94.3) dominate complex benchmarks. In the middle, general open-source models like DeepSeek V4 Pro (88 in open-source), Kimi K2.6 (85) and GLM-5.1 (83) offer excellent general performance.
Qwen3-Coder-Next does not seek to compete on the general agentic benchmark. It is specialized — and that is its strength. A generalist model like DeepSeek V4 Pro High (84) will be better for writing an email or analyzing a PDF. But on SWE-Bench, the specialized model wins because all of its capacity is optimized for code.
This positioning is reminiscent of the strategy of GenericAgent : l agent IA open-source qui construit son propre arbre de competences, which bets on progressive specialization rather than a single generalist model. The future of agentic coding is not a single model that does everything, but an ecosystem of specialists.
For the comparatif Claude vs ChatGPT, the arrival of Qwen3-Coder-Next adds a third player. The question is no longer just "Claude or GPT for coding?" but "why pay for either when an open-source model does the job locally?"
Concrete use cases: when Qwen3-Coder-Next shines
Bug resolution in existing repositories
This is the main use case for which the model was designed. You clone a repository, launch the Qwen3-Coder CLI, and describe the bug. The agent reads the code, identifies the source of the problem, generates a patch, tests it, and iterates. On SWE-Bench, it works 70% of the time. In real life, with repositories less complex than those in the benchmark, the success rate is often higher.
Code refactoring and migration
The 256K context allows you to load an entire module and request coherent refactoring — for example, migrating from a deprecated API to the new one, or restructuring a monolith into modules. The model understands cross-dependencies and produces refactoring that compiles on the first try, thanks to its RL training.
Local pair programming without latency
Unlike APIs where each request adds 200-500ms of network latency, local execution offers response times under 100ms in optimized inference. For interactive pair programming in VS Code or Cursor, this responsiveness makes the difference between a smooth assistant and one that breaks your flow.
If your machine doesn't have enough RAM for Qwen3-Coder-Next, the meilleurs LLM locaux offer lighter alternatives like Qwen 2.5 Coder 32B, which remains a solid FIM (Fill-in-the-Middle) option on 24 GB of VRAM according to InsiderLLM.
Limitations: what the model doesn't do (yet)
No multimodality
Qwen3-Coder-Next is a purely text-based model. It cannot read UI screenshots, diagrams, or Figma mockups. For these tasks, you need to turn to the multimodal models in the Qwen ecosystem (Qwen-VL) or to Claude Opus 4.7, which excels in visual understanding.
Limited chain-of-thought reasoning on architectural problems
The model shines at resolving localized bugs. On software architecture problems — "design a distributed messaging system with delivery guarantees" — proprietary models with explicit reasoning (Gemini 3 Pro Deep Think, o1-preview at 90.2) remain clearly superior. Qwen3-Coder-Next's RL optimizes for code that compiles and passes tests, not for design docs.
Integration ecosystem still young
The official CLI is functional but minimal. Compared to the ecosystem around Claude (Anthropic SDK, native integrations in Cursor, Windsurf, etc.) or GPT-5.5 (Copilot Enterprise, Actions), integrating Qwen3-Coder-Next into IDEs still requires some tinkering via Ollama and generic extensions.
❌ Common mistakes
Mistake 1: Comparing the total 80B with a 3B dense model
What's wrong: saying "Qwen3-Coder-Next is a 3B model" is misleading. The 3B are the active parameters, but the representational capacity depends on the total 80B. A dense Qwen3.6-27B (27B active out of 27B total) does not have the same depth of knowledge. The solution: always specify "80B MoE with 3B active per token".
Mistake 2: Quantizing too aggressively to run it on 16 GB
What's wrong: pushing the quantization to Q2_K to run the model on a 16 GB MacBook Pro destroys the MoE advantage. Poorly quantized experts lose their specialization, and the SWE-Bench score drops by 10+ points. The solution: accept that 32 GB is the realistic minimum, or use the Qwen3-Coder-30B-A3B instead.
Mistake 3: Using it as a simple autocomplete
What's wrong: plugging Qwen3-Coder-Next as a FIM model in VS Code for line-by-line autocomplete wastes its strength. The model is optimized for agentic reasoning over entire repositories, not for predicting the next 5 words. The solution: use it via the agent CLI, not as an autocomplete backend. For that, a lighter model is sufficient.
Mistake 4: Ignoring the agent scaffolding
What's wrong: evaluating Qwen3-Coder-Next in single-shot (a single generation, no iteration) yields a score 10-15 points below its potential. The model is designed to operate within an agentic loop. The solution: always test it with the official CLI or an agent framework that allows for iteration and test execution.
❓ Frequently Asked Questions
Does Qwen3-Coder-Next replace Claude for coding?
No. Claude Sonnet 4.6 (81.4 agentic) remains superior in general reasoning and multimodality. But Qwen3-Coder-Next is the best choice locally and for free, which is a game-changer for teams sensitive to costs or privacy.
What is the difference between Qwen3-Coder-Next and Qwen 2.5 Coder 32B?
Qwen 2.5 Coder 32B is a classic dense model, good at FIM autocompletion on 24 GB VRAM. Qwen3-Coder-Next is an agentic MoE model optimized for iterative bug resolution. According to InsiderLLM, the 2.5 Coder remains relevant for FIM, but the Coder-Next is the ultimate coding choice on 64 GB+.
Does the model work in French?
The Qwen ecosystem supports French, but Qwen3-Coder-Next is optimized for code — a universal language. For explanations in French or writing, check out the best LLMs in French. For pure code, the interface language hardly matters.
Will Qwen 3.5 preview improve coding performance?
Qwen 3.5 is scheduled for June 2026 according to InsiderLLM. It could integrate Coder-Next's improvements into a generalist model. But for pure agentic coding, Coder-Next's specialization will likely remain superior.
Can it be hosted on a VPS?
Yes. A VPS with 2x A100 40GB or 1x A100 80GB is enough for the non-quantized version. If you are looking for a host, Hostinger offers GPU VPSs suited for this use case. The model under the Apache 2.0 license allows for any commercial use.
✅ Conclusion
Qwen3-Coder-Next proves that open-source agentic coding has caught up with the level of mid-range proprietary models — and does so running on consumer hardware. At 70.6% on SWE-Bench Verified with only 3B active parameters, it's the model every developer should test locally before renewing a cloud subscription.