Qwen3 Coder Next : the open-source model that runs on a 64 GB Mac and beats DeepSeek in coding

LLM & Modèles 🟢 Beginner ⏱️ 13 min read 📅 2026-06-12

Qwen3 Coder Next : the open-source model that runs on a 64 GB Mac and beats DeepSeek in coding

🔎 Why an 80B model just made local coding serious

On June 12, 2026, Alibaba quietly released Qwen3-Coder-Next. No keynote, no media launch. Just a technical report on arXiv and a GitHub repository. Yet, this model just moved a boundary line we thought was set in stone: the threshold below which an open-source model can replace Claude Sonnet for real development work, locally, on hardware a freelancer can afford.

The striking number: 74.2% on SWE-Bench Verified with only 3 billion active parameters per token. DeepSeek V3.2 caps at 40.9% on SWE-Bench Pro under the same conditions. Even accounting for benchmark version differences, the gap is enough to demand a serious look.

The real novelty isn't the raw score. It's the performance/compute cost ratio. An 80B model that only activates 3.7% at each forward pass is an efficiency that the dense architecture of DeepSeek V4 Pro cannot replicate. And on a MacBook Pro M4 Max with 64 GB of unified RAM, it runs at about 12 tokens per second in 4-bit quantization. Enough for an interactive code agent.

The Essentials

80B/3B Active MoE Architecture: 80 billion total parameters, but only 3 billion are computed per token — the rest remain inactive, drastically reducing inference cost.
74.2% on SWE-Bench Verified: comparable to what expensive proprietary models offer, according to the official technical report on arXiv.
Runs on a 64 GB Mac at ~12 tok/s: in 4-bit quantization via Ollama or llama.cpp, confirmed by the local deployment guide.
Apache 2.0 License: free commercial use, no restrictions unlike DeepSeek V3.1 which opted for the MIT license — both approaches are equally valid here, but Apache 2.0 is more protective regarding patents.
Available everywhere: HuggingFace, OpenRouter, Ollama, LM Studio from day one.

Recommended Tools

Tool	Main Usage	Price (June 2026, check website)	Ideal for
Ollama	Local CLI execution	Free	Mac/Linux developers, terminal workflow
LM Studio	Local GUI	Free	Beginners, visual exploration
OpenRouter	Unified cloud API	Pay-as-you-go (~$0.15/1M tok input)	Quick testing without local hardware
HuggingFace	Weights download	Free	Advanced self-hosting, fine-tuning

Architecture: the hybrid MoE that changes the game

Qwen3-Coder-Next does not resemble previous generation code models. The architecture combines two rarely associated mechanisms: sparse Mixture of Experts (MoE) and hybrid linear attention.

80B/3B MoE: the sweet spot finally found

The model totals 80 billion parameters distributed across expert layers. At each token, a routing mechanism activates only 3 billion of them. In practice, this means the computational cost per token is that of a 3B model, but the representational capacity is that of an 80B.

The arXiv technical report details this architecture: experts are specialized during training, with some focusing on code comprehension, others on logical reasoning, and still others on patch generation. The router learns to select the right combination based on the context.

This is fundamentally different from the dense approach of DeepSeek V4 Pro (which activates all its parameters at each token) and more granular than the MoE of Qwen3.5-122B-A10B which activates 10B per token. Here, 3B active parameters are sufficient thanks to the quality of the routing and the hybrid attention.

Hybrid linear attention: why it's crucial for local

Standard attention has an O(n²) complexity relative to context length. Linear attention reduces this to O(n). Qwen3-Coder-Next alternates between the two depending on the layer: the lower layers use standard attention to capture fine local dependencies (syntax, variable names), while the upper layers switch to linear attention to process global context (project architecture, cross-file dependencies).

This hybridation makes it possible to handle long contexts — typically an entire code repository — without memory exploding. This is what makes execution possible on 64 GB of unified RAM with usable performance.

Performances: the benchmarks that matter in coding

Raw scores mean nothing without context. Here is where Qwen3-Coder-Next stands against the competition on real-world code benchmarks.

SWE-Bench: the real test

SWE-Bench measures a model's ability to solve real GitHub tickets. It is the most relevant benchmark for evaluating a code agent.

Model	SWE-Bench Verified	SWE-Bench Pro	Active parameters	Access
Qwen3-Coder-Next	74.2%	44.3%	3B	Open-weight (Apache 2.0)
GPT-5.5 (OpenAI)	~82% (estimated)	~52% (estimated)	N/A	Proprietary
Claude Sonnet 4.6	~72% (estimated)	~41% (estimated)	N/A	Proprietary
DeepSeek V3.2	N/A	40.9%	~37B (dense)	Open-weight (MIT)
GLM-4.7	N/A	40.6%	N/A	Proprietary
DeepSeek V4 Pro	N/A	N/A	Dense	Open-weight (MIT)

Sources: ChatForest review, Agent Market Cap analysis, arXiv report.

Two things stand out. First: Qwen3-Coder-Next beats DeepSeek V3.2 on SWE-Bench Pro with 12 times fewer active parameters. Second: it directly rivals Claude Sonnet 4.6 (agentic score of 81.4 in the overall ranking) on real ticket resolution, all while running locally.

Terminal-Bench and multilingualism

The model reaches 63.7% on SWE-Bench Multilingual, a subset focused on non-English repositories. This is a strong signal for French-speaking developers working on codebases with comments and documentation in French — a topic we already covered in our comparison of the best LLMs in French.

On Terminal-Bench (the ability to execute correct shell commands in an agentic environment), Qwen3-Coder-Next shows solid results in the technical report, confirming its vocation as a complete code agent, not just a snippet generator.

Local deployment: the practical guide

This is where the model becomes interesting for the independent developer. A score of 74% on SWE-Bench is useless if you can't get it to run.

On a 64GB Mac: the viable configuration

The reference guide for Mac Silicon is categorical: with 64GB of unified RAM, Qwen3-Coder-Next is the recommended choice for serious coding. On 32GB, you have to fall back on Qwen3.5-35B-A3B.

In practice, with Ollama and Q4_K_M (4-bit) quantization, the model takes up about 42-45 GB of RAM. This leaves 19-22 GB for the context, the system, and other applications. The measured throughput is around 12 tokens per second — sufficient for a code agent that iterates on patches, but not for fluid conversational chat.

The complete installation guide on dev.to details the step-by-step procedure. The Codersera guide covers Ollama and llama.cpp specifics.

For those who want to go further with local setups, our local LLM installation guide covers the basics of Ollama and LM Studio. And to explore other local options, the comparison of the best LLMs to run locally remains the reference.

On NVIDIA GPUs: 1 H100 or 2x RTX 5090

The Local AI Master guide confirms that the model runs comfortably on a single H100 (80 GB VRAM) at full precision, or on two RTX 5090s in a split configuration. In the latter case, throughput rises to 25-30 tok/s — a clear comfort for agentic work.

Via OpenRouter: to test without investing

If you don't have the hardware, OpenRouter offers Qwen3-Coder-Next via API. The cost is estimated at around $0.15 per million input tokens (June 2026, check on openrouter.ai). It's cheap enough to integrate the model into a CI/CD pipeline or an automated code review tool.

Comparison with the competition: where Qwen3-Coder-Next stands

Against DeepSeek: sparse MoE wins

The most natural comparison is with DeepSeek V3.1 and its MIT license. DeepSeek popularized MoE in the open-source world, but its architecture remains denser than that of Qwen3-Coder-Next.

DeepSeek V4 Pro, the open-source ranking leader with a score of 88, is a massive model requiring serious infrastructure. Qwen3-Coder-Next makes a different philosophical choice: sacrificing 15-20 points of raw performance to be executable on a laptop. It's a compromise that most independent developers will be happy to make.

On SWE-Bench Pro, Qwen3-Coder-Next's 44.3% compared to 40.9% for DeepSeek V3.2 (source: Agent Market Cap) shows that more aggressive expert routing (3B vs ~37B active) is not a handicap when well designed.

Against Claude and GPT: the tipping point

Claude Sonnet 4.6 (agentic score 81.4) and GPT-5.5 (agentic score 98.2) remain above in raw capability. But Qwen3-Coder-Next handles "75-80% of what Claude Sonnet 5 does" according to the evaluation by Local AI Master. For a developer solving 20 tickets per week, if 15 of them can be processed locally without sending their code to Anthropic, that's a massive gain in privacy and cost.

The parallel with Meta Muse Spark and its pivot to closed-source is enlightening. While Meta closes its flagship model, Alibaba opens its own. The open-source dynamic is shifting towards the Qwen ecosystem, and Qwen3-Coder-Next is its best demonstration.

In the Qwen landscape: where it stands

In the Qwen family, the code model clearly stands out from generalist models. Qwen3.6-27B (score 74 in the general ranking) and Qwen3.5-35B-A3B (score 67) are good compact models, but they are not optimized for the agentic code workflow. Qwen3-Coder-Next was specifically trained with a large-scale agentic curriculum — it doesn't just complete code, it plans, executes, iterates, and corrects.

The agentic workflow: how to use Qwen3-Coder-Next as a code agent

A code model is not just an autocomplete. Qwen3-Coder-Next was designed from the ground up as an agent, not as a completion engine.

Agentic training makes the difference

The technical report describes a multi-phase training process. The first phase is classic pre-training on code. But the subsequent phases inject agentic trajectories: the model learns to read a repository, identify the relevant file, generate a patch, test it, and iterate in case of failure. This is not prompt engineering applied after the fact — it's baked into the weights.

This is what makes it naturally compatible with agent frameworks like the ones described in our guide on open source AI agents with Ollama. The model natively understands tool formats (tool calls), feedback loops, and repair strategies.

Recommended configuration for a local agent

The dev.to guide recommends a specific configuration: temperature 0.1-0.2 for code patches (precision), 0.6-0.7 for planning and solution exploration. The maximum supported context allows loading an entire medium-sized repository in a single pass.

For developers who want to go further into the agentic approach, our article on the best LLMs for AI agents details compatible architectures. And to understand the broader context, the comparison of the best LLMs for coding situates Qwen3-Coder-Next within the ecosystem.

The MoE 80B/3B: the new sweet spot for local

The industry was looking for the perfect balance between capability and local inference cost. Qwen3-Coder-Next strongly suggests that this point is around 80B total / 3B active.

Why not smaller?

A pure dense 3B model (like Qwen3.5-35B-A3B in "all-active" mode) lacks the knowledge diversity of an 80B MoE. Specialized experts bring a depth that dense compression cannot reproduce. On coding tasks that require understanding obscure APIs, legacy frameworks, or rare architectural patterns, the MoE systematically outperforms the dense model at the same inference cost.

Why not larger?

A 400B MoE like Qwen3.5 397B (score of 64 in the general ranking) requires a minimum of 128 GB of VRAM in 4-bit. That's server territory, not laptop territory. The performance/hardware ratio of Qwen3-Coder-Next is optimal precisely because it was designed for the "64 GB of unified RAM" constraint.

This sweet spot is important for the best Ollama models because it defines a new category: "laptop-capable but agent-grade" models. Until now, you had to choose between "runs on my Mac but limited performance" and "high performance but server required". Qwen3-Coder-Next eliminates this compromise.

❌ Common mistakes

Mistake 1: Comparing total and active parameters without distinction

Confusing the "80B parameters" of Qwen3-Coder-Next with "80B dense" is a fundamental mistake. This model only uses 3B at each token. Comparing it directly to a dense 70B model on inference cost makes no sense. The 77B inactive parameters cost nothing in compute — they only cost in VRAM for storing the weights.

Mistake 2: Ignoring the necessary quantization on a 64GB Mac

Trying to run Qwen3-Coder-Next in full precision (FP16) on 64GB of unified RAM will crash or massively swap. 4-bit quantization (Q4_K_M via GGUF) is not optional — it is required. The model goes from ~160GB (FP16) to ~42GB (Q4_K_M). The quality loss is negligible for coding tasks, as confirmed by the benchmarks in the technical report.

Mistake 3: Using it as a simple autocomplete

Qwen3-Coder-Next is designed for agentic workflows (plan → code → test → iteration). Using it as a simple line completion in VS Code means underutilizing 90% of its value. Connect it to an agent framework, give it access to your terminal and your tests, and let it work on complete tickets.

Mistake 4: Neglecting the available context

With 64GB of RAM and a model taking up 42GB, you have ~20GB left for the context. This is enough for a medium-sized repository, but not for a 500,000-line monorepo. Pre-filter the relevant files before injecting them into the context, or use a retrieval system to provide only what is necessary.

❓ Frequently Asked Questions

Does Qwen3-Coder-Next really replace Claude Sonnet for coding?

No, not entirely. It handles about 75-80% of the tasks that Claude Sonnet 5 processes, according to Local AI Master. For complex tickets requiring advanced multi-step reasoning, Claude remains superior. But for the majority of daily tasks, Qwen3-Coder-Next is sufficient, locally and for free.

What is the difference between the Apache 2.0 license and DeepSeek's MIT license?

Both allow commercial use. Apache 2.0 includes an explicit patent grant clause (protection against patent lawsuits from the licensor) and requirements for noting modifications. MIT is more permissive but less protective. For enterprise use, Apache 2.0 is often preferred by legal teams.

Can you fine-tune Qwen3-Coder-Next on your own codebase?

Yes, the Apache 2.0 license allows it. In practice, full fine-tuning of an 80B MoE requires significant hardware (multi-GPU). Efficient fine-tuning (LoRA/QLoRA) on the attention layers is more realistic on an accessible setup. The official GitHub repository provides the necessary scripts.

Is 12 tok/s really usable for a code agent?

Yes, for the right workflow. A code agent spends 80% of its time reading and analyzing (low generation), and 20% generating patches (high generation). During high generation moments, 12 tok/s produce a 200-line patch in about 30 seconds. That is acceptable. For conversational chat, it is slow — but that is not the targeted use case.

Is Qwen3-Coder-Next better than other Qwen models for code?

Yes, significantly. Generalist Qwen models like Qwen3.6-27B or Qwen3.5-35B-A3B are competent in code but have not received the specific agentic training. Qwen3-Coder-Next is a specialized model, not a versatile generalist. For code, it clearly outperforms its siblings.

✅ Conclusion

Qwen3-Coder-Next is the first open-weight model that makes serious agentic coding possible on a developer's laptop. With 74.2% on SWE-Bench Verified and 3B active parameters, it proves that well-designed sparse MoE can bridge the gap between local and proprietary cloud. If you have a 64GB Mac or two RTX 5090, download it on HuggingFace, install Ollama, and test it on your real tickets — the numbers confirm what the experience will show you.

#intelligence-artificielle #deepseek #qwen3-coder-next #modele-open-source #coding-local #mac-64-go

📚 Related articles

LLM & Modèles 🟢 Débutant 12 min

July 17: Gemini 3.5 Pro and Shanghai's WAIC collide — the day AI officially goes bipolar

On July 17, 2026, the Gemini 3.5 Pro launch and Shanghai WAIC illustrate two opposing visions. Discover this key day for AI.

2026-07-14 17:03

LLM & Modèles 🟢 Débutant 14 min

GPT-Live : OpenAI launches full-duplex voice — AI agents can finally listen and speak at the same time

OpenAI launches GPT-Live with full-duplex voice. Discover how AI agents can finally listen and speak at the same time.

2026-07-13 15:04

LLM & Modèles 🟢 Débutant 11 min

Meta Muse Spark 1.1 : Meta launches its first paid model and enters the agentic coding battle

Discover Meta Muse Spark 1.1, Meta's first paid model. The giant enters the agentic coding battle and changes strategy.

2026-07-11 15:02

📑 Table of contents