
CacheRL: A Qwen3-4B model achieves 92% accuracy in tool-calling with 100 times less compute than GPT-5

LLM & Models 🟢 Beginner ⏱️ 13 min read 📅 2026-06-16

🔎 Why a 4-billion parameter model just made GPT-5 obsolete for tool-calling

Tool-calling is the backbone of AI agents. Without the ability to reliably call functions over multiple conversation turns, an agent remains a simple chatbot. Until now, the implicit rule of the market was clear: for robust multi-turn tool-calling, you needed a massive model — GPT-5, Claude Opus 4, or at the very least a DeepSeek V4 Pro.

On June 12, 2026, a paper posted to arXiv (2606.14179v1) challenged this assumption. CacheRL, a reinforcement learning method developed by Md Rizwan Islam and Aditya Thakur, enables Qwen3-4B, an open-source model with 4 billion parameters, to reach 92% accuracy on multi-step tool-calling tasks. GPT-5 reaches 94% on the same benchmark.

The difference in compute cost? A factor of 100. This is a paradigm shift for anyone building AI agents in production.


The key takeaways

  • 92% process accuracy in multi-turn tool-calling with Qwen3-4B, compared to 94% for GPT-5, according to the CacheRL paper.
  • 100x less compute than GPT-5 for a nearly identical result, thanks to cached rollouts and a hybrid reward.
  • Knowledge transfer from large models: the small model learns to use tools without ever accessing them directly during RL training.
  • Direct impact on self-hosting agents: a 4B model runs on most recent laptops, making reliable local agents practical for the first time.

| Tool | Main usage | Price (June 2026, check website) | Ideal for |
| --- | --- | --- | --- |
| Qwen3-4B | Base model for CacheRL | Free (Apache 2.0) | Lightweight agents, self-hosting |
| Ollama | Local LLM runtime | Free | Quick deployment of Qwen3-4B |
| LM Studio | GUI for local LLMs | Free (paid Pro version) | Developers who want to test CacheRL |
| Hostinger | VPS hosting for agents | Starting at 4.99 €/month | Deploying 24/7 agents in production |

What CacheRL really is — and what it isn't

CacheRL is not a new model. It is a training method applied to an existing model (Qwen3-4B). This distinction is crucial for understanding the scope of the result.

The fundamental problem CacheRL solves: how to train a small model to master multi-turn tool-calling via reinforcement learning, without it being a logistical and financial nightmare. The classic RL approach for agents requires real tool calls at every training step. It is slow, expensive, and unstable.

CacheRL bypasses these three problems simultaneously. The model trains on cached rollouts — hence the name — with a hybrid reward system that combines syntactic verification and semantic verification. The result is learning that is 100x more compute-efficient, as detailed in the discussion on CatalyzeX.


The 3 technical innovations that make the difference

Cached rollouts: RL without live tool calls

In a classic RL scheme for tool-calling, at each generation step, the model produces a tool call, the environment executes it, and the result is injected into the context. This process is repeated over thousands of training episodes. It is extremely slow and costly.

CacheRL pre-generates and caches the results of tool calls. During RL training, the model "replays" these sequences without ever calling the tool live. The advantage is twofold: training runs much faster, and the reward signal becomes deterministic (no variability from unstable APIs or network delays).
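To make the replay idea concrete, here is a minimal sketch of a tool-result cache. It is an illustration under assumptions, not the paper's implementation: the `ToolCache` class, its `record` and `replay` methods, and the `get_weather` tool are all hypothetical names.

```python
import json
from typing import Any, Callable, Dict, Tuple


class ToolCache:
    """Replay store for tool results (hypothetical helper, not from the paper)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], Any] = {}

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> Tuple[str, str]:
        # Canonical JSON so that argument order does not change the key.
        return tool, json.dumps(args, sort_keys=True)

    def record(self, tool: str, args: Dict[str, Any], run: Callable[..., Any]) -> Any:
        """Offline phase: execute the real tool once and store its result."""
        result = run(**args)
        self._store[self._key(tool, args)] = result
        return result

    def replay(self, tool: str, args: Dict[str, Any]) -> Any:
        """Training phase: return the stored result without calling the tool."""
        return self._store.get(self._key(tool, args))


calls_made = []


def get_weather(city: str, unit: str) -> str:
    # Stand-in for a slow or flaky external API (hypothetical tool).
    calls_made.append(city)
    return f"18 degrees {unit} in {city}"


cache = ToolCache()
cache.record("get_weather", {"city": "Paris", "unit": "C"}, get_weather)

# Same call with a different key order: served from the cache, no live call.
hit = cache.replay("get_weather", {"unit": "C", "city": "Paris"})
# A call that was never recorded: the lookup returns None.
miss = cache.replay("get_weather", {"city": "Lyon", "unit": "C"})
```

Because every lookup returns the same stored value, two training runs over the same episode see identical tool outputs, which is what makes the reward signal reproducible. How a real system should treat a cache miss is a design decision this sketch leaves open.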

Hybrid reward: punishing both form and content errors

A model can formulate a syntactically correct tool call (valid JSON, correct parameter names) but one that is semantically wrong (incorrect values, absurd logic). Most reward systems only check the form.

CacheRL combines two signals:
- Format reward: Is the JSON valid? Are the required parameters present?
- Execution reward: Would the result of the tool, if executed, lead to the correct final answer?

This double verification is what makes it possible to reach 92% accuracy rather than the usual 70-75% for small models in tool-calling.
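A rough sketch of how two such signals could be combined is shown here. Everything in it is an assumption made for illustration: the function names, the 0.25/0.75 weights, and above all the execution check, which simply compares the call to a reference call known to lead to the right answer rather than reproducing the paper's actual scoring.

```python
import json
from typing import Any, Dict, List


def _parse(raw_call: str) -> Any:
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(raw_call)
    except json.JSONDecodeError:
        return None


def format_reward(raw_call: str, required: List[str]) -> float:
    """1.0 if the call is a JSON object with a tool name and all required arguments."""
    call = _parse(raw_call)
    if not isinstance(call, dict) or not isinstance(call.get("arguments"), dict):
        return 0.0
    has_name = isinstance(call.get("name"), str)
    has_params = all(p in call["arguments"] for p in required)
    return 1.0 if has_name and has_params else 0.0


def execution_reward(raw_call: str, reference: Dict[str, Any]) -> float:
    """Assumed proxy: 1.0 only if the call equals the reference call."""
    return 1.0 if _parse(raw_call) == reference else 0.0


def hybrid_reward(raw_call: str, required: List[str], reference: Dict[str, Any],
                  w_format: float = 0.25, w_exec: float = 0.75) -> float:
    """Weighted sum of both signals (the weights are illustrative assumptions)."""
    return (w_format * format_reward(raw_call, required)
            + w_exec * execution_reward(raw_call, reference))


reference = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}
required = ["city", "unit"]

correct = json.dumps(reference)
wrong_value = json.dumps(
    {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "C"}}
)
broken = '{"name": "get_weather", "arguments": '

scores = [hybrid_reward(c, required, reference) for c in (correct, wrong_value, broken)]
print(scores)  # [1.0, 0.25, 0.0]
```

The middle case is the one a format-only reward would miss: the call is well-formed but asks about the wrong city, so it earns only the format share of the reward.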

Knowledge transfer from large models

This is perhaps the most elegant innovation. The base Qwen3-4B does not know how to do multi-turn tool-calling. But large models (GPT-5, Claude) do. CacheRL uses these large models to generate high-quality rollouts that will serve as training data for the small model.

Specifically, GPT-5 is asked to solve multi-tool tasks. Its trajectories are cached. Then, Qwen3-4B is trained to reproduce these behaviors via RL, with the hybrid reward correcting its drifts. This is behavioral distillation, not simple logit distillation. The Semantic Scholar page of the paper details this transfer mechanism.


Benchmarks: 92% vs 94%, but at what cost?

The key figure in the paper: 92% process accuracy on the multi-turn tool-calling benchmark. GPT-5 achieves 94% on the same benchmark. But a raw comparison of percentages masks the true revolution.

| Model | Process accuracy | Size | Relative compute | Estimated cost per million calls |
| --- | --- | --- | --- | --- |
| GPT-5 | 94% | ~1.8T (estimated) | 100x | ~$150 (June 2026, check openai.com) |
| Qwen3-4B + CacheRL | 92% | 4B | 1x | ~$0.50 (self-hosted) |
| DeepSeek V4 Pro (Max) | 88% | ~600B (estimated) | ~40x | ~$30 (June 2026, check deepseek.com) |
| Qwen3-4B (base, SFT fine-tuned) | ~72% | 4B | 1x | ~$0.50 (self-hosted) |

That is a gap of two percentage points against a 1-to-300 cost ratio (about $0.50 versus $150 in the table above). For a CTO deploying agents at scale, this calculation is hard to argue with. The two-point gap can largely be offset by fallback or retry mechanisms.

The discussion on AlphaXiv also points out that in real-world scenarios with automatic retry, the effective accuracy of CacheRL surpasses that of GPT-5 in single-shot, since the cost of a retry on a 4B model is negligible.
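The retry argument can be checked with a short calculation. This sketch rests on two simplifying assumptions that are not from the paper: each attempt succeeds independently with the single-shot accuracy, and a failed attempt is always detected so that a retry is triggered.

```python
def success_within(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


# Qwen3-4B + CacheRL with one retry (two attempts in total).
qwen_one_retry = success_within(0.92, 2)   # 1 - 0.08**2, about 0.9936
# GPT-5 in single-shot mode.
gpt5_single = success_within(0.94, 1)      # about 0.94

print(f"Qwen3-4B + CacheRL, 1 retry: {qwen_one_retry:.2%}")
print(f"GPT-5, single shot:          {gpt5_single:.2%}")
```

In practice failures are often correlated (the same ambiguous prompt tends to fail twice) and not always detectable, so the real gain from retries will be smaller than this upper bound.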


Concrete impact on AI agent development

Local agents become viable in production

Until now, the best LLM for AI agents was systematically a hosted proprietary model. The reason: no local model was reliable enough in multi-turn tool-calling to be used in production without constant human supervision.

CacheRL changes the game. Qwen3-4B requires about 3-4 GB of VRAM in 4-bit quantization. This runs on a MacBook Pro M2, on a basic VPS at Hostinger, or on most consumer graphics cards. You can check out our local LLM installation guide to set this up in 10 minutes.

A locally running agent means: zero network latency, zero API costs, zero data leakage to a third party. For companies dealing with sensitive data (healthcare, finance, legal), this is a massive selling point.

The MCP architecture becomes more relevant

The complete guide to MCP, Function Calling and Tool Use explains how the Model Context Protocol standardizes exchanges between models and tools. But MCP only makes sense if the model using it is reliable. A model that fails 30% of its tool calls makes MCP frustrating, not useful.

With CacheRL, a 4B model becomes reliable enough to take full advantage of MCP. This means the MCP tool ecosystem (databases, APIs, file systems) becomes accessible without a cloud budget. The best local LLMs have just gained a major skill.

The cost of agent pipelines collapses

Let's take a real-world case: a customer support agent that accesses a knowledge base, queries a CRM, and creates tickets. With GPT-5, each interaction costs about $0.15. With Qwen3-4B + CacheRL, the same call costs $0.0005. For 10,000 interactions per day, you go from $1,500/day to $5/day.
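The daily figures follow directly from the per-interaction costs quoted above. The sketch below only reproduces that arithmetic, using `Decimal` to avoid floating-point rounding on currency values; the variable names are illustrative.

```python
from decimal import Decimal

INTERACTIONS_PER_DAY = 10_000

# Per-interaction costs as quoted in the text.
cost_gpt5 = Decimal("0.15")
cost_qwen = Decimal("0.0005")

daily_gpt5 = INTERACTIONS_PER_DAY * cost_gpt5   # 1500 dollars per day
daily_qwen = INTERACTIONS_PER_DAY * cost_qwen   # 5 dollars per day
ratio = daily_gpt5 / daily_qwen                 # 300

print(f"GPT-5:              ${daily_gpt5:,.2f}/day")
print(f"Qwen3-4B + CacheRL: ${daily_qwen:,.2f}/day")
print(f"Cost ratio:         {ratio:.0f}x")
```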

This is not a marginal optimization. It is an order-of-magnitude change that makes previously unprofitable agent business models viable.


Qwen3-4B: why this specific model?

The choice of Qwen3-4B as the base model is not random. Its Hugging Face page reveals features that make it ideal for CacheRL:

  • 32,768 native context tokens, extendable to 131,072 via YaRN. Enough for multi-turn conversations with large tool results.
  • Thinking-native architecture (Qwen3-4B-Thinking): the model can produce chains of thought before generating the tool call, which significantly improves reasoning accuracy.
  • Apache 2.0 license: unrestricted commercial use, unlike some models in the Llama family.

The base model alone achieves modest scores in tool-calling. This is precisely the point of the paper: the CacheRL method transforms a decent model into an exceptional one on this specific task. It is a perfect illustration of the postulate "the model matters less than the training method".

For those who want to compare, our Claude, GPT, Gemini, Llama comparison provides an overview of the models available in June 2026.


What CacheRL implies for the AI race

The end of "bigger is always better" in tool-calling

The industry took it for granted that tool-calling reliability was an emergent property of model size. More parameters = more reasoning = better tool calls. CacheRL suggests this assumption is wrong, or at least that scaling up is an inefficient route to reliability.

Tool-calling is not a general reasoning problem. It's a structured pattern matching problem combined with localized reasoning. A small model, properly trained on the right patterns via RL, can excel on this specific dimension without having the general understanding of a GPT-5.

The discussion on PaperReading.club raises an interesting question: if CacheRL works for tool-calling, what other "specialized" skill of large models could be distilled in the same way?

Open-source models gain a competitive advantage

CacheRL is an open-source method applied to an open-source model. Anyone can replicate the training, adapt it to their specific tools, and deploy the result without paying a license. This is a massive structural advantage over proprietary models.

In the context of the monthly comparison of the best LLMs, this means that the "open-source" category will likely gain points quickly on the "agents and tool-calling" dimension, which is precisely the one that matters most in production.

Compute efficiency becomes the real metric

The CacheRL paper is part of a broader movement: shifting from "what is the best absolute score?" to "what is the best score per unit of compute?". This is the metric that matters for businesses. Gemini 3.5 Flash had already shown that a "fast" model could beat "premium" models on agent benchmarks while being 10x faster. CacheRL pushes the logic even further with a 100x factor.


Important limits and nuances

CacheRL is impressive, but it doesn't solve everything. A few essential nuances:

The 92% accuracy is measured on a specific benchmark. The paper uses a set of multi-tool tasks defined by the authors. In real-world scenarios with poorly documented tools, unstable APIs, or complex schemas, the figure will likely be lower.

Knowledge transfer depends on large models. To generate high-quality rollouts, you first need access to GPT-5 or equivalent. The cost of generating training data is not zero — it is simply paid once, then amortized over millions of inferences.

The model remains limited in general reasoning. Qwen3-4B + CacheRL excels at tool-calling, but this doesn't turn it into a generalist model. For tasks that require both tool-calling and in-depth reasoning (complex legal analysis, advanced mathematical reasoning), a large model is likely still superior. The best LLMs for coding remain large models for a reason.

CacheRL training itself requires skills. This is not a model you download and use as is. It is a method that must be applied to your own tool context. This requires RL infrastructure and ML engineering expertise.


❌ Common mistakes

Mistake 1: Confusing CacheRL with a model

CacheRL is not a downloadable model. It is a training method. You cannot "install CacheRL" like you install Ollama. You must either reproduce the training or wait for a model pre-trained with this method to be released by the community.

Mistake 2: Believing that 92% means "ready for production without a safety net"

92% accuracy means 8 errors out of 100 tool calls. In a 5-step agent pipeline, this results in approximately a 34% chance that AT LEAST one step will fail. You must always implement retry mechanisms, fallbacks, and result validation.
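The 34% figure comes from compounding the per-step accuracy. The sketch below reproduces it, assuming (as a simplification) that steps fail independently and that every step has the same 92% accuracy.

```python
def pipeline_failure_probability(step_accuracy: float, steps: int) -> float:
    """Probability that at least one step fails in a pipeline of `steps` calls."""
    return 1.0 - step_accuracy ** steps


p_fail_5 = pipeline_failure_probability(0.92, 5)
print(f"5 steps: {p_fail_5:.1%}")  # 34.1%

# Same calculation for other pipeline lengths, to show how fast errors compound.
for n in (1, 3, 5, 10):
    print(n, f"{pipeline_failure_probability(0.92, n):.1%}")
```

The longer the pipeline, the more each per-step error matters, which is why validation and retries belong at every step rather than only at the end.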

Mistake 3: Ignoring the cost of generating rollouts

The "100x less compute" compares the inference of Qwen3-4B to that of GPT-5. But the initial generation of rollouts via GPT-5 has a cost. For a use case with 3 tools, this cost is negligible. For an ecosystem with 200 tools, it becomes significant.

Mistake 4: Using the base Qwen3-4B expecting the same results

The vanilla Qwen3-4B model (without CacheRL) achieves much lower scores in tool-calling. The magic is in the training method, not in the model architecture. Don't expect a miracle by downloading the raw model from Hugging Face.


❓ Frequently Asked Questions

Does CacheRL replace GPT-5 for all agent use cases?

No. CacheRL specifically optimizes multi-turn tool-calling. For agents that also require deep reasoning, complex planning, or creativity, GPT-5 remains superior. CacheRL is ideal for execution-focused agents, whose job is to chain tool calls reliably.

Can CacheRL be applied to models other than Qwen3-4B?

Yes, in principle. The method is base-model agnostic. The authors chose Qwen3-4B to demonstrate the most impressive result (the smallest possible model), but the approach should work with other models of a similar size. For larger models like the best Ollama models, the relative gain would likely be smaller.

What hardware is required to deploy Qwen3-4B in production?

In 4-bit quantization (GGUF), Qwen3-4B requires about 3-4 GB of VRAM. A VPS with 8 GB of RAM is sufficient for moderate workloads. For high throughput, a GPU with 8+ GB of VRAM (RTX 3060 or equivalent) is recommended. Basic VPS hosting from a provider like Hostinger can be suitable for prototypes.
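A back-of-the-envelope estimate shows where a figure of this order comes from. The sketch counts only the weights, assuming exactly 4 billion parameters stored at exactly 4 bits each; real 4-bit formats also store quantization metadata, and the runtime needs extra memory for the KV cache and activations.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


weights_4bit = weight_memory_gb(4e9, 4)    # 2.0 GB
weights_fp16 = weight_memory_gb(4e9, 16)   # 8.0 GB

print(f"4-bit weights: {weights_4bit:.1f} GB")
print(f"FP16 weights:  {weights_fp16:.1f} GB")
```

The weights alone take about 2 GB at 4 bits, so a total of 3-4 GB once the context cache and runtime overhead are included is plausible. The overhead grows with context length, so long multi-turn sessions need more headroom.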

Are the results reproducible?

The paper provides the necessary methodological details. However, the quality of the generated rollouts (and therefore the final result) depends on the teacher model used and the quality of the tool definitions. The 92% is a benchmark result, not an absolute guarantee for any replication.

Does CacheRL work with tools in French?

Rollouts can be generated in any language, including French. However, the paper's benchmarks are in English. For French-language LLMs, it would be necessary to generate French-language rollouts, which adds a step but does not change the fundamental method.


✅ Conclusion

CacheRL demonstrates that a 4-billion parameter model can rival GPT-5 on multi-turn tool-calling, at a compute cost 100 times lower. The method combines cached rollouts, hybrid reward, and knowledge transfer to transform a decent model into a reliable agent. For developers and CTOs building AI agents, this means one thing: the barrier to entry for reliable self-hosted agents has just collapsed. If you want to explore models that could benefit from this approach, check out our comparison of the best open-source LLMs and start experimenting.