
Qwen3.7-Max: 35 hours of autonomy and 1,158 tool calls, the AI agent that pushes the boundaries of long-running execution

AI Agents 🟢 Beginner ⏱️ 13 min read 📅 2026-05-23

🔎 35 hours, 1,158 tool calls, zero crashes: is it real?

Alibaba has just unveiled Qwen3.7-Max, and the announced figures are dizzying: a model reportedly capable of sustaining an agentic session for 35 consecutive hours, chaining 1,158 tool calls without losing the thread. It would be the first time a player known for its open-source models has crossed this execution-longevity threshold.

The question is no longer whether AI agents can execute a complex task. It's whether they can do so over a stretch comparable to a human work week, without contextual drift or cumulative hallucination. Qwen3.7-Max claims to do it. It remains to be seen whether the demonstration holds up or whether it is a benchmark staged for buzz.


The essentials

  • Qwen3.7-Max is Alibaba Qwen's flagship model, oriented exclusively towards long-duration agentic execution with a 1 million token context window.
  • The reference demonstration: 35 hours of continuous session, 1,158 tool calls, on a multi-step software research and development scenario.
  • It positions itself directly against GPT-5.5 (agentic score 98.2), Claude Opus 4.7 Adaptive (94.3) and Gemini 3.1 Pro (87.3), but with a differentiating argument: temporal persistence.
  • The model is not in the classic open-source leaderboard (dominated by DeepSeek V4 Pro at 88 points), because it belongs to a distinct category: long-run agentic models.

| Tool | Main usage | Price (June 2025, check on site) | Ideal for |
|---|---|---|---|
| Hostinger | Hosting for deploying agents | Starting from 2.99 €/month | Deploying Qwen agents in production |
| Qwen3.7-Max (Alibaba Cloud API) | Long-running agent | Token-based pricing | Multi-hour agentic sessions |
| GPT-5.5 (OpenAI) | High-performance agent | ~60 $/month (ChatGPT Pro) | Short-term intensive agentic tasks |

What Qwen3.7-Max really brings

An agentic model that lasts 35 hours is a break from the current paradigm. Most AI agents collapse after 20 to 30 minutes of continuous execution. The context becomes saturated, the initial instructions become diluted, and the agent enters repetitive loops.

Qwen3.7-Max solves this problem thanks to an architecture designed for persistence. The 1M token window is not just a marketing number: it is coupled with a progressive contextual compression mechanism that allows the model to maintain coherence over very long sessions.

According to the Qwen3-Omni technical report, the Qwen3 family introduced advanced multimodal management mechanisms. Qwen3.7-Max inherits this foundation but specializes it for the sequential execution of tools, whereas previous models were optimized for dialogue or single-generation tasks.

The fundamental difference: this is not an LLM turned into an agent by an external framework. It is an LLM designed as an agent from the architecture up.


Agent-stack architecture: what changes under the hood

Qwen3.7-Max's agentic stack doesn't look like what you find at OpenAI or Anthropic. Alibaba opted for a native integration of tool-calling into the reasoning process, not as an overlay.

The model uses a variant of the mechanism described in the literature on the Gumbel-max trick for tool selection in stochastic contexts. Specifically, instead of deterministically choosing the most probable tool, the model evaluates the probability distribution over the available tools and samples from it in a way that avoids loops. This would explain the low repetition rate across the 1,158 calls.
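To make the sampling idea concrete, here is a minimal sketch of the standard Gumbel-max trick applied to tool selection. It is not Alibaba's implementation: the tool names, the `pick_tool` helper and the `repeat_penalty` on recently used tools are assumptions introduced purely for illustration.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Return an index drawn from softmax(logits) using the Gumbel-max trick."""
    best_index, best_value = 0, -math.inf
    for index, logit in enumerate(logits):
        u = max(rng.random(), 1e-12)            # keep u > 0 so log(u) is defined
        gumbel_noise = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        if logit + gumbel_noise > best_value:
            best_index, best_value = index, logit + gumbel_noise
    return best_index


def pick_tool(tool_logits, recent_tools, rng, repeat_penalty=2.0):
    """Sample a tool name; recently used tools are penalized (hypothetical anti-loop rule)."""
    names = list(tool_logits)
    adjusted = [
        tool_logits[name] - repeat_penalty * recent_tools.count(name)
        for name in names
    ]
    return names[gumbel_max_sample(adjusted, rng)]


# Hypothetical tools and scores, for illustration only.
rng = random.Random(42)
scores = {"search": 2.0, "read_file": 1.0, "run_tests": 0.0}
print(pick_tool(scores, recent_tools=["search", "search"], rng=rng))
```

Adding Gumbel noise to each score and taking the argmax is equivalent to sampling from the softmax distribution, so the most probable tool is usually chosen but not always. The penalty term is one simple way to make a tool that was just used less likely to be picked again.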

The architecture relies on three pillars:

Hierarchical memory: a two-level system where critical instructions remain in an uncompressed priority cache, while intermediate results are progressively summarized. This avoids the "lost in the middle" phenomenon that plagues long contexts.

Decomposed planning: Qwen3.7-Max doesn't plan 35 hours in advance. It breaks the work down into 15- to 30-minute micro-plans, with checkpoints where it reassesses progress against the initial objective.

Drift detection: an internal mechanism (not an external wrapper) that measures the gap between the current action and the initial plan. If the gap exceeds a threshold, the model triggers self-correction without human intervention.
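The three pillars above describe mechanisms internal to the model, so the following is only a conceptual sketch of two of them at the application level. The `HierarchicalMemory` class, the truncation-based `summarize` placeholder, and the word-overlap `drift_score` with its threshold are all assumptions made for illustration, not the model's actual design.

```python
from dataclasses import dataclass, field


def summarize(text: str, max_chars: int = 40) -> str:
    """Placeholder summarizer: truncation stands in for a model-written summary."""
    return text if len(text) <= max_chars else text[:max_chars].rstrip() + "..."


@dataclass
class HierarchicalMemory:
    pinned: list = field(default_factory=list)     # critical instructions, never compressed
    summaries: list = field(default_factory=list)  # compressed older results
    recent: list = field(default_factory=list)     # latest results, kept verbatim
    max_recent: int = 3

    def pin(self, instruction: str) -> None:
        self.pinned.append(instruction)

    def add_result(self, result: str) -> None:
        self.recent.append(result)
        # Once the verbatim buffer overflows, the oldest result is summarized.
        while len(self.recent) > self.max_recent:
            self.summaries.append(summarize(self.recent.pop(0)))

    def build_context(self) -> str:
        return "\n".join(self.pinned + self.summaries + self.recent)


def drift_score(goal: str, action: str) -> float:
    """Jaccard distance between word sets: 0.0 means identical, 1.0 means no overlap."""
    goal_words, action_words = set(goal.lower().split()), set(action.lower().split())
    if not goal_words or not action_words:
        return 1.0
    return 1.0 - len(goal_words & action_words) / len(goal_words | action_words)


def needs_correction(goal: str, action: str, threshold: float = 0.8) -> bool:
    """Flag an action whose distance from the goal exceeds the (hypothetical) threshold."""
    return drift_score(goal, action) > threshold


memory = HierarchicalMemory(max_recent=2)
memory.pin("Goal: audit the codebase for security vulnerabilities")
for step in ["scan results for module A", "scan results for module B", "patch draft for module A"]:
    memory.add_result(step)
print(memory.build_context())
```

A real system would replace the word-overlap measure with something semantic (embeddings, or a model-graded check); the point here is only the shape of the loop: pinned instructions stay intact, older results shrink, and each action is compared against the original objective.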

To understand how these principles translate in practice, creating your first autonomous AI agent offers a good conceptual starting point, even if Qwen3.7-Max pushes the logic much further than standard architectures.


35 hours and 1,158 tool calls: breaking down the demo

The mind-blowing demonstration was detailed by TechNode and analyzed in depth by AIMLAPI. The scenario: an agent must navigate a 200,000-line codebase, identify vulnerabilities, propose patches, test them, and write a comprehensive report.

The raw numbers:

| Metric | Value |
|---|---|
| Total duration | 35h02 |
| Total tool calls | 1,158 |
| Average calls/hour | 33.1 |
| Tool repetition rate | 2.3% |
| Tokens consumed | ~820,000 |
| Auto-triggered corrections | 47 |
| Human interventions | 0 |

What's impressive is the repetition rate. In a classic 30-minute agentic session, we often observe 8 to 12% redundant calls. At 2.3% over 35 hours, the anti-loop mechanism is clearly working.
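As a quick sanity check, the reported figures can be recomputed from each other. The short script below uses only the numbers quoted in this section; the 8 to 12% range is the "classic session" figure from the paragraph above, applied to the same call count purely for comparison.

```python
total_calls = 1158
duration_hours = 35 + 2 / 60          # 35h02
repetition_rate = 0.023               # 2.3%

calls_per_hour = total_calls / duration_hours
minutes_between_calls = 60 / calls_per_hour
redundant_calls = total_calls * repetition_rate

# Same call count at the 8-12% redundancy quoted for classic short sessions.
redundant_low = total_calls * 0.08
redundant_high = total_calls * 0.12

print(f"calls per hour:        {calls_per_hour:.1f}")
print(f"minutes between calls: {minutes_between_calls:.1f}")
print(f"redundant calls:       {redundant_calls:.0f}")
print(f"at 8-12% redundancy:   {redundant_low:.0f} to {redundant_high:.0f}")
```

In other words, the demo implies roughly one tool call every two minutes and about 27 redundant calls in total, where the 8 to 12% range would have produced somewhere around 93 to 139.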

However, Decrypt raises a valid point: the test scenario is linear. It's a sequential pipeline with clear steps. The real difficulty of long-duration agents isn't the linear aspect, it's the unexpected — backtracking, ambiguities in specifications, contradictions between sources.

Running for 35 hours on a linear scenario is a technical feat. Running for 35 hours through constant surprises has yet to be demonstrated.


Comparison with agentic competitors

Placing Qwen3.7-Max in the current landscape requires separating two dimensions: raw performance (benchmark scores) and persistence capability (reliable execution duration).

| Model | Agentic score | Context window | Reported persistence | Indicative price |
|---|---|---|---|---|
| GPT-5.5 | 98.2 | 256K tokens | ~2-4h reliable | ~$60/month |
| Claude Opus 4.7 Adaptive | 94.3 | 200K tokens | ~3-5h reliable | ~$100/month |
| Gemini 3.1 Pro | 87.3 | 1M tokens | ~1-2h reliable | ~$20/month |
| Kimi K2.6 (Self-host) | 88.1 | 128K tokens | ~2-3h reliable | Open source |
| Qwen3.7-Max | Unranked (new) | 1M tokens | 35h (demo) | TBD |

GPT-5.5 remains the king of raw performance on standardized agentic benchmarks (SWE-bench, AgentBench, TAU-bench). But these benchmarks measure sessions lasting from a few minutes to a few hours. None test persistence over 35 hours.

This is precisely Alibaba's bet: creating a new evaluation category. The Marktechpost report indicates that Qwen has released a new internal benchmark called "LongRun" to specifically measure this dimension.

For teams that need to automate a complete pipeline with an agent, this difference in persistence changes the game. An ETL + analysis + reporting pipeline that would take 4 hours with GPT-5.5 (restarting it 3 times along the way) could theoretically run in a single session with Qwen3.7-Max.


Qwen3.7-Max vs the Qwen3.6 family: the evolution

Qwen3.7-Max does not replace Qwen3.6. It is added as a specialized model at the top of the range. The Qwen3.6 family remains the relevant choice for dialogue, classic RAG, and generation tasks.

| Model | Specialization | Open-source score | Context |
|---|---|---|---|
| Qwen3.6-27B | Lightweight generalist | 74 | 128K |
| Qwen3.6-35B-A3B | MoE generalist | 67 | 128K |
| Qwen3.5-122B-A10B | Heavyweight generalist | 65 | 128K |
| Qwen3.5 397B | Flagship generalist | 64 | 128K |
| Qwen3.7-Max | Long-running agent | N/A | 1M |

The jump from 128K to 1M tokens is not trivial: it multiplies context capacity by roughly 8. But the real work is not in the context size; it's in managing that memory over time. A 1M token context filled in 2 hours is useless. A 1M token context that remains coherent over 35 hours is the claimed innovation.

The Qwen3-ASR and Qwen3-TTS reports show that the Qwen3 family is investing heavily in multimodal capabilities. Qwen3.7-Max inherits this foundation, which means it can theoretically process audio and voice inputs within its long agentic sessions — an asset for monitoring scenarios or prolonged meeting analysis.


1M token context: why size isn't enough

Everyone is talking about the 1M token window. But in reality, Gemini 3.1 Pro also offers 1M tokens, and it doesn't last 35 hours. Context size is a necessary condition, not a sufficient one.

The fundamental problem with long contexts in agentic execution is diluted attention. As the context fills up with tool results, logs, and computational intermediaries, the attention mechanism distributes its weights over more elements. The initial instructions proportionally receive less attention. This is mathematically inevitable with standard attention.
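The dilution argument can be illustrated with a deliberately simplified model. Assume a single attention head in which the k instruction tokens each receive the same score s and the remaining n − k tokens receive score 0; these equal scores are an assumption made for the sake of the example, since real attention scores vary per token and per head.

```latex
\begin{align}
a_i &= \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}
  && \text{(softmax attention weight of token } i\text{)} \\
A_{\text{instr}}(n) &= \sum_{i \in \text{instr}} a_i
  = \frac{k\,e^{s}}{k\,e^{s} + (n - k)}
  && \text{(total weight on the instructions)} \\
\lim_{n \to \infty} A_{\text{instr}}(n) &= 0
  && \text{(for fixed } k \text{ and bounded } s\text{)}
\end{align}
```

With k = 100 and s = 2 (so e^s ≈ 7.39), the instructions hold about 6.9% of the attention at n = 10,000 tokens and about 0.074% at n = 1,000,000. As long as the scores stay bounded, the share keeps shrinking as the context fills up.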

Qwen3.7-Max partially circumvents this with zone-based structured attention. Certain positions in the context are "protected" — they receive a baseline attention weight regardless of the context length. This is a similar approach in spirit to what is found in the literature on the MAX and MAXIMA experiments, where structural constraints preserve key signals despite ambient noise.

The lesson: don't confuse the context window with useful memory. 1M tokens of context does not mean 1M tokens of functional memory. Qwen3.7-Max claims about 60 to 70% useful memory at 35 hours, compared to 20 to 30% for unoptimized models.


Concrete scenarios where 35 hours change the game

The 35-hour session is not a gimmick if you map it to the right use cases. Here is where persistence becomes a real competitive advantage, not a benchmark figure.

Enterprise codebase security audit: an agent that scans, analyzes, and correlates vulnerabilities across a 500K-line repository, with back-and-forths between static analysis, dynamic testing, and report writing. A human would take a week. A classic agent would crash after 2 hours.

Assisted scientific research: going through 200 papers, extracting data, normalizing it, identifying contradictions, synthesizing. Each step requires tool calls (search, parsing, calculation). Consistency over 35 hours ensures that the final synthesis remains faithful to the findings from hour 1.

Legacy system migration: analyzing an existing system, mapping dependencies, generating the target code, testing compatibility, iterating. This type of project spans days for a human. An agent that can maintain context over 35 hours drastically reduces context loss between iterations.

For teams that want to dive deeper into the architecture of these systems, What is OpenClaw? sheds light on an adjacent ecosystem of autonomous agents that share this long-running philosophy.


Limitations and gray areas

Despite the impressive numbers, several points remain unclear and deserve to be treated with skepticism.

The actual cost: 820,000 tokens over 35 hours is reasonable in terms of density. But at what price per token? Alibaba has not released the pricing schedule for Qwen3.7-Max. If the pricing is aligned with GPT-5.5, a 35-hour session could cost several hundred dollars.

Reproducibility: the 35-hour demo was published by the Qwen team. No independent reproduction has been published at the time of writing. Until external teams validate these numbers, caution is advised.

The test scenario: as highlighted above, it is a linear pipeline. Existing agentic benchmarks (TAU-bench, SWE-bench) pit agents against non-deterministic environments. We are waiting for Qwen3.7-Max on these benchmarks.

Absence from leaderboards: Qwen3.7-Max does not appear in any open-source leaderboards (dominated by DeepSeek V4 Pro at 88 points) or agentic ones (GPT-5.5 at 98.2). This could mean two things: either the model is too new, or its scores on standardized benchmarks are not competitive and Alibaba prefers to communicate on the "duration" metric.
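The cost question raised above can be framed with a back-of-the-envelope sketch. Every number other than the 1,158 calls and ~820,000 tokens is a hypothetical placeholder: the price per million tokens is not Alibaba or OpenAI pricing, and the average context size is invented. The sketch also assumes a stateless API that bills the full context again on every call, which may not match how Qwen3.7-Max is actually billed.

```python
def estimate_cost(billed_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD for a given number of billed tokens."""
    return billed_tokens / 1_000_000 * usd_per_million_tokens


def billed_tokens_if_context_resent(calls: int, avg_context_tokens: int) -> int:
    """Billed input volume if every tool call re-sends the current context (assumption)."""
    return calls * avg_context_tokens


# Figures from the demo.
TOTAL_CALLS = 1158
TOKENS_CONSUMED = 820_000

# Hypothetical placeholders, NOT real pricing or measured values.
PRICE_PER_MILLION_USD = 5.0
AVG_CONTEXT_TOKENS = 100_000

lower_bound = estimate_cost(TOKENS_CONSUMED, PRICE_PER_MILLION_USD)
resent_volume = billed_tokens_if_context_resent(TOTAL_CALLS, AVG_CONTEXT_TOKENS)
upper_scenario = estimate_cost(resent_volume, PRICE_PER_MILLION_USD)

print(f"tokens billed once:        ${lower_bound:,.2f}")
print(f"context re-sent per call:  ${upper_scenario:,.2f}")
```

The gap between the two scenarios is the point: if tokens are billed once, the session is cheap at almost any plausible price, whereas re-billing the context on each of the 1,158 calls is the kind of mechanism that could push a session into the hundreds of dollars. Which scenario applies depends on a pricing schedule that has not been published.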


❌ Common mistakes

Mistake 1: Confusing 1M context and 1M execution

Just because a model accepts 1M tokens as input doesn't mean it can consistently execute tasks over 1M tokens. Gemini 3.1 Pro has a 1M context but collapses during long agentic execution. Check persistence metrics, not window size.

Mistake 2: Comparing Qwen3.7-Max to generalist models

Qwen3.7-Max is a specialized model. Comparing it to Qwen3.5 397B (score 64) or DeepSeek V4 Pro (88) on generation tasks makes no sense. It's like comparing a 40-ton truck to a Ferrari: they don't drive on the same roads.

Mistake 3: Deploying Qwen3.7-Max for short tasks

If your agentic task lasts 5 minutes, Qwen3.7-Max is likely overkill and more expensive than necessary. For short sessions, the best LLMs for AI agents remain GPT-5.5 or Claude Opus 4.7. Reserve Qwen3.7-Max for pipelines that exceed one hour.

Mistake 4: Ignoring the underlying infrastructure

35 hours of continuous execution also means 35 hours of stable connection, network error management, and server-side state persistence. The model doesn't do everything. If your infrastructure can't hold up for 35 hours, the best model in the world will be useless. Reliable hosting like Hostinger is a prerequisite, not a detail.
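On the infrastructure side, two of the requirements named above (network error management and state persistence) can be sketched in a few lines. This is a generic pattern, not something specific to Qwen3.7-Max or any hosting provider; `run_session`, the checkpoint layout and the `tool` callable are hypothetical names chosen for the example.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def call_with_retry(tool, argument, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry transient network failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return tool(argument)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def run_session(steps, tool, checkpoint_path, sleep=time.sleep):
    """Run steps in order, persisting progress so a restart resumes where it stopped."""
    state = load_checkpoint(checkpoint_path)
    for index in range(state["next_step"], len(steps)):
        state["results"].append(call_with_retry(tool, steps[index], sleep=sleep))
        state["next_step"] = index + 1
        save_checkpoint(checkpoint_path, state)
    return state["results"]
```

Checkpointing after every step costs a small write each time, but it means a dropped connection or a server restart at hour 30 loses one step rather than the whole session.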


❓ Frequently Asked Questions

Is Qwen3.7-Max open source?

No. Unlike the Qwen3.6 family, which is available as open source (and listed in the best Ollama models), Qwen3.7-Max is only accessible via the Alibaba Cloud API. No open source release announcement has been made.

Can Qwen3.7-Max be run locally?

Not currently. With a 1M token context and an architecture optimized for agentic workflows, the VRAM requirements far exceed what a consumer setup can offer. For local use, open source AI agents with Ollama remain the realistic path, with models like Qwen3.6-35B-A3B or DeepSeek V4 Flash.

Are the 35 hours a hard limit or an average?

It is the result of a specific demonstration, not a theoretical limit. The model could likely go further on a scenario with fewer tool calls. However, no one has yet tested the actual upper bound of degradation.

Does Qwen3.7-Max replace agent frameworks like OpenClaw?

No. Qwen3.7-Max is a model, not a framework. It can be used within agent frameworks. To understand the difference between the model and orchestration, see how to create an AI agent and the best autonomous AI agents.

What is the relationship between Qwen3.7-Max and the ASR/TTS capabilities of Qwen3?

Qwen3.7-Max inherits the multimodal foundation of the Qwen3 family, including advances in speech recognition (ASR) and synthesis (TTS) documented in technical reports. This means it can theoretically integrate audio inputs into its agentic sessions, but this capability was not demonstrated in the 35-hour demo.


✅ Conclusion

Qwen3.7-Max won't be the model you use to generate an email or summarize an article. It's a niche model — the niche of ultra-long agentic execution — but it's a niche that will become central as companies move from "agentic proof of concept" to "continuous agentic production". The 35 hours remain to be independently confirmed, but the hierarchical memory architecture and native tool-calling integration set a new standard that GPT-5.5 and Claude Opus 4.7 will have to address. If reproducibility is there, this isn't just another benchmark — it's a paradigm shift for the best autonomous AI agents.