Gemini 3.5 Flash : the fast model that beats Opus 4.7 and GPT-5.5 on agent benchmarks — 289 tokens/second

LLM & Modèles 🟢 Beginner ⏱️ 11 min read 📅 2026-05-20

Gemini 3.5 Flash: the fast model that beats Opus 4.7 and GPT-5.5 on agent benchmarks — 289 tokens/second

🔎 Google just made the distinction between "fast" and "smart" models obsolete

On May 19, 2026, at Google I/O, Google announced the general availability of Gemini 3.5 Flash. A model classified in the "Flash" family, therefore supposed to be lightweight and fast. Except it beats Claude Opus 4.7 and GPT-5.5 on agent benchmarks.

This is a strong signal. For two years, the market accepted a trade-off: fast but average, or slow but excellent. Gemini 3.5 Flash breaks this logic by offering frontier scores at a fraction of the price and at 289 tokens/second.

The impact is immediate for agent developers. A model that thinks by default, that handles 1 million tokens of context, and that costs $1.50 per million input tokens. The economic argument is no longer a secondary one. It becomes the main reason to migrate.

The essentials

Gemini 3.5 Flash reaches 289 tokens/second, or about 4x the speed of comparable frontier models (GPT-5.5, Claude Opus 4.7).
It beats these same models on MCP Atlas (83.6%) and Terminal-Bench 2.1 (76.2%), two reference benchmarks for autonomous agents.
Thinking-on-by-default is activated natively, without additional configuration.
The context goes up to 1 million tokens, consistent with Google's strategy since Gemini 1.5 Pro.
Price: $1.50 / $9 per million tokens (input/output), a massive reduction compared to flagships.
Google's Antigravity CLI tool runs natively on it, cementing the model's agent-first positioning.

Recommended tools

The numbers that matter: benchmarks and speed

Gemini 3.5 Flash isn't just fast. It dominates on agent metrics where flagships were supposed to be untouchable.

On MCP Atlas, the model reaches 83.6%, ahead of GPT-5.5 (98.2 on the overall agentic score but lower on this specific tool-calling benchmark) and Claude Opus 4.7 (94.3 in overall agentic). On Terminal-Bench 2.1, it reaches 76.2%. Both of these benchmarks measure a model's ability to plan, call tools, and execute tasks in a real-world environment. Not multiple-choice questions.

The speed of 289 tokens/second confirms this positioning. According to data compiled by WaveSpeed, this represents about 4x the throughput of GPT-5.5 and Claude Opus 4.7 on comparable tasks. For an agent that needs to iterate quickly (call an API, analyze the response, adjust), latency is often the bottleneck, not reasoning.

The 1-million-token context completes the picture. An agent can ingest an entire code repository, complete documentation, or a long conversation history without intermediate summarization. This is a structural advantage that Google has maintained since the 1.5 generation.

To put these performances into the broader landscape, you can check out our comparison Claude, GPT, Gemini, Llama: which model to choose in 2026? which details the aggregated scores of each family.

Thinking-on-by-default : why it's a paradigm shift

Until now, "thinking" (extended reasoning like chain-of-thought) was an option that was manually enabled. Extra cost, extra latency, specific configuration. Gemini 3.5 Flash reverses this logic.

The model thinks by default. No need to pass a parameter or switch between a "normal" mode and a "reasoning" mode. The model decides for itself whether it should extend its thinking based on the complexity of the task. It's subtle but crucial for agents: a simple task shouldn't trigger 10 seconds of internal thinking, while a complex task shouldn't be processed in fast mode.

This approach echoes what Anthropic tried with Claude Opus 4.7's "Adaptive" mode, but Google positions it as the default behavior of a Flash model. The speed/intelligence tradeoff is managed internally, not by the developer.

According to Ars Technica, Google explicitly describes Gemini 3.5 Flash as "agent-optimized". It's not a generalist model that was optimized after the fact for agents. It's a model designed from the architecture up for the agent use case.

The economic argument: $1.50 per million tokens

This is perhaps the most unsettling figure for the competition. Gemini 3.5 Flash costs $1.50 for input and $9 for output per million tokens.

To put this into perspective, frontier models like GPT-5.5 and Claude Opus 4.7 generally range between $10 and $30 for input for the same volume. We are talking about a 10x to 20x differential on input cost, for equivalent or superior performance on agent benchmarks.

This changes the math of agent deployment. An agent running in a loop, calling tools, generating internal logs, consumes a massive amount of input tokens (the context fills up with each iteration). On a classic frontier model, the bill explodes. On Gemini 3.5 Flash, it remains contained.

For a startup deploying 10,000 agents in parallel, the difference amounts to tens of thousands of dollars per month. This is not a minor detail. It is a survival factor.

If you are evaluating costs more broadly, our article on Claude 4 vs GPT-5 vs Gemini 3: the honest comparison nobody makes details the pricing tiers of each provider.

Antigravity CLI : the native tool for Gemini agents

Google didn't just release a model. It delivered Antigravity CLI, a command-line tool designed to run natively on Gemini 3.5 Flash. The idea is simple: an equivalent to Claude Code or Codex CLI, but optimized for the Google stack.

The tool integrates directly into development workflows. It can read a repository, understand the structure, execute commands, and iterate. The fact that it runs on Flash rather than a Pro or Ultra model is a deliberate choice: Google is betting on iteration speed rather than step-by-step reasoning depth.

This is an interesting signal. At the same time, the open-source community is exploring similar approaches. The claude-code-forge project, for example, allows you to run Claude Code with any LLM. The ecosystem is moving toward a decoupling of the agent interface and the underlying model.

For developers looking for the best model for this type of tool, our guide to the meilleurs LLM pour coder compares performance on concrete development tasks.

Why a "Flash" model rivals flagships

The central question is this: how can a model from the Flash family, historically positioned as "mid-tier", beat frontier models on agent benchmarks?

Two factors explain this result. The first is specialization. Google didn't try to make a model that excels at everything. It optimized the architecture specifically for agent usage patterns: structured tool calls, multi-step planning, long context management. The overall agentic score (SWE-bench, pure code generation) may be lower than GPT-5.5. But on real agent tasks, Flash wins.

The second factor is inference efficiency. Advances in distillation, quantization, and attention architecture now make it possible to achieve frontier performance with a significantly smaller model. The Pareto curve between model size and performance has flattened since 2024.

According to Apidog's analysis, Gemini 3.5 Flash uses a hybrid attention architecture that reduces computational complexity on long sequences while maintaining quality on critical tokens. It's a smart technical compromise: not processing every token with the same depth, but concentrating the computational budget where it matters.

The trade-offs: what Flash doesn't do (yet)

Despite the impressive numbers, we need to be precise. Gemini 3.5 Flash beats the flagships on specific agent benchmarks. It does not beat them on all metrics.

On the general agentic scores compiled in June 2025, GPT-5.5 remains in the lead with 98.2, followed by Gemini 3 Pro Deep Think at 95.4 and Claude Opus 4.7 at 94.3. These scores measure a broader spectrum of capabilities. Flash excels at tool and terminal tasks, which is exactly what an agent needs in production. But on pure reasoning, complex synthesis, or creativity, frontier models retain an advantage.

The nuance is important. An agent in production spends 90% of its time calling tools, parsing responses, and adjusting its plan. On those 90%, Flash is better or equal. On the remaining 10% (deep reasoning, edge cases), a frontier model can make a difference. The question is whether this differential justifies the extra cost.

Another limitation: the ecosystem. OpenAI and Anthropic have a head start in terms of third-party integrations, SDKs, and communities. Gemini 3.5 Flash must catch up on the developer side, even if the Google API is mature.

Impact on the LLM market: the price war is on

Gemini 3.5 Flash is not just a product. It's a declaration of price war. By offering frontier performance at a mid-tier price, Google is forcing the hand of OpenAI and Anthropic.

The dynamic is clear: if a developer can get 90% of Opus 4.7's performance for 10% of the price, they migrate. Not because Flash is better everywhere, but because the value for money is unfavorable to flagships for agent use cases.

We can expect OpenAI to react by lowering the price of GPT-5.4 (87.6 in agentic) or by accelerating the release of more efficient models. Anthropic might play the specialization card with Claude Sonnet 4.6 (81.4), already positioned as the mid-range of the Claude lineup, but it doesn't have the same score on agent benchmarks.

The real risk for Google is retention. A cheap model attracts, but retention is built on the ecosystem, reliability, and support. The Gemini API availability incident in March 2026 remains fresh in the minds of DevOps teams.

To follow the monthly evolution of these dynamics, our comparison of the best LLMs May 2026 is updated every month with fresh scores and pricing.

What it means for agent developers

If you're building agents today, Gemini 3.5 Flash changes the math. Here are the concrete implications.

First, the cost per agent task drops drastically. An agent that cost $0.50 per task on a frontier model can drop to $0.05 on Flash. If your margin depends on volume, it's transformative.

Second, latency enables new patterns. At 289 tokens/second, an agent can iterate 4x faster on a problem. Less waiting for the user, more correction cycles within the same time budget.

Third, thinking-on-by-default simplifies the architecture. No more need to route between a fast model and a slow model based on complexity. Flash handles that internally.

The tradeoff: you lock yourself into the Google ecosystem. The Gemini API, the response format, the native tools. Portability to another provider isn't impossible (APIs are standardized), but Flash-specific optimizations don't transfer.

❌ Common mistakes

Mistake 1: Confusing agent score with general intelligence

Using the MCP Atlas score to claim that Flash is "smarter" than GPT-5.5 in absolute terms. This is wrong. Flash is better at the specific agent tasks measured by this benchmark. On pure reasoning or creative generation, frontier models remain superior. The benchmark is domain-specific, not a general IQ.

Mistake 2: Ignoring output cost

Focusing on the $1.50 input and forgetting the $9 output. Agents generate a lot of output tokens (reasoning logs, detailed plans, intermediate summaries). Output cost can account for 70 to 80% of the total bill. Do the full math before migrating.

Mistake 3: Deploying to production without regression testing

Benchmarks are benchmarks. Your actual use case may reveal weaknesses that MCP Atlas does not measure. Test Flash on your specific tasks before switching an agent to production. A blind test like the one conducted by guilamu on 14 LLMs for a WordPress plugin shows that surprises are frequent.

❓ Frequently Asked Questions

Does Gemini 3.5 Flash replace GPT-5.5 for all use cases?

No. Flash excels at agentic tasks (tool calls, rapid iterations, long context). For pure reasoning, complex synthesis, or creative tasks, GPT-5.5 (98.2 in general agentic) remains relevant. The choice depends on the use case, not the raw ranking.

Does thinking-on-by-default consume more tokens?

Yes, but the overconsumption is managed internally. The model dynamically decides whether it needs to extend its thinking. On a simple task, the extra cost is negligible. On a complex task, it is comparable to a classic call to a reasoning model.

Can Gemini 3.5 Flash be used locally?

Not at the moment. The model is offered via the Google API. For local use, we will have to wait for a potential open-source release or a distilled equivalent. In the meantime, local options remain Llama and derivatives.

How does Antigravity CLI compare to Claude Code?

Antigravity CLI is optimized for iteration speed thanks to Flash. Claude Code relies on Claude models (Opus or Sonnet) and excels at in-depth reasoning. For complex refactoring, Claude Code maintains the advantage. For rapid iterative development, Antigravity is competitive.

✅ Conclusion

Gemini 3.5 Flash is the model the agentic market was waiting for: frontier performance, 4x faster speed, and a price that makes large-scale deployment economically viable. The fast/intelligent distinction is officially dead. It now remains for Google to prove its reliability in production.

#ia-generative #gpt-5-5 #claude-opus-4-7 #gemini-3-5-flash #benchmarks-agents #google-io

📚 Related articles

LLM & Modèles 🟢 Débutant 4 min

ICML 2026 Seoul: 6,500+ papers accepted, ML enters the agentic era — key takeaways

Explore AI trends at ICML 2026 Seoul: over 6,500 accepted papers and the agentic era in machine learning.

2026-07-04 16:00

LLM & Modèles 🟢 Débutant 12 min

Claude Sonnet 5: Anthropic's most agentic model, Opus performance at Sonnet price

2026-07-01 15:02

LLM & Modèles 🟢 Débutant 12 min

OpenAI GPT-5.6: Sol, Terra et Luna — the model family that changes everything

Discover OpenAI GPT-5.6: Sol, Terra and Luna, the revolutionary model family under direct government control from June 26, 2026.

2026-06-29 15:03

📑 Table of contents