LLM & Models 🟢 Beginner ⏱️ 15 min read 📅 2026-05-09

Best LLMs (May 2026) — Monthly Comparison

🔎 Why this comparison changes everything this month

May 2026 marks a turning point. Anthropic's Claude Mythos Preview simultaneously dominates the general and agentic rankings with unprecedented scores (99/100 and 100/100). OpenAI and Google respond with GPT-5.5 and Gemini 3.1 Pro, but the gap has widened.

The real novelty? "Adaptive" models like Claude Opus 4.7 (Adaptive) that adjust their computation in real-time based on task complexity. No more need to choose between fast and smart: the model decides for you.

Another strong signal: DeepSeek V4 Pro ranks 9th overall, confirming that Chinese models are no longer outsiders but serious competitors in terms of raw quality. The LMSYS Chatbot Arena remains the benchmark for validating these rankings through blind human voting.


The essentials

  • Claude Mythos Preview takes the absolute lead in the general (99) and agentic (100) rankings, an unprecedented double.
  • GPT-5.5 and Gemini 3.1 Pro share second place (91-92) but do not yet threaten Anthropic on complex tasks.
  • "Adaptive" models (Claude Opus 4.7 Adaptive) represent the new trend: the model calibrates its power on the fly.
  • DeepSeek V4 Pro (Max) enters the global top 10 (88), pushing European models out of the leading pack.
  • Usage cost remains the decisive criterion: without systematic benchmarking, you are likely paying 5 to 10x too much according to Karl Llorey.

| Model | Provider | Overall score | Agentic score | Best for |
|---|---|---|---|---|
| Claude Mythos Preview | Anthropic | 99 | 100 | Complex tasks, autonomous agents |
| Gemini 3.1 Pro | Google | 92 | 87.3 | Multimodal, document analysis |
| GPT-5.5 | OpenAI | 91 | 98.2 | AI agents, ecosystem integrations |
| GPT-5.4 Pro | OpenAI | 91 | 91.8 | Advanced reasoning, code |
| Claude Opus 4.7 (Adaptive) | Anthropic | 90 | 94.3 | Versatile use, cost-optimized |
| Gemini 3 Pro Deep Think | Google | 90 | 95.4 | Long reasoning, mathematics |
| Grok 4.1 | xAI | 90 | 79 | Real-time data, Twitter/X |
| GPT-5.4 | OpenAI | 89 | 87.6 | OpenAI value for money |
| DeepSeek V4 Pro (Max) | DeepSeek | 88 | 84 | Cost-effective premium alternative |
| Claude Opus 4.6 | Anthropic | 87 | 84.7 | Reliability, long-form writing |

To learn more, check out our monthly comparison of the best LLMs with pricing details and use cases.


Overall Ranking — The Top 15 Models

The hierarchy is clear: Anthropic holds the top spot, OpenAI and Google are battling for the rest of the podium, and DeepSeek is the surprise.

The Top 5: Anthropic Domination

Claude Mythos Preview leaves no room for doubt. With a 99/100, it leads Gemini 3.1 Pro by 7 points, a considerable gap at this scale. GPT-5.5 and GPT-5.4 Pro follow at 91, then Claude Opus 4.7 (Adaptive) rounds out the top 5 at 90.

What sets Mythos apart is its ability to maintain coherent reasoning on prompts of over 10,000 tokens. Where other models get lost or contradict themselves, Mythos keeps track. The EQ-Bench Longform Creative Writing benchmark confirms this on long-form creative writing.

Places 6 to 10: The Chasing Pack

Gemini 3 Pro Deep Think (90) and Grok 4.1 (90) offer very different profiles. The Google model excels in extended reasoning, while Grok shines on real-time data via the X ecosystem.

GPT-5.4 (89) remains the pragmatic choice within the OpenAI ecosystem. DeepSeek V4 Pro Max (88) is the revelation: a Chinese model that directly rivals Western offerings. Claude Opus 4.6 (87) remains solid but is starting to show its age compared to its adaptive successor.

Places 11 to 15: Specialized Models

GPT-5.3 Codex (87) is an interesting case: ranked 11th overall but designed for code. For development, check out our guide to the best LLMs for coding. DeepSeek V4 Pro High (84), Kimi K2.6 (84), Claude Sonnet 4.6 (83), and GLM-5.1 (83) complete this lineup.

Moonshot AI's Kimi K2.6 deserves attention for self-hosted agents: it scores 88.1 in agentic, a surprising result for a self-hostable model.


Agentic ranking — Models that act on their own

The agentic ranking measures an LLM's ability to plan, execute, and correct multi-step tasks autonomously. It has become the most important criterion for enterprises in 2026.

Mythos and GPT-5.5: the leading duo

Claude Mythos Preview (100) achieves a perfect score. In practice, this means it successfully completes all the agentic scenarios in the benchmark: web navigation, file manipulation, API calls, and error correction without human intervention.

GPT-5.5 (98.2) follows very closely. Its advantage: the OpenAI agent ecosystem (Operators, custom GPTs) which makes production deployment easier. For agent use cases, see our dedicated article on the best LLMs for AI agents.

The "Deep Think" models gain ground

Gemini 3 Pro Deep Think (95.4) takes 3rd place in the agentic ranking compared to 6th overall. Its extended reasoning mode allows it to better plan action sequences. Claude Opus 4.7 Adaptive (94.3) also benefits from this: the adaptive mode shines when it detects an agentic task and allocates more compute.

The self-host phenomenon

Kimi K2.6 in self-host reaches 88.1, and GLM-5 Reasoning (self-host) 82. These scores raise a key question: why pay for proprietary APIs when a self-hosted model reaches this level? The answer depends on your latency and confidentiality constraints. To explore this option further, check out our guide to the best LLMs to run locally.


Benchmarks: what they really measure (and what they hide)

All the figures above come from benchmarks. But which ones are reliable? LLM benchmarks suffer from well-documented systemic biases: training data contamination, over-optimization, and a lack of representativeness of real-world use cases.

LMSYS Chatbot Arena: the imperfect benchmark

The LMSYS Chatbot Arena remains the gold standard. The principle: two models respond to the same prompt, and a human votes blindly. The resulting Elo score is robust because it reflects a genuine preference.
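To make the pairwise voting principle concrete, here is a minimal sketch of a classic online Elo update applied to blind votes. It is an illustration only: the K-factor of 32, the starting rating of 1000, and the model names are assumptions, and the arena's actual statistical method may differ from this textbook formula.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind vote.

    k is an assumed K-factor; larger values make ratings move faster.
    """
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes: (model_a, model_b, did_a_win)
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [
    ("model_a", "model_b", True),
    ("model_a", "model_b", True),
    ("model_a", "model_b", False),
]

for a, b, a_wins in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_wins)

print(ratings)
```

Each vote moves the two ratings by the same amount in opposite directions, and an upset (the lower-rated model winning) moves them more than an expected result.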

Known limitation: the tested prompts are short and generic. A model that excels on 50-word tasks might collapse on a 20-page document.

Artificial Analysis and Kagi: technical comparisons

Artificial Analysis cross-references quality, latency, and cost per token. It's essential for architectural choices. The Kagi benchmark adds an independent perspective that is useful for verifying that a model isn't over-optimized for a single benchmark.

LocalScore: for local models

If you are testing models locally, LocalScore is an open-source benchmark that evaluates performance on your actual machine, not in the ideal conditions of a datacenter. Essential before deploying via Ollama.

What no benchmark measures

Consistency. A model can score 90 on a benchmark and still produce mediocre responses 30% of the time in production. The benchmark measures potential, not operational reliability. Hence the importance of testing on your data, not just on public test sets.

To understand the underlying metrics (tokens, context window, costs), our article on LLM billing covers it comprehensively.


Costs and optimization — Don't pay 10x too much

Quality does not justify just any price. GPT-5.5 is excellent, but if your use case is limited to document summaries, Claude Sonnet 4.6 (83 overall) will suffice for a fraction of the cost.

The "always the best model" trap

According to Karl Llorey's analysis, the majority of companies use a flagship model for tasks that only require a mid-tier model. Result: a bill multiplied by 5 to 10 with no measurable gain in quality.

The solution: automatically route requests. Simple task → Sonnet 4.6 or GPT-5.4. Complex task → Mythos Preview or GPT-5.5. Tools like LLM API Test allow you to measure the quality/cost ratio for each type of task.
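The routing idea can be sketched as a small function that scores a prompt's complexity and picks a tier. Everything here is an assumption made for illustration: the model identifiers are placeholders rather than official API names, and the keyword heuristic is deliberately naive (a production router would more likely use a classifier or measured quality per task type).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    tier: str
    model: str  # placeholder identifier, not an official API model name


MID_TIER = Route("mid", "claude-sonnet-4.6")
FLAGSHIP = Route("flagship", "claude-mythos-preview")

# Assumed hints that a request needs multi-step reasoning or agent behavior.
COMPLEX_HINTS = ("step by step", "plan", "agent", "refactor", "prove")


def estimate_complexity(prompt: str) -> int:
    """Crude score: +1 for a long prompt, +1 per hint found (substring match)."""
    lowered = prompt.lower()
    score = 1 if len(prompt.split()) > 300 else 0
    score += sum(1 for hint in COMPLEX_HINTS if hint in lowered)
    return score


def route(prompt: str, threshold: int = 1) -> Route:
    """Send the request to the flagship only when the score reaches the threshold."""
    return FLAGSHIP if estimate_complexity(prompt) >= threshold else MID_TIER


print(route("Summarize this email in two sentences."))
print(route("Plan the migration step by step, then act as an agent."))
```

The threshold is the knob to calibrate against your own quality/cost measurements: raising it sends more traffic to the cheaper tier.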

Inference backend: the other expense item

The choice of inference backend directly impacts the cost per request. The BentoML benchmark compares the performance of vLLM, TensorRT-LLM, TGI, and others. Throughput differences reach 2-3x depending on the model and hardware.

On the hardware side, the AMD MI300X vs NVIDIA H100 on Mixtral 8x7B comparison shows that AMD GPUs are becoming competitive for inference, paving the way for significant cost reductions if you deploy on-premise.

Indicative cost table (May 2026, verify on official website)

| Model | Input (per M tokens) | Output (per M tokens) | Max context |
|---|---|---|---|
| Claude Mythos Preview | ~$30 | ~$90 | 200K |
| GPT-5.5 | ~$25 | ~$75 | 256K |
| Gemini 3.1 Pro | ~$15 | ~$45 | 1M |
| Claude Opus 4.7 Adaptive | ~$10-30* | ~$30-90* | 200K |
| GPT-5.4 | ~$8 | ~$24 | 128K |
| Claude Sonnet 4.6 | ~$3 | ~$15 | 200K |
| DeepSeek V4 Pro Max | ~$2 | ~$8 | 128K |

*Variable price depending on the selected adaptive mode.
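To turn the indicative rates into a per-request figure, a short calculation is enough. The sketch below copies the approximate prices from the table above (Claude Opus 4.7 Adaptive is left out because its price is a range); the token counts in the example are assumptions chosen for illustration.

```python
# Indicative (input, output) prices in USD per million tokens, from the table above.
PRICES_PER_M = {
    "Claude Mythos Preview": (30.0, 90.0),
    "GPT-5.5": (25.0, 75.0),
    "Gemini 3.1 Pro": (15.0, 45.0),
    "GPT-5.4": (8.0, 24.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "DeepSeek V4 Pro Max": (2.0, 8.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, given token counts and per-million prices."""
    input_price, output_price = PRICES_PER_M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed workload: 10,000 input tokens and 1,000 output tokens per request.
for name in PRICES_PER_M:
    print(f"{name}: ${request_cost(name, 10_000, 1_000):.4f} per request")
```

Multiplying the per-request figure by your monthly request volume gives a first budget estimate to compare against real invoices.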

These rates evolve rapidly. For free options, our best free LLMs page is updated monthly.


French Models — Is Mistral Still Viable?

Mistral 3 was announced as the European answer: an open-source multimodal family (14B, 8B, 3B + Mistral Large 3 with 41B active / 675B total), Apache 2.0 license. On paper, it's ambitious.

In reality, Mistral 3's scores do not place it in the global top 15 of May 2026. The gap with Claude Mythos Preview or GPT-5.5 is significant. Mistral Medium 3.5, which powers the remote agents in Vibe, shows interesting agentic capabilities but not enough for the top 15.

Where Mistral Remains Relevant

Code. Devstral 2 and Codestral remain competitive code models, especially for self-hosting. The combination with Vibe CLI offers a complete developer workflow.

OCR. Mistral OCR 3 positions itself as a serious alternative for text extraction from scanned documents, a use case where the source language matters less than accuracy.

Local. Mistral 3's Apache 2.0 license allows deployment without constraints, unlike models from Anthropic or OpenAI. For sovereign architectures, this is a decisive argument.

Magistral: The New Challenger

Magistral is Mistral AI's latest addition. Early feedback is positive on reasoning tasks, but independent benchmarks are still lacking. One to watch.

For details on French-language options, check out our best LLMs in French page.


Multimodal — Beyond Text

The LLMs of May 2026 are no longer text-only models. The analysis of images, documents, and videos has become a major differentiating factor.

Vision: Which model for which use case?

Claude Mythos Preview and Gemini 3.1 Pro are the two best current vision models. Gemini benefits from native integration with Google Docs and Google Drive, which simplifies the analysis of complex documents. Claude excels with complex images (diagrams, tables, interfaces).

For reliable image analysis with LLMs, our guide on AI vision details testing protocols and pitfalls to avoid.

Multimodal Agents

The trend is toward agents that see and act. An agent can analyze a screenshot, identify a problem in an interface, and trigger a corrective action. Claude Mythos Preview (100 in agentic) and GPT-5.5 (98.2) are the only models currently capable of maintaining this level of reliability over long multimodal chains.

Avatars and Generation

AI avatars represent an emerging use case for multimodal models. If your need is focused on creating realistic avatars rather than analysis, check out our selection of the best AI tools to find the right solutions.


Research — LLMs that find, not those that make up

For factual research, the overall leaderboard is misleading. A model that excels at reasoning can hallucinate on specific facts. This is where specialized approaches take the lead.

Dedicated research models

Gemini 3.1 Pro benefits from direct access to Google Search, making it a formidable research tool. GPT-5.5 relies on OpenAI's web browsing. But for a structured research workflow, dedicated tools remain superior.

Our comparison of the best LLMs for research details options like Perplexity and NotebookLM that combine retrieval augmented generation (RAG) and source citation.

The role of research benchmarks

The Kagi benchmark includes factual accuracy metrics that general benchmarks ignore. The IEEE Spectrum analysis points out that the "helpfulness" metrics in popular benchmarks can penalize models that answer "I don't know" — precisely the most reliable ones for research.

In-house RAG: the prerequisites

If you are building your own RAG pipeline, the choice of the embedding (encoding) model and the choice of the generation LLM are two distinct decisions. DeepSeek V4 Pro and Claude Sonnet 4.6 offer excellent quality/cost ratios for generation on well-structured RAG contexts.


Practical choices — Which model for which profile?

Solo developer

For daily coding, two options: GPT-5.3 Codex (87 overall, code-optimized) or Devstral 2 from Mistral locally. If you use Vibe CLI, the Mistral ecosystem is consistent. Otherwise, GPT-5.5 remains the most versatile. Our best LLMs for coding page details IDE configurations.

SaaS Startup

Route your requests. Use Claude Sonnet 4.6 for the 80% of tasks that are simple (summaries, extraction, classification) and Mythos Preview for the 20% that are complex (multi-step reasoning, agents). The budget is divided by 3-4 with no loss of perceived quality. The LLM API Test benchmark will help you calibrate this routing.
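The "divided by 3-4" figure can be checked against the indicative prices listed earlier in this article. The sketch below assumes that simple and complex requests consume the same number of tokens on average, which will not hold exactly in practice, and it uses the approximate rates for Claude Sonnet 4.6 and Claude Mythos Preview.

```python
def blended_price(share_mid: float, mid_price: float, flagship_price: float) -> float:
    """Average price per million tokens when share_mid of traffic goes to the mid tier."""
    return share_mid * mid_price + (1.0 - share_mid) * flagship_price


# Indicative USD prices per million tokens (Sonnet 4.6 vs Mythos Preview).
SONNET_IN, SONNET_OUT = 3.0, 15.0
MYTHOS_IN, MYTHOS_OUT = 30.0, 90.0

input_blend = blended_price(0.8, SONNET_IN, MYTHOS_IN)     # about 8.4
output_blend = blended_price(0.8, SONNET_OUT, MYTHOS_OUT)  # about 30

# Savings factor compared with sending everything to the flagship.
input_saving = MYTHOS_IN / input_blend     # about 3.6
output_saving = MYTHOS_OUT / output_blend  # about 3.0

print(f"input: x{input_saving:.2f} cheaper, output: x{output_saving:.2f} cheaper")
```

Under these assumptions the saving lands between 3x and 3.6x, consistent with the 3-4 range quoted above.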

Enterprise / Sovereignty

Mistral 3 (Apache 2.0) or DeepSeek V4 Pro in self-host, on AMD MI300X infrastructure to reduce NVIDIA dependency. Kimi K2.6 self-host (88.1 in agentic) is a serious option for autonomous agents in a controlled environment. Consult The SOTA to track the evolution of open-source models.

Academic research

Gemini 3.1 Pro for Google access, or a custom RAG pipeline with a local model for data confidentiality. GLM-5.1 (83) deserves a test if your data is predominantly in Chinese or academic English.


❌ Common mistakes

Mistake 1: Choosing a model solely based on its overall score

A general score of 90 guarantees nothing for your specific use case. A model ranked 15th can outperform the 1st on a niche task. Solution: benchmark on 50-100 examples representative of your production, not on public datasets.
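A private benchmark of this kind can start as a very small harness. The sketch below is a minimal example under stated assumptions: the two `stub_*` functions stand in for real API calls, the two examples stand in for your 50-100 production samples, and exact match after normalization is only one possible scoring rule (many tasks need a rubric or a human review instead).

```python
from typing import Callable

Example = tuple[str, str]  # (prompt, expected answer)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def exact_match_rate(model_fn: Callable[[str], str], examples: list[Example]) -> float:
    """Share of examples where the model's answer matches the expected answer."""
    if not examples:
        raise ValueError("at least one example is required")
    hits = sum(
        1 for prompt, expected in examples
        if normalize(model_fn(prompt)) == normalize(expected)
    )
    return hits / len(examples)


# Hypothetical stand-ins for real model calls.
def stub_flagship(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


def stub_mid_tier(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "unknown")


EXAMPLES: list[Example] = [("2+2", "4"), ("capital of France", "paris")]

for name, fn in [("flagship", stub_flagship), ("mid-tier", stub_mid_tier)]:
    print(name, exact_match_rate(fn, EXAMPLES))
```

Running the same examples through every candidate model gives a ranking specific to your use case, which may differ from the public leaderboard.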

Mistake 2: Ignoring the context window cost

Claude Mythos Preview and Gemini 3.1 Pro offer 200K and 1M tokens of context respectively. But you pay for every token sent, including those that are useless. If you load a 100K token document to extract information on page 2, you pay for 100K tokens of input. Solution: smart chunking before sending to the LLM.
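As an illustration of chunking before sending, the sketch below splits a document into fixed-size chunks and keeps only those that share words with the question. The word-overlap scoring and the chunk size are assumptions chosen for simplicity; real pipelines usually rely on embeddings or a search index instead.

```python
def split_into_chunks(text: str, max_words: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def select_relevant_chunks(chunks: list[str], query: str, top_k: int = 3) -> list[str]:
    """Keep up to top_k chunks sharing words with the query, in document order."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(chunk.lower().split())), index)
        for index, chunk in enumerate(chunks)
    ]
    # Highest overlap first; ties are broken by position in the document.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:top_k]
    return [chunks[index] for overlap, index in sorted(best, key=lambda item: item[1]) if overlap > 0]


# Hypothetical document: one useful sentence surrounded by irrelevant text.
document = "intro " * 50 + "the refund policy lasts thirty days " + "filler " * 50
chunks = split_into_chunks(document, max_words=10)
selected = select_relevant_chunks(chunks, "refund policy")

sent_words = sum(len(chunk.split()) for chunk in selected)
print(f"sending {sent_words} words instead of {len(document.split())}")
```

Only the selected chunks are sent to the model, so the input bill scales with the relevant part of the document rather than its full length.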

Mistake 3: Using an agentic model like a chatbot

Claude Mythos Preview (100 in agentic) is designed to plan and execute autonomous tasks. Using it for simple questions is like paying for a Ferrari to go to the corner of the street. Solution: reserve agentic models for multi-step workflows, use a mid-tier model for the rest.

Mistake 4: Comparing incomparable benchmarks

The LMSYS score and the Kagi score do not measure the same thing. One reflects human preference, the other factual accuracy. Citing both without context is misleading. Solution: choose a benchmark aligned with your priority (user preference vs. accuracy) and stick to that one.

Mistake 5: Neglecting latency

A model that is 5% better but 3x slower can degrade the user experience. For real-time applications (chat, autocomplete), P99 latency matters more than the benchmark score. Solution: use Artificial Analysis to cross-reference quality and latency.
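To see why P99 matters more than the average, the sketch below computes percentiles with the nearest-rank method on a set of latency samples. The sample values are invented for illustration; in practice they would come from your own request logs.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [200.0] * 98 + [3000.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f} ms, p50={p50_ms:.0f} ms, p99={p99_ms:.0f} ms")
```

In this invented sample the mean and median look healthy while the P99 reveals that some users wait several seconds, which is the effect a benchmark score alone does not show.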


❓ Frequently Asked Questions

Is Claude Mythos Preview really a 100 in agentic?

The perfect score means the model passed 100% of the scenarios in the agentic benchmark used. In practice, your mileage will vary depending on the complexity of your workflows. Expect an 85-95% success rate on unoptimized real-world tasks.

Is GPT-5.5 better than Gemini 3.1 Pro?

Overall: no, but only just (91 for GPT-5.5 vs 92 for Gemini 3.1 Pro, almost identical). In agentic: clearly yes (98.2 vs 87.3). In multimodal and long context: Gemini wins (1M tokens, Google integration). The "best" depends on your primary use case.

Is DeepSeek V4 Pro reliable for production?

The overall score of 88 is solid. However, the documentation, SLA, and tool ecosystem still lag behind OpenAI or Anthropic. For prototyping or self-hosting: yes. For critical customer service: wait for more production feedback.

Are "adaptive" models like Claude Opus 4.7 really more cost-effective?

Yes, in theory. The model allocates less compute on simple tasks, resulting in reduced costs. In practice, adaptive billing is complex to predict. Test it on your actual workload before relying on it for budgeting.

Is Mistral dead?

Not at all. Mistral remains relevant in code (Devstral 2, Codestral), OCR (Mistral OCR 3), and sovereignty (Apache 2.0). What is dead is the idea that Mistral could compete on the overall leaderboard with Anthropic and OpenAI. It's a specialized tool, not a generalist.

What is the best free LLM in May 2026?

The best free options are detailed in our guide to the best free LLMs. In short: Gemini 3.1 Pro (free with a Google quota), Mistral's Le Chat, and local models via Ollama.


✅ Conclusion

Claude Mythos Preview totally dominates the LLM landscape of May 2026, but the best model for you depends on your use case, your budget, and your latency constraints. The real best practice in 2026 is no longer to choose a model: it's to intelligently route between several based on the complexity of each request. To refine your choice, check out our comprehensive monthly comparison updated at the beginning of each month.