FrontierCode: Cognition's benchmark that buries SWE-Bench and ranks code agents by the real quality of pull requests — Fable 5 at 46.3%, Opus 4.8 at 34.3%, GPT-5.5 at 25.5%

LLMs & Models 🟢 Beginner ⏱️ 15 min read 📅 2026-06-26

🔎 SWE-Bench is dead, and Cognition just drove the final nail into its coffin

For two years, SWE-Bench reigned supreme over the ranking of code agents. Every week, a new model announced it had beaten the previous one on this benchmark. Except that no one was checking whether the generated code was actually usable in production.

On June 8, 2026, Cognition (the creators of Devin) published FrontierCode, a benchmark of 150 tasks that measures not just functional correctness but real mergeability. The verdict is unequivocal: scores collapse, rankings are reshuffled, and SWE-Bench goes from "benchmark" to "statistical decoy".

This is a turning point, and the market context explains why it matters: the AI coding market is worth $9.3 billion in 2026 (BuildFastWithAI, June 26, 2026), and companies no longer tolerate PRs that pass tests but degrade the codebase.


The Essentials

  • FrontierCode evaluates 5 real quality criteria: correctness, test quality, scope discipline, style, adherence to repo standards — not just "tests pass".
  • Fable 5 dominates at 46.3% on the Main set (100 tasks), but remains behind a paywall and export controls, making it unusable for the majority of teams.
  • Opus 4.8 reaches 34.3% Main and only 13.4% on Diamond (the 50 hardest tasks), showing that even the best public model is far from production level.
  • GPT-5.5 caps at 25.5% Main and 6.3% Diamond, with a 21-point gap with Fable 5 — the largest gap ever observed on a public coding benchmark.
  • 81% fewer false positives than SWE-Bench Pro thanks to innovative grading techniques (Reverse-Classical, Code Scope, Adaptive Classical).
  • The AI coding market is growing at 26% per year, Claude Code holds ~40% market share, and Anthropic has become profitable ($559M operating profit in Q2 2026, $47B annualized revenue run-rate) thanks to its focus on code.

| Tool | Main usage | Price (June 2026, check on site) | Ideal for |
|---|---|---|---|
| Claude Code | Agentic coding agent | Pro/Enterprise subscription | Pro teams, highest quality |
| Cursor | Integrated AI IDE | Starting at $20/month | Individual developers |
| Devin | Autonomous full-stack agent | Enterprise | Complex multi-file tasks |
| Codex CLI | OpenAI terminal agent | Free (API) | Scripts and automation |

What FrontierCode really is — and why SWE-Bench is no longer enough

FrontierCode is a benchmark created by Cognition in collaboration with over 20 open-source maintainers from 36 flagship repos. Each task requires a minimum of 40 hours of human work. There are 150 tasks in total, divided into three tiers of difficulty.

The fundamental difference with SWE-Bench? FrontierCode doesn't ask "does the code solve the problem?" but "would you merge this PR?"

This is a complete paradigm shift. As highlighted by Artificial Analysis with DeepSWE, SWE-Bench had accumulated methodological flaws: overly detailed prompts that gave away the solution, tests that didn't check for regressions, and above all, no code quality criteria.

FrontierCode fixes all of this with five grading dimensions:

  1. Correctness: does the code do what is asked?
  2. Test quality: are the added tests meaningful and do they cover edge cases?
  3. Scope discipline: does the PR stay within its bounds, without modifying off-topic files?
  4. Style: does the code follow the repo's conventions?
  5. Adherence to standards: does the code follow existing architectural patterns?

According to StartupHub.ai, FrontierCode's prompts are three times shorter than those of SWE-Bench Pro, which eliminates the "prompt that gives away the answer" bias. A striking example: on a C++ task from the jsonschema repo, Opus 4.8 produces functionally equivalent but idiomatically incorrect code. SWE-Bench validates it, FrontierCode rejects it.
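To make the contrast concrete, here is a minimal Python sketch of a five-dimension verdict. The `RubricResult` class and the "every dimension must pass" rule are assumptions made for illustration; how Cognition actually weights or combines the dimensions is not described here.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class RubricResult:
    """Hypothetical per-PR verdict on the five dimensions listed above."""
    correctness: bool
    test_quality: bool
    scope_discipline: bool
    style: bool
    repo_standards: bool

    def tests_only_verdict(self) -> bool:
        # A tests-only grader looks at functional correctness and nothing else.
        return self.correctness

    def mergeable(self) -> bool:
        # Assumption: a PR is mergeable only if every dimension passes.
        return all(getattr(self, f.name) for f in fields(self))

    def failed_dimensions(self) -> List[str]:
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative case modeled on the jsonschema example: correct but unidiomatic.
# The values of the other dimensions are assumed, not taken from the benchmark.
pr = RubricResult(
    correctness=True,
    test_quality=True,
    scope_discipline=True,
    style=False,
    repo_standards=True,
)

print(pr.tests_only_verdict())  # the tests-only view accepts the PR
print(pr.mergeable())           # the five-dimension view rejects it
print(pr.failed_dimensions())
```

The point of the sketch is only that a single failed dimension is invisible to a tests-only grader but decisive for a reviewer.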


The results: a ranking that shakes up the market

Diamond Results (50 hardest tasks)

The Diamond results are the real stress test. No publicly available model exceeds 15%; only Fable 5, at an estimated ~23%, goes higher.

| Model | Diamond score | Main score | Tokens used / notes |
|---|---|---|---|
| Fable 5 | ~23% (estimated) | 46.3% | N/A (proprietary) |
| Claude Opus 4.8 | 13.4% | 34.3% | High baseline |
| GPT-5.5 | 6.3% | 25.5% | ~4x less than Opus 4.8 |
| Gemini 3.1 Pro | 4.7% | N/A | N/A |
| Kimi K2.6 | 3.8% | 16.0% | Open-source |

Source: Cognition Blog, June 8, 2026 and BuildFastWithAI, June 26, 2026

Extended Results (150 tasks)

The Extended set offers a more complete view but smooths out the difficulty. Opus 4.8 reaches 51.8% on the full 150 tasks, which seems respectable until you look at Diamond: 13.4% means that on the most complex tasks, the model produces non-mergeable code roughly 87 times out of 100.

The 21-point gap between Fable 5 (46.3% Main) and GPT-5.5 (25.5% Main) is the largest gap ever measured on a public coding benchmark. It suggests that production code generation does not improve linearly: there is a qualitative wall that most models do not break through.

The discussion on Hacker News sums up the situation well: "3,000 threads about code quality, and this is the first benchmark that actually measures whether the code would be merged." On r/mlscaling, researchers praise "the strongest available signal on a model's ability to write maintainable code."


The three grading innovations killing SWE-Bench

Reverse-Classical Grading

In classical SWE-Bench, existing tests are run on the modified code. If everything passes, it's good. The problem: an agent can simply delete the failing tests or modify the code to pass the tests without actually solving the issue.

Reverse-Classical does the opposite: AI-generated tests must fail on the original code (before modification). If the tests pass on the original code, it means they aren't testing anything relevant. It's a minimal check, but devastatingly effective against empty tests.
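The check described above can be sketched in a few lines of Python. This is a toy illustration, not Cognition's harness: the `reverse_classical` function, the verdict labels, and the comma-counting bug are all invented for the example.

```python
from typing import Callable, Dict, List

Implementation = Callable[[str], int]
Check = Callable[[Implementation], None]  # a check raises on failure


def passes(check: Check, impl: Implementation) -> bool:
    """Run one check against one implementation and report whether it passed."""
    try:
        check(impl)
    except Exception:
        return False
    return True


def reverse_classical(
    checks: List[Check], original: Implementation, patched: Implementation
) -> Dict[str, str]:
    """Label each check by how it behaves before and after the fix."""
    verdicts = {}
    for check in checks:
        if passes(check, original):
            verdicts[check.__name__] = "vacuous"     # does not detect the bug
        elif passes(check, patched):
            verdicts[check.__name__] = "meaningful"  # fails before, passes after
        else:
            verdicts[check.__name__] = "broken"      # fails on both versions
    return verdicts


# Toy bug: counting comma-separated items by counting commas (off by one).
def count_items_original(text: str) -> int:
    return text.count(",")


def count_items_patched(text: str) -> int:
    return len(text.split(",")) if text else 0


def check_returns_int(impl: Implementation) -> None:
    assert isinstance(impl("a,b,c"), int)


def check_counts_three_items(impl: Implementation) -> None:
    assert impl("a,b,c") == 3


verdicts = reverse_classical(
    [check_returns_int, check_counts_three_items],
    count_items_original,
    count_items_patched,
)
print(verdicts)
```

Here `check_returns_int` passes on the buggy code and is flagged as vacuous, while `check_counts_three_items` fails before the fix and passes after it.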

Code Scope

Code agents have a recurring flaw: they modify too many files. A bug in a JSON parser turns into a logging architecture refactor because the agent "takes the opportunity". Code Scope imposes automatic constraints on modified files. If the PR touches files outside the scope, it is penalized, regardless of the functional correctness.

This is the criterion that drops scores the most for the most verbose models. Models that "think too much" tend to extend the scope of their modifications.
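A scope constraint of this kind can be approximated with glob patterns over the list of changed files. The sketch below is an assumption about how such a check might look: the `out_of_scope` helper, the patterns, and the file paths are hypothetical, and how FrontierCode derives the allowed scope for each task is not described here.

```python
from fnmatch import fnmatchcase
from typing import List


def out_of_scope(changed_files: List[str], allowed_patterns: List[str]) -> List[str]:
    """Return the changed paths that match none of the allowed glob patterns."""
    # In fnmatch, "*" also matches "/", so "src/parser/*" covers subdirectories.
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical task: fix a bug in the JSON parser.
allowed = ["src/parser/*", "tests/parser/*"]
changed = [
    "src/parser/json.py",
    "tests/parser/test_json.py",
    "src/logging/setup.py",  # the "while I'm here" refactor
]

offending = out_of_scope(changed, allowed)
scope_ok = not offending
print(offending, scope_ok)
```

In practice the list of changed files would come from the PR diff; the check itself stays independent of whether the tests pass.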

Adaptive Classical Grading

The repo's reference tests are often insufficient or obsolete. Adaptive Classical uses an LLM to adapt and extend these tests before running them, combining the rigor of the original tests with improved coverage. It's a pragmatic compromise between the status quo and a total rewrite of the test suites.

According to CryptoBriefing, these three techniques combined reduce false positives by 81% compared to SWE-Bench Pro. In plain terms: PRs that pass the grader but would be rejected by human maintainers slip through 81% less often than under SWE-Bench Pro.


The Fable 5 paradox: best model, but inaccessible

Fable 5 dominates FrontierCode with 46.3% on Main. This is an impressive result that creates a considerable gap with the rest of the pack. But there is a major problem: Fable 5 is not available.

The model is behind an enterprise paywall and subject to export controls that limit its access outside the United States. For the vast majority of development teams worldwide, Fable 5 is a theoretical benchmark, not a usable tool.

This situation creates a perverse distortion in the market. Companies see a score of 46.3% and expect that level of quality. But when they turn to the best LLMs for coding that are actually accessible, they hit Opus 4.8's 34.3% or GPT-5.5's 25.5%.

The 21-point gap between Fable 5 and GPT-5.5 isn't just a number. It's the difference between "an agent that produces code you quickly review" and "an agent whose every PR requires more cleanup work than writing the code yourself." It's a brutal reminder that public benchmarks and real-world accessibility are two different worlds.


Opus 4.8 vs. GPT-5.5: the cost-intelligence ratio

The battle between Anthropic and OpenAI on FrontierCode tells an interesting story. Opus 4.8 clearly dominates GPT-5.5, especially on Diamond (13.4% vs. 6.3%, a ratio of more than 2:1).

But GPT-5.5 uses up to 4 times fewer tokens than Opus 4.8 for the same tasks, according to data from Cognition. This makes GPT-5.5 the champion of the cost-intelligence ratio: for a given token budget, you can make roughly 4 times as many attempts with GPT-5.5, which can compensate for a lower individual success rate, provided per-token prices are comparable and someone reviews each attempt.
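The "more attempts" argument can be put into numbers. The sketch below assumes that attempts are independent, that token count is proportional to cost, and that the successful attempt is reliably identified. None of this is established by Cognition's data, and real failures tend to be correlated, so the result is best read as an upper bound.

```python
def at_least_one_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


# Main-set scores quoted in this article, used as per-attempt success rates.
opus_single = at_least_one_success(0.343, 1)
gpt_four_tries = at_least_one_success(0.255, 4)

print(f"Opus 4.8, 1 attempt:  {opus_single:.1%}")
print(f"GPT-5.5, 4 attempts: {gpt_four_tries:.1%}")
```

Under these optimistic assumptions, four GPT-5.5 attempts yield at least one mergeable PR about 69% of the time, against 34.3% for a single Opus 4.8 attempt. The comparison ignores the human time needed to review four candidate PRs instead of one.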

In practice, the choice depends on the use case. For critical PRs where every line of code counts (financial systems, critical infrastructure), Opus 4.8 is the only rational choice despite the cost. For less sensitive code where volume takes priority, GPT-5.5 offers a better ROI.

Gemini 3.1 Pro, with 4.7% Diamond, confirms that Google remains behind on high-quality agentic coding, despite the impressive performance of Gemini 3.5 Flash on agent benchmarks in other contexts. Speed is not quality.

Kimi K2.6, the best open-source model in the benchmark with 3.8% Diamond and 16% Main, deserves a mention. For teams looking to run an LLM locally or self-host, it's the only viable option in this ranking, even though the gap with proprietary models remains considerable.


The AI coding market in 2026: $9.3 billion and two opposing strategies

The AI coding assistant market reached 9.3 billion USD in 2026 according to BuildFastWithAI, with 26% annual growth. Other estimates, such as IdeaPlan's, place the market at 12.8 billion USD in 2026, projected to reach 30.1 billion USD by 2032.

But the most revealing figures are the market shares. According to data compiled by Agentic.ai and The Pragmatic Engineer:

| Tool | Market share (June 2026) | Positioning |
|---|---|---|
| Claude Code | ~40-46% | Leader, agentic coding |
| Codex (OpenAI) | ~21% | Challenger, terminal-first |
| Cursor | ~19% | Integrated IDE |
| GitHub Copilot | ~9% | Legacy, completion |

Claude Code dominates with around 40% of the market. This is a remarkable result for a tool released less than two years ago. Anthropic's strategy is clear: do not build an IDE, do not do line completion, but focus exclusively on the agent that writes, tests, and submits code.

This strategy is paying off financially. According to CNBC, Anthropic generated 4.8 billion USD in Q1 2026, with projected revenue of 10.9 billion USD in Q2 2026. The company achieved its first operating profit: 559 million USD in Q2 2026. The annualized revenue run-rate reaches 47 billion USD.

Claude Code alone generates over 1 billion USD in annualized revenue according to SERPsculpt. A single product, a single use case, one billion dollars.


Anthropic profitable, OpenAI at a loss: coding as an economic model

The contrast with OpenAI is striking. While Anthropic pockets its first operating profit thanks to its coding focus, OpenAI is facing projected losses of 14 billion USD for 2026, according to market analyses cited by Agentic.ai.

The lesson is clear: AI coding is not a demo market. It's a market where companies pay a premium (Claude Code Pro and Enterprise plans are among the most expensive on the market) because the ROI is measurable and direct. Every developer hour saved translates into dollars. Every quality PR avoids a production incident.

OpenAI, with its "something for everyone" strategy (consumer chatbot, image, video, agents), dilutes its technical advantage. GPT-5.5 is an excellent generalist model, but on the specific criterion that generates the most revenue — production code quality — it is 21 points behind Fable 5 and more than 8 points behind Opus 4.8.

Anthropic made the opposite bet: being the best at a high-value use case. Among the best LLMs for AI agents, Claude is now the default reference for code. This is no accident.


What FrontierCode reveals about the true state of code agents

Agents don't know how to stay within their scope

The Scope Discipline criterion is probably the most revealing aspect of FrontierCode. Current code agents have a pathological tendency to "expand" the perimeter of a task. A bug fix in one module turns into a refactoring of three adjacent modules.

In production, this is a nightmare for reviewers. A 50-file PR for an off-by-one bug is a PR that gets rejected as a precautionary measure. FrontierCode penalizes exactly this behavior, and the scores suffer massively as a result.

AI-generated tests are often meaningless

Reverse-Classical Grading exposes a systemic problem: agents write tests that pass on everything, including the original buggy code. These tests give an illusion of coverage without any real value. It's the coding equivalent of the phenomenon that DeepWeb-Bench exposed in search agents: results that look correct but verify nothing substantial.

Stylistic quality remains the preserve of humans

The Style criterion is where the models fail the most silently. The code passes the tests, stays within the scope, but doesn't respect the language's idioms or the repo's conventions. The C++ example from jsonschema cited by StartupHub.ai is exemplary: correct code that "smells" like AI, and which any senior developer instinctively rejects.


The limits of FrontierCode

150 tasks is still a small number

Even if each task represents 40+ hours of human work, 150 tasks is a statistically fragile sample. A model optimized on a subset of the repos could artificially inflate its score. Cognition plans to expand the benchmark, but for now, caution is required when interpreting fine differences.
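To see how fragile such a sample is, one can compute a confidence interval around a reported score. The sketch below uses the standard Wilson score interval and treats a score as a simple pass rate over independent tasks. That is an assumption: the benchmark's scores may be rubric-weighted or averaged over several runs, in which case the true uncertainty differs.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Opus 4.8 scores quoted in this article, treated as pass rates.
diamond_low, diamond_high = wilson_interval(0.134, 50)
main_low, main_high = wilson_interval(0.343, 100)

print(f"Diamond (n=50):  {diamond_low:.1%} to {diamond_high:.1%}")
print(f"Main (n=100):    {main_low:.1%} to {main_high:.1%}")
```

Under that simplification, the 13.4% Diamond score is compatible with anything from roughly 7% to 25%, and the 34.3% Main score with roughly 26% to 44%. Gaps of a few points between models on these sets should therefore not be over-interpreted.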

Maintainer bias

The 20+ maintainers who designed the tasks have their own preferences and standards. Code that would be merged in one repo could be rejected in another. The subjectivity of "mergeability" is partially mitigated by the 3,000 grading rubrics, but not eliminated.

The absence of performance metrics

FrontierCode does not measure resolution time or cost per PR. A model that takes 30 minutes and $5 in tokens to solve a task is not differentiated from a model that takes 2 seconds and $0.10. In practice, cost and latency are decision-making criteria just as important as quality for teams.
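A team can fold cost back in with a simple derived metric: expected spend per merged PR. The sketch below assumes that failed attempts are retried independently, so the expected number of attempts is one divided by the merge rate. The two agents and their merge rates are invented for illustration and do not correspond to any model in the benchmark.

```python
def cost_per_merged_pr(cost_per_attempt: float, merge_rate: float) -> float:
    """Expected spend per merged PR when failed attempts are simply retried."""
    if not 0.0 < merge_rate <= 1.0:
        raise ValueError("merge_rate must be in (0, 1]")
    return cost_per_attempt / merge_rate


# Hypothetical agents: the costs echo the figures in the paragraph above,
# the merge rates are made up for the example.
slow_agent = cost_per_merged_pr(cost_per_attempt=5.00, merge_rate=0.50)
fast_agent = cost_per_merged_pr(cost_per_attempt=0.10, merge_rate=0.25)

print(f"Slow agent: ${slow_agent:.2f} per merged PR")
print(f"Fast agent: ${fast_agent:.2f} per merged PR")
```

A fuller version would also price reviewer time per rejected PR, which is often the dominant cost.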


❌ Common mistakes

Mistake 1: Confusing the FrontierCode score with actual productivity

A score of 34% on Main does not mean that 34% of coding work can be automated. It means that out of 100 tasks carefully selected for their difficulty, 34 PRs would be merged. In the daily workflow, with tasks that are often simpler, the usability rate is higher. But the opposite conclusion (believing that 66% of the code is unusable) is just as false.

Mistake 2: Directly comparing FrontierCode and SWE-Bench scores

Both benchmarks measure different things on different tasks. A model at 90% on SWE-Bench and 25% on FrontierCode is not "worse" — it is evaluated on radically stricter criteria. It's like comparing a dictation score (spelling) and an essay score (style, argumentation, structure).

Mistake 3: Choosing your AI coding tool based solely on the benchmark

Benchmarks are snapshots. Iteration speed, IDE integration, price, and latency matter just as much. Cursor with a 19% market share does not dominate FrontierCode, but its developer experience remains superior for many users. The best AI tools for code cannot be reduced to a single ranking.


❓ Frequently Asked Questions

Does FrontierCode permanently replace SWE-Bench?

Not yet. SWE-Bench remains useful for measuring basic functional correctness. But for evaluating a code agent for production use, FrontierCode is now the stronger reference. The two benchmarks are complementary, not interchangeable.

Why is Fable 5 not available?

Fable 5 is a model subject to US export controls and accessible only via an enterprise subscription. This combination of regulatory restrictions and a closed business model makes it unusable for the majority of developers and teams outside the United States.

Which model to choose for production coding today?

Claude Opus 4.8 via Claude Code offers the best quality-reliability ratio according to FrontierCode. GPT-5.5 is an excellent cost-performance trade-off. For open-source or local needs, Kimi K2.6 is the best available option.

Are Diamond scores too harsh?

13.4% for Opus 4.8 seems low, but Diamond groups together the 50 hardest tasks, each requiring 40+ hours of senior work. These scores reflect the difficulty of complex coding in production. The benchmark is not harsh; the problem is difficult.


✅ Conclusion

FrontierCode marks the end of the era of lenient benchmarks where every model could claim supremacy by gaming the criteria. By measuring the actual mergeability of PRs (scope, tests, style, standards), Cognition has created the first faithful mirror of what developers have known from the start: code that passes tests isn't necessarily code that gets merged.

The numbers speak for themselves: Opus 4.8 at 34.3%, GPT-5.5 at 25.5%, and above all Diamond at a maximum of 13.4% for a public model. AI coding is powerful, but production quality remains an immense challenge. To go further on the best LLMs for coding and understand how these results translate into daily practice, the monthly ranking remains your best compass.