Anthropic Glasswing : Claude Mythos discovers 10,000+ critical vulnerabilities in a month — and the real problem is patching
🔎 When AI finds everything, but nobody fixes it
Anthropic just released the first results for Project Glasswing, and the numbers are staggering. In one month, Claude Mythos Preview helped 50 partner organizations — including Cloudflare and Mozilla — discover more than 10,000 high- or critical-severity vulnerabilities in critical software.
The bug discovery rate was multiplied by 10 compared to traditional fuzzing and manual audit methods. That's the triumphant part.
The awkward part: of the thousands of vulnerabilities reported to upstream maintainers, only 97 have been fixed to date. That's a patching rate of less than 1%.
The gap between discovery and remediation is not a new problem. But Glasswing turns it into a systemic crisis: AI amplifies detection capabilities without the patching ecosystem keeping up. We have never had so many known and unpatched vulnerabilities.
The key points
- Project Glasswing: 50 partners, 10,000+ high/critical vulnerabilities found in one month with Claude Mythos Preview.
- The discovery rate was multiplied by 10 compared to classic tools (fuzzers, SAST).
- Only 97 bugs were patched out of the thousands reported upstream — less than 1%.
- The bottleneck is no longer detection; it's the correction, validation, and deployment of the patch.
- The study Benchmarking Mythos-Linked Bug Rediscovery validates the reliability of Mythos's discoveries.
Recommended tools
| Tool | Use case | Price | Best for |
|---|---|---|---|
| Claude Opus 4.7 (Adaptive) | Agentic code analysis, security audit | Variable price (June 2025, check on anthropic.com) | Security teams needing deep reasoning |
| GPT-5.5 | Code review, pattern detection | Variable price (June 2025, check on openai.com) | Developers integrating auditing into their workflow |
| DeepSeek V4 Pro (Max) | LLM-assisted static analysis | Variable price (June 2025, check on deepseek.com) | Teams looking for a cost-effective alternative |
| Hostinger | Secure hosting for scanning infra | Starting at €2.99 (June 2025, check on hostinger.com) | Deploying self-hosted audit tools |
Project Glasswing: what it actually is
Glasswing is a program launched by Anthropic to test Claude Mythos Preview at real-world scale on production software. No synthetic benchmarks: real code, real dependencies, real users.
The principle is simple. Anthropic gives partners access to Mythos Preview; the partners integrate it into their existing security pipelines; the results are collected, verified, and reported back to the maintainers of the projects concerned.
The sandboxing infrastructure behind Mythos is documented in the study Mythos and the Unverified Cage: Z3-Based Pre-Deployment Verification for Frontier-Model Sandbox Infrastructure. Anthropic developed a formal verification system based on Z3 to ensure that the model cannot escape the sandbox during the analysis of potentially malicious code.
This is a crucial point: we are literally asking an LLM to look for exploits in code. If the sandbox is defective, we create an attack tool, not a defense tool.
First month figures: 10,000+ and 50 partners
The data Anthropic published in its initial update is unequivocal.
50 organizations participated in this first phase. Among them, Cloudflare and Mozilla are the most visible, but the program also includes cloud infrastructure companies, open source software publishers, and internal security teams from large enterprises.
The volume: more than 10,000 vulnerabilities rated high or critical on the CVSS scale. This is not scan noise. The Benchmarking Mythos-Linked Bug Rediscovery study reports a known-bug rediscovery rate close to 94%. Strictly speaking, that figure measures recall (how many already-known bugs the model finds again), not precision; it is the low false-positive rate cited from the same benchmark (under 6%, see the comparison table below) that rules out the 10,000 findings being mostly false alarms.
The 10x factor is not a marketing estimate. It is the direct comparison between Mythos's results and those of pre-existing tools (AFL++, libFuzzer, commercial SAST) on the same targets, during the same period.
In terms of models involved, Claude Mythos Preview relies on the Claude Opus family. Claude Opus 4.7 (Adaptive), which scores 94.3 on agentic benchmarks and 90 overall, is its main engine. Code reasoning tasks benefit directly from this agentic capability.
How Mythos finds vulnerabilities that traditional tools miss
The fundamental difference between Mythos and a traditional fuzzer is its semantic understanding of the code.
A fuzzer like AFL++ generates random or mutated inputs to crash a program. It is blind to business logic, and its only oracle is a crash, a sanitizer report, or a timeout: a logic bug that quietly returns the wrong answer gives it no signal at all. If it doesn't reach a particular code path, it will never find the vulnerability hidden there.
Mythos, on the other hand, reads the source code. It understands invariants, assertions, and state transitions. It identifies places where an invariant is supposed to hold but where nothing actually enforces it. It then generates targeted inputs to precisely violate this invariant.
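To make the "unenforced invariant" idea concrete, here is an added example written for this article (it is not Mythos output, nor code from Anthropic or any Glasswing partner). It shows a hypothetical one-time-code session whose documented lockout invariant is never enforced on one transition, the hardened version that enforces it, and the targeted call sequence an invariant-aware analyzer would produce to violate it. Class and function names, the secret code, and the attempt limit are all constructed for the example.

```python
"""Added example (not Anthropic's, Glasswing's or any partner's code): a toy
one-time-code session whose documented lockout invariant is assumed but not
enforced, next to a hardened version, plus the targeted input sequence that
violates it. All names are hypothetical; the secret is constructed."""
from typing import Iterable, Optional

MAX_ATTEMPTS = 3   # assumed lockout threshold for the example


class VulnerableOtpSession:
    """Documented invariant: after MAX_ATTEMPTS wrong guesses the session is
    LOCKED and accepts no further guesses. Nothing enforces it on the
    request_code() transition, which is reachable from LOCKED and resets the
    counter. Nothing here crashes, so a coverage-guided fuzzer without a
    custom oracle has no signal to report."""

    def __init__(self, secret: str) -> None:
        self._secret = secret          # constructed secret, never a real one
        self.state = "NEW"
        self.attempts = 0

    def request_code(self) -> None:
        # Bug: no precondition on the current state, unconditional counter reset.
        self.state = "CODE_SENT"
        self.attempts = 0

    def verify_code(self, guess: str) -> bool:
        if self.state != "CODE_SENT":
            return False
        self.attempts += 1
        if guess == self._secret:
            self.state = "VERIFIED"
            return True
        if self.attempts >= MAX_ATTEMPTS:
            self.state = "LOCKED"
        return False


class HardenedOtpSession(VulnerableOtpSession):
    """Same flow, with the invariant enforced on the transition that broke it:
    LOCKED is terminal and the attempt counter survives re-requests."""

    def request_code(self) -> None:
        if self.state == "LOCKED":
            raise PermissionError("session is locked")
        self.state = "CODE_SENT"       # attempts deliberately not reset


def all_four_digit_codes() -> Iterable[str]:
    """Candidate space for the constructed 4-digit code."""
    return (f"{i:04d}" for i in range(10_000))


def lockout_bypass(session: VulnerableOtpSession,
                   candidates: Iterable[str]) -> Optional[str]:
    """The targeted input an invariant-aware analyzer would generate: guess
    until the counter saturates, then re-request a code to clear the lock and
    keep guessing. Returns the recovered secret, or None if the lockout held."""
    session.request_code()
    for guess in candidates:
        if session.state == "LOCKED":
            try:
                session.request_code()
            except PermissionError:
                return None            # lockout held: invariant enforced
        if session.verify_code(guess):
            return guess
    return None


if __name__ == "__main__":
    print("vulnerable:", lockout_bypass(VulnerableOtpSession("0042"), all_four_digit_codes()))
    print("hardened:  ", lockout_bypass(HardenedOtpSession("0042"), all_four_digit_codes()))
```

The vulnerable session yields the full secret after cycling through the lock; the hardened one returns None on the first attempt to clear it. The fix is deliberately on the transition, not on the guess path: the invariant "LOCKED is terminal" has to be checked wherever the state can change, which is exactly the kind of missing precondition a reader of the source notices and a crash-driven fuzzer does not.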
The study AI Governance and Accountability: An Analysis of Anthropic's Claude raises an interesting point regarding the governance of this process: when an LLM finds a vulnerability, who is responsible for the disclosure? Anthropic has opted for a model where the partner retains total control over the disclosure flow. Mythos is a tool, not an arbiter.
The typical pipeline in Glasswing looks like this: the partner targets a component → Mythos analyzes the code and dependencies → it generates PoCs (Proofs of Concept) in a formally verified sandbox → the PoCs are validated by the partner's security team → the findings are reported upstream.
This is where the rub lies.
The real scoop: 97 patches out of 10,000+ findings
The Hacker News reports that only 97 bugs have been fixed at this stage. The figure is confirmed by Security Boulevard, which describes the obstacles to patching exposed by the program.
Less than a 1% fix rate. That is the number that should keep CISOs awake at night.
Why such a gap? Several factors converge.
First, the volume. Open source maintainers already receive hundreds of reports per month. Adding 200 valid findings at once to a single project is not a relief, it's a cognitive overload.
Second, trust. An AI-generated vulnerability report does not have the same credibility as a report from a human researcher with a known track record. Maintainers want to validate each finding manually. With 10,000 findings, validation alone takes months.
Third, the complexity of the patches. A buffer overflow vulnerability is relatively simple to fix. But Mythos finds complex logic bugs — broken invariants in state machines, subtle race conditions, authorization flaws in permission graphs. Fixing these bugs without breaking backward compatibility requires architectural work, not a quick fix.
The anatomy of the patching bottleneck
The problem is not new. The security industry was already aware of the gap between discovery and remediation. CISA's annual joint advisory on the most routinely exploited vulnerabilities, co-sealed by the NSA, the FBI and partner agencies, repeats it every year: organizations do not patch fast enough.
But Glasswing changes the scale. We go from a dripping faucet to a fire hose.
The typical patching pipeline has five steps: discovery → validation → remediation → review → deployment. Before Mythos, the bottleneck was discovery. Now it has shifted to the next four steps, and they do not scale in the same way.
Validation is linear: a human must verify each finding. Remediation is linear: a developer must write each patch. Review is linear: a maintainer must read and approve each PR. Deployment is constrained by release cycles.
Only discovery scales horizontally with compute, because it is the only step where parallelizable machine work does the job. Every other step is paid for in human hours. Result: a massive downstream bottleneck.
This is a flow engineering problem, not an AI problem. And it needs to be addressed as such.
Partners speak: Cloudflare, Mozilla and the rest
Cloudflare has been one of the most active partners in the program. Their infrastructure handles more than 20% of global web traffic. Any flaw in their stack is potentially catastrophic.
Their security teams have integrated Mythos into their internal audit pipeline. Result: hundreds of findings in internal components and open source dependencies. But Cloudflare also has the resources to handle a high volume of reports — which is not the case for a solo maintainer of an npm package with 2 million downloads.
Mozilla, for its part, used Mythos on Firefox components and their sync infrastructure. The findings include bugs in the parsing of complex web formats — exactly the type of code where traditional fuzzers are already good, but where Mythos found attack vectors that the fuzzers did not reach.
The contrast between these large players and small maintainers is the heart of the problem. Anthropic is aware of this asymmetry, but solving it goes beyond the scope of a research program.
Mythos governance: a model under tension
The study AI Governance and Accountability: An Analysis of Anthropic's Claude precisely analyzes the tension Anthropic faces with this type of tool.
On one hand, making Mythos more aggressive increases the number of discoveries. On the other hand, each false positive generates unnecessary work for maintainers, and each unpatched true positive creates liability — or at least a reputational risk.
Anthropic has chosen a conservative balance: Mythos produces a low false positive rate (the rediscovery benchmark reports the precision figure alongside the recall figure), but the model makes no disclosure decisions. It is always the human partner who decides when, how, and to whom to report the bug.
This governance model is reasonable, but it does not solve the volume problem. The moral responsibility of creating 10,000 uncorrected findings remains unclear. Anthropic provides the tool, the partner makes the report, the maintainer does not patch. Who is at fault? No one, and everyone.
What this means for AI-assisted audit models
The Benchmarking Mythos-Linked Bug Rediscovery benchmark establishes an important methodology: to evaluate an AI vulnerability discovery tool, it is not enough to measure the number of findings. You must measure the re-discovery rate on known bugs (to calibrate sensitivity) and the false positive rate (to calibrate specificity).
Mythos performs extremely well on both of these metrics. This means that the 10,000 findings are likely mostly real bugs. The problem is not the quality of the findings, it is the throughput of the remediation pipeline.
For teams considering using LLMs for code auditing — whether with Claude Opus 4.7 or GPT-5.5 — the lesson from Glasswing is clear: only deploy an augmented discovery tool if you have planned an augmented remediation pipeline to match. Otherwise, you are turning a hidden problem into a visible but still unsolved problem.
In the context of the comparison of the best LLMs for coding, Claude Opus 4.7 (Adaptive) stands out precisely on this type of task: its agentic score of 94.3 reflects its ability to maintain coherent reasoning over long code analysis traces, which is essential for a security audit.
Sandbox infrastructure: why it's non-negotiable
The study Mythos and the Unverified Cage details the security architecture surrounding Mythos. This is a point often overlooked in media coverage, but it is central.
When you ask an LLM to analyze code for vulnerabilities, you are effectively giving it the ability to understand and generate exploits. If the sandbox isolating the model is faulty, the model can use this understanding to escape.
Anthropic uses a formal verifier based on Z3 (an SMT solver) to mathematically prove that certain security properties of the sandbox hold. This is not testing, it is formal proof. It is computationally expensive, but necessary when the model being sandboxed is itself a frontier model. The usual caveat applies: a proof covers only the properties that were written down, and only to the extent that the formal model of the sandbox matches the code actually deployed.
This approach is consistent with Anthropic's security philosophy, which has always favored formal verification over empirical testing for critical security issues. But it also raises the question: is this sandbox infrastructure accessible to Glasswing partners, or does it remain internal to Anthropic?
Anthropic and the computational power dynamic
The broader context of Glasswing fits into Anthropic's computational strategy. Let's recall that Anthropic signed with SpaceX for Colossus 1: 220,000 GPUs and 300 MW for Claude. This level of computational investment is not solely intended for model training. It also powers infrastructures like Mythos's formal sandboxing and the agentic capabilities that make Glasswing possible.
At the same time, the launch of Claude Code Agent View: the dashboard that kills the split-screen terminal shows that Anthropic is thinking about the developer experience end-to-end. Mythos's security audit is not an isolated tool — it integrates into an ecosystem where the Claude agent works in a visible and traceable manner.
And research on Anthropic Dreaming: Claude agents learn from their dreams between sessions suggests that future versions of Mythos could improve their detection capabilities between analysis sessions, without full retraining. An agent that "dreams" of vulnerability patterns between two scans could theoretically refine its understanding of bug classes.
Comparison: Mythos vs. classic approaches
| Approach | Discovery rate (relative) | False positive rate | Cost per scan | Scalability |
|---|---|---|---|---|
| Fuzzing (AFL++, libFuzzer) | 1x (baseline) | Very low | Low (CPU) | High (parallelizable) |
| Commercial SAST | 0.3-0.5x | High (30-70%) | Medium | High |
| Manual audit | 0.8-1.2x | Almost zero | Very high | Very low |
| Claude Mythos Preview | 10x | Low (< 6%) | High (GPU) | Medium (limited by infra) |
The table clearly shows the trade-off. Mythos crushes the other approaches in terms of discovery rate, with a controlled false positive rate. But the cost per scan is significantly higher, and scalability is constrained by GPU availability — hence the strategic importance of investments like Colossus 1.
❌ Common mistakes
Mistake 1: Confusing finding volume with security level
Finding 10,000 vulnerabilities does not make a system more secure. Only fixed vulnerabilities improve security. Presenting the raw number as a victory is misleading, and Anthropic doesn't do this either — they themselves highlight the patching problem.
Mistake 2: Deploying Mythos (or any auditing LLM) without a remediation pipeline downstream
If your security team can process 50 findings per month and the tool generates 500, you haven't improved your security. You've created an anxiety-inducing backlog. Size the discovery throughput to match the remediation throughput.
Mistake 3: Neglecting the sandbox
Letting an LLM search for and execute exploits outside a sandbox means every PoC it generates can run against real systems, and the model itself can act on what it learns. The study on Z3-Based Verification shows that even Anthropic makes no compromises on this. Neither should you.
Mistake 4: Upstreaming findings in bulk
A maintainer who receives a report with 200 vulnerabilities all at once will classify it as spam. Prioritize, sort by exploitability, and upstream based on the target project's absorption capacity.
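The flow-engineering point made in "The anatomy of the patching bottleneck" and in Mistakes 2 and 4 above can be turned into something a team can actually run. The following is an added example written for this article, not a tool from Anthropic, Glasswing or any partner: it ranks validated findings by exploitability before raw CVSS, batches them per project according to that project's monthly absorption capacity, and projects how a backlog evolves when inflow exceeds remediation capacity. The finding records, the capacity figures, the field names and the default capacity are all constructed assumptions.

```python
"""Added example (not code from Anthropic, Glasswing or any partner): a triage
and upstreaming planner plus a backlog projection. Finding records, capacities
and rates are constructed for the example; field and function names are
hypothetical."""
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Assumed monthly absorption capacity for a project with no stated figure.
DEFAULT_CAPACITY = 5


@dataclass(frozen=True)
class Finding:
    id: str
    project: str
    cvss: float
    exploitable: bool   # PoC reproduced by the partner's security team
    reachable: bool     # sink reachable from untrusted input in the shipped config


def priority(f: Finding) -> Tuple[bool, bool, float]:
    """Rank: reproduced PoC first, then reachability, then CVSS as tiebreaker.
    A 9.8 nobody can reach should not crowd out a 7.5 with a working PoC."""
    return (f.exploitable, f.reachable, f.cvss)


def plan_upstreaming(findings: Iterable[Finding], capacity: Dict[str, int]
                     ) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Return (this_month, backlog): per project, the finding ids to report now,
    sized to that project's capacity, and the ids held back, in priority order."""
    by_project: Dict[str, List[Finding]] = defaultdict(list)
    for f in findings:
        by_project[f.project].append(f)
    this_month: Dict[str, List[str]] = {}
    backlog: Dict[str, List[str]] = {}
    for project, fs in by_project.items():
        fs.sort(key=priority, reverse=True)
        cap = max(0, capacity.get(project, DEFAULT_CAPACITY))
        this_month[project] = [f.id for f in fs[:cap]]
        backlog[project] = [f.id for f in fs[cap:]]
    return this_month, backlog


def backlog_projection(initial: int, inflow: int, capacity: int, months: int) -> List[int]:
    """Backlog at the end of each month when `inflow` validated findings arrive
    and `capacity` can be remediated. Discovery scales with compute; the human
    steps do not, so any inflow above capacity grows the backlog without bound."""
    sizes: List[int] = []
    backlog = initial
    for _ in range(months):
        backlog = max(0, backlog + inflow - capacity)
        sizes.append(backlog)
    return sizes


# Constructed sample: a solo-maintained library and a well-staffed vendor component.
SAMPLE_FINDINGS = [
    Finding("F1", "libfoo", 9.8, exploitable=True, reachable=True),
    Finding("F2", "libfoo", 9.1, exploitable=False, reachable=True),
    Finding("F3", "libfoo", 7.5, exploitable=True, reachable=True),
    Finding("F4", "libfoo", 8.8, exploitable=False, reachable=False),
    Finding("F5", "libfoo", 7.2, exploitable=True, reachable=False),
    Finding("F6", "libfoo", 9.0, exploitable=False, reachable=False),
    Finding("G1", "gateway", 8.1, exploitable=True, reachable=True),
    Finding("G2", "gateway", 6.5, exploitable=False, reachable=True),
]
SAMPLE_CAPACITY = {"libfoo": 2, "gateway": 50}   # assumed monthly capacities

if __name__ == "__main__":
    now, held = plan_upstreaming(SAMPLE_FINDINGS, SAMPLE_CAPACITY)
    print("report now:", now)
    print("hold back: ", held)
    print("3 new/month vs 2 fixed/month:", backlog_projection(4, 3, 2, 6))
    print("3 new/month vs 5 fixed/month:", backlog_projection(4, 3, 5, 6))
```

On the constructed sample, libfoo receives only its two highest-priority findings this month (the two with reproduced PoCs on reachable sinks) while the 9.1 that nobody has reached yet waits its turn; the projection shows the backlog growing by one per month whenever inflow exceeds capacity and draining to zero within two months once it does not. Replace the capacity figures with what each maintainer has actually agreed to, not what you hope they can absorb.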
❓ Frequently Asked Questions
Is Claude Mythos available to the public?
No. For now, Mythos Preview is only accessible as part of Project Glasswing, by invitation from Anthropic. There is no public API or self-hosted version.
What types of software do the 10,000 vulnerabilities affect?
A mix of critical open source libraries (parsing, crypto, networking), cloud infrastructure components, and proprietary code from partners. The most common vulnerability classes are injection, memory corruption, and authorization flaws.
Is a 1% patching rate really surprising?
Not if you look at historical data. The average time to patch a CVE is 60 to 120 days according to industry studies. With 10,000 findings reported within a single month, even a median fix time of 30 days means most of them are still inside their expected fix window at the time of the update. The problem is volume, not individual slowness.
Does Mythos replace fuzzers?
No, it's complementary. Fuzzers remain excellent for finding crashes in parsers and binary formats, at a very low cost. Mythos excels at business logic, state machines, and permission graphs. Together, the two cover more ground.
Which Claude model is used behind Mythos?
Claude Mythos Preview is based on the Claude Opus family, with specific adaptations for security reasoning. Claude Opus 4.7 (Adaptive), Anthropic's most powerful agentic model (94.3 on the agentic benchmark), is the likely base.
✅ Conclusion
Glasswing proves that Claude Mythos can multiply vulnerability discovery by 10 — but above all, it proves that the software industry's patching pipeline is not scaled for this reality. 10,000 findings and 97 patches: the bottleneck has shifted sides, and it must be addressed with the same urgency we put into improving detection. The next battle in software security will not be won with better scanners, but with better remediation processes.