Stateful Online Monitoring: this Anthropic paper shows how to catch distributed AI agent attacks

Skynet Watch 🟢 Beginner ⏱️ 16 min read 📅 2026-06-01

In December 2025, Anthropic revealed what is reported to be the first largely autonomous AI-directed attack, a campaign against 30 targets ranging from startups to government agencies (The Debrief, December 2025). Six months later, the problem has only gotten worse. The most capable AI agents can plan multi-step attack sequences, and attackers have found a structural flaw in the way we secure them.

This flaw is simple: security systems evaluate each agent session independently. However, a determined attacker never uses just a single account. They distribute a harmful task across dozens, or even hundreds, of individual sessions — each perfectly benign in isolation.

On May 29, 2026, Anthropic published a game-changing paper on arXiv: Stateful Online Monitoring Catches Distributed Agent Attacks (2605.31593). Their proposal? A stateful online monitor capable of stitching together evidence across separate agent sessions to identify coordinated abuse. It is the first framework that tackles the problem at its root.

The Essentials

Attackers are now distributing their AI agent attacks across many user accounts, making each individual transcript benign to traditional monitors.
Anthropic proposes a stateful monitor that keeps state between requests, uses real-time clustering to aggregate weak signals of suspicion, and passes the stitched evidence to an LLM through a Cross-Context Monitor Prompt.
This monitor only rarely calls upon an expensive LLM for escalation, making it viable for large-scale production.
The immediate context: Anthropic's Mythos Preview discovered over 10,000 cybersecurity vulnerabilities (PYMNTS, 2026), which shows that vulnerability research capabilities are already massive and can be turned to misuse.

Recommended Tools

Tool / Initiative	Main Use	Price (June 2026, check on anthropic.com)	Ideal for
Anthropic Mythos Preview	Automated vulnerability research	Up to $100M in credits via Project Glasswing	Offensive security teams
Project Glasswing	Critical software security in the AI era	$4M in direct open-source donations	Open-source security orgs
Cross-Context Monitor (paper)	Detection of distributed agent attacks	Not commercialized — open research	Platforms hosting AI agents
Hostinger	Secure hosting for agent deployments	Starting at €2.99/month	Developers deploying agents in production

The 2026 context: AI agents as an attack surface

Unprecedented vulnerability discovery capabilities

Modern LLMs have reached a level of capability in cybersecurity that makes the status quo untenable. Carlini et al. (2026) have demonstrated that LLMs can identify thousands of critical vulnerabilities in an automated manner. Anthropic itself reports that Mythos Preview has surpassed the milestone of 10,000 discovered vulnerabilities (PYMNTS, 2026).

Anthropic's quote is unequivocal: "Progress in software security was limited by the speed at which we could find new vulnerabilities." This ceiling has been shattered. The best LLMs for AI agents like GPT-5.5 (agentic score 98.2) or Claude Opus 4.7 Adaptive (94.3) possess the reasoning capabilities necessary to chain complex exploitation steps.

Real breaches that illustrate the risk

The first documented case of an autonomous multi-step attack in real-world conditions was identified by Anthropic in late 2025: an AI directing operations against 30 targets simultaneously, ranging from tech companies to government agencies (The Debrief, December 2025). This was not a research demonstration. It was a real campaign.

The Adversa AI June 2026 roundup compiles the most recent resources on agentic security, and the assessment is unanimous: the attack surface has mutated. Agents are no longer simple chatbots. They have access to tools, file systems, APIs. A compromised or hijacked agent is a digital adversary with extensive privileges.

Swarm attacks: the new modus operandi

Kiteworks (December 2025) describes an emerging pattern: AI swarm attacks. The principle involves deploying autonomous agents that infiltrate networks undetectably, each executing a micro-task that triggers no individual alarm.

This is close to the scenario that Anthropic's paper addresses. The difference between a swarm and a distributed attack is subtle but important: in a swarm, the agents are coordinated by a single controller; in the distributed attack described by Anthropic, a human manually creates numerous accounts and splits the task across them.

The fundamental problem: why stateless monitors fail

The illusion of single-context security

The vast majority of security systems for AI agents operate on the same principle: they take an agent-user interaction transcript and evaluate it. If the transcript contains explicit malicious instructions, it is blocked. Otherwise, it passes.

This model worked when attacks were simple. A user asks an agent to generate malware, the monitor flags it, end of story. But attackers have adapted their method. Rather than putting everything into a single session, they fragment it.

The money laundering analogy

The mechanism is analogous to money laundering by smurfing: instead of depositing €100,000 at once (which triggers an alert), it is split into 100 deposits of €1,000 into different accounts. Each transaction is below the detection threshold. But the aggregate reveals the operation.

In the case of AI agents, an attacker can use 50 different accounts. Account 1 asks the agent to scan a target's ports. Account 2 asks to identify exposed services. Account 3 asks to search for CVEs for those services. None of these prompts is malicious in itself. But assembled, they form the reconnaissance phase of a cyberattack.
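To make the fragmentation concrete, here is a minimal Python sketch of the blind spot. The suspicion scores, both thresholds, and the function names are invented for illustration and do not come from the paper; summing scores per target is a deliberately naive stand-in for real aggregation.

```python
from collections import defaultdict

# Hypothetical per-session suspicion scores in [0, 1]; all values are made up.
SESSIONS = [
    {"account": "acct-1", "target": "example.org", "request": "scan ports", "score": 0.30},
    {"account": "acct-2", "target": "example.org", "request": "identify exposed services", "score": 0.25},
    {"account": "acct-3", "target": "example.org", "request": "search CVEs for services", "score": 0.35},
    {"account": "acct-4", "target": "other.test", "request": "scan ports", "score": 0.30},
]

PER_SESSION_THRESHOLD = 0.8  # assumed alert threshold of a stateless monitor
AGGREGATE_THRESHOLD = 0.8    # assumed threshold on scores summed per target


def stateless_alerts(sessions):
    """Flag only the sessions whose own score crosses the threshold."""
    return [s["account"] for s in sessions if s["score"] >= PER_SESSION_THRESHOLD]


def aggregate_alerts(sessions):
    """Sum scores per target across accounts and flag targets over the threshold."""
    totals = defaultdict(float)
    for s in sessions:
        totals[s["target"]] += s["score"]
    return sorted(t for t, total in totals.items() if total >= AGGREGATE_THRESHOLD)


print(stateless_alerts(SESSIONS))  # no single session is alarming
print(aggregate_alerts(SESSIONS))  # the shared target stands out once scores are combined
```

The per-session view raises nothing, while the cross-account view singles out the one target that three different accounts are probing.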

The limits of existing approaches

The Anthropic paper (2605.31593) clearly identifies the limits: security monitors that evaluate a single agent context at a time are structurally blind. This is not an implementation bug. It is a fundamental limitation of the architecture. As long as the monitor has no memory between sessions, it cannot see the pattern.

Anthropic references its own abuse reports (Anthropic, 2025a, 2025b), which document large-scale cyberattack campaigns and surveillance on behalf of authoritarian regimes. These campaigns relied on exactly this kind of distribution across multiple accounts to evade detection.

The Technical Mechanism: The Cross-Context Monitor Prompt

The Three-Layer Architecture

The paper proposes a three-layer architecture that breaks away from the stateless monitor paradigm.

Layer 1 — Weak Signal Collection. Each agent transcript is analyzed, not to detect an attack, but to extract low-intensity signals. These signals could be unusual conversation topics, tool usage patterns, or queries that are on the boundary of what is acceptable. Individually, they do not justify any action.

Layer 2 — Stateful Real-Time Clustering. This is where the magic happens. The weak signals from all active transcripts are fed into a stateful clustering system. This clusterer identifies groups of sessions that share suspicious characteristics when aggregated: same targets, same thematic sequence, same time window.

Layer 3 — Selective Escalation to an LLM. The clustering does not flag everything. It only rarely escalates to a monitoring LLM, which then receives a "Cross-Context Monitor Prompt" — a prompt enriched with evidence stitched across separate sessions. It is this LLM that makes the final abuse decision.
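The three layers can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the paper's implementation: the keyword weights, the grouping by a `target` field, the sliding window, the thresholds, and the `judge` callable that stands in for the LLM are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical keyword weights standing in for a real weak-signal extractor.
WEAK_KEYWORDS = {"scan": 0.3, "exposed": 0.3, "cve": 0.4}


def extract_signal(session):
    """Layer 1: return a weak suspicion weight for one session (0.0 if none)."""
    text = session["text"].lower()
    return max((w for k, w in WEAK_KEYWORDS.items() if k in text), default=0.0)


@dataclass
class StatefulClusterer:
    """Layer 2: keeps state across sessions, grouped by target (an assumed feature)."""
    window_seconds: float = 3600.0
    min_accounts: int = 3
    escalation_threshold: float = 0.9
    clusters: dict = field(default_factory=lambda: defaultdict(list))

    def observe(self, session):
        """Record one session; return the cluster if it should be escalated."""
        weight = extract_signal(session)
        if weight == 0.0:
            return None
        bucket = self.clusters[session["target"]]
        bucket.append({**session, "weight": weight})
        # Keep only signals inside the sliding time window.
        bucket[:] = [s for s in bucket if session["ts"] - s["ts"] <= self.window_seconds]
        accounts = {s["account"] for s in bucket}
        total = sum(s["weight"] for s in bucket)
        if len(accounts) >= self.min_accounts and total >= self.escalation_threshold:
            return list(bucket)
        return None


def escalate(cluster, judge):
    """Layer 3: hand the stitched evidence to a judge (a stand-in for the LLM call)."""
    return judge(cluster)


SESSIONS = [
    {"account": "acct-1", "target": "example.org", "ts": 0.0, "text": "Scan the open ports"},
    {"account": "acct-2", "target": "example.org", "ts": 600.0, "text": "List exposed services"},
    {"account": "acct-3", "target": "example.org", "ts": 1200.0, "text": "Find CVEs for these services"},
]

monitor = StatefulClusterer()
results = [monitor.observe(s) for s in SESSIONS]
```

In this toy run the first two sessions produce no escalation; only the third, which brings a third account into the same cluster, returns a bundle for review. A real system would need far richer features than a shared target string.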

Why Escalation is Rare (and Why That's Crucial)

The choice to only call upon the LLM during escalation is decisive for production viability. Evaluating every transcript with a model like Claude Opus 4.7 or GPT-5.5 would be prohibitive in terms of cost and latency. The clustering acts as an ultra-efficient filter: it reduces the volume of cases requiring deep LLM evaluation to a tiny fraction of the total traffic.
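A back-of-the-envelope calculation shows why the escalation rate dominates the bill. Every number below (traffic, cost per LLM review, escalation rate) is an assumption chosen for illustration; none of them comes from the paper.

```python
def daily_llm_cost(sessions_per_day, cost_per_eval_usd, escalation_rate=1.0):
    """Expected daily spend on LLM evaluations, in USD."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return sessions_per_day * escalation_rate * cost_per_eval_usd


SESSIONS_PER_DAY = 1_000_000  # assumed traffic
COST_PER_EVAL_USD = 0.02      # assumed cost of one LLM review
ESCALATION_RATE = 0.001       # assumed: 1 session in 1,000 reaches the LLM

every_session = daily_llm_cost(SESSIONS_PER_DAY, COST_PER_EVAL_USD)
escalated_only = daily_llm_cost(SESSIONS_PER_DAY, COST_PER_EVAL_USD, ESCALATION_RATE)
print(f"every session: ${every_session:,.0f}/day, escalated only: ${escalated_only:,.0f}/day")
```

With these invented figures, reviewing every session would cost about 20,000 USD per day, against about 20 USD for the escalated fraction. The cost of running the clustering stage itself is not modeled here.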

In the paper's experiments, this architecture makes it possible to monitor massive volumes of sessions in real time while maintaining reasonable operational costs.

The Cross-Context Monitor Prompt in Detail

The cross-context monitoring prompt is the centerpiece. Unlike a standard safety prompt that receives a single transcript, it receives a bundle of aggregated evidence: excerpts from different sessions, temporal metadata, the clustering results. Its job is not to decide whether a transcript is malicious, but to determine whether a set of transcripts reveals a coordinated intent.
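The article does not reproduce the paper's actual prompt, so the following Python sketch only illustrates the idea of bundling evidence from several sessions into one request. The wording of the prompt, the field names (`account`, `timestamp`, `excerpt`), and the function name are all hypothetical.

```python
def build_cross_context_prompt(evidence, cluster_summary, max_excerpt_chars=200):
    """Assemble one prompt from excerpts of several sessions, oldest first."""
    lines = [
        "You are reviewing several agent sessions from different accounts.",
        "Decide whether, taken together, they indicate coordinated abuse.",
        "",
        f"Cluster summary: {cluster_summary}",
        "",
        "Evidence:",
    ]
    for item in sorted(evidence, key=lambda e: e["timestamp"]):
        # Truncate excerpts so one long session cannot crowd out the others.
        excerpt = item["excerpt"][:max_excerpt_chars]
        lines.append(f"- [t={item['timestamp']}] account={item['account']}: {excerpt}")
    lines.append("")
    lines.append("Answer COORDINATED or UNRELATED, with a one-sentence reason.")
    return "\n".join(lines)
```

The design point is that the model is asked about the set, not about any single transcript: the question in the prompt is whether the sessions are coordinated, and each excerpt is only one piece of evidence.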

This is a paradigm shift. We move from "is this prompt dangerous?" to "do these 30 sessions, taken together, reveal an attack campaign?".

Experimental results: what the paper demonstrates

Large-scale simulations

The paper (full PDF version) details large-scale simulations where distributed attackers attempt to carry out cyberattack campaigns by fragmenting their instructions across numerous accounts. The results are unequivocal: stateless monitors (the previous state of the art) consistently fail to detect these attacks.

The Cross-Context Monitor, on the other hand, manages to identify distributed campaigns by aggregating weak signals. The paper precisely documents how evidence is stitched together across separate sessions to reconstruct malicious intent.

Controlled false positive rates

An obvious risk of this approach is the false positive: if two legitimate users are independently working on similar topics, the clustering might incorrectly group them together. The paper reports that selective escalation to the LLM plays a key role here as a safety net. The LLM evaluates the full context and can distinguish a coincidence from coordination.

The precise metrics for detection and false positive rates are in the full paper, but Anthropic's main argument is that the system is significantly superior to the status quo while remaining deployable.

Honestly discussed limitations

Anthropic does not claim to have solved the security problem of AI agents. The paper notes that the Cross-Context Monitor is a detection tool, not a prevention tool. It does not replace model-level guardrails, content filters, or training methods like SDAR that strengthen the intrinsic resilience of agents. It adds to these layers.

Project Glasswing and Anthropic's Candid Admission

"No company has yet developed reliable guardrails"

Project Glasswing, announced in 2026, is illuminating regarding Anthropic's position in the face of this threat. The company is committing up to $100 million in usage credits for Mythos Preview, plus $4 million in direct donations to open-source security organizations.

But the statement accompanying this project is striking: Anthropic explicitly states that no company — including itself — has yet developed reliable guardrails to prevent the malicious use of models with Mythos-level capabilities. This is a rare admission from an AI company about the limitations of its own security systems.

The Mythos Paradox

Mythos Preview perfectly illustrates the AI security paradox in 2026. On the one hand, it discovers more than 10,000 vulnerabilities (PYMNTS, 2026), which is an immense benefit to the security community. On the other hand, these same capabilities, in the wrong hands, are a weapon of mass destruction for cybersecurity.

The stateful monitoring paper must be read in this context: it is an attempt to build the guardrails that are missing, precisely because Mythos's capabilities make distributed monitoring critical. Without a system capable of seeing across sessions, an attacker could use multiple accounts to exploit the vulnerability discovery capabilities in an indirect manner.

Implications for companies deploying agents

You are probably vulnerable without knowing it

Any company deploying AI agents in production is potentially exposed, whether the agents run as SaaS, internally, or as open-source agents served locally with Ollama. The vulnerability is not in your code. It is in your monitoring architecture.

If your security system evaluates each agent conversation in isolation, a patient attacker can easily bypass it. They simply need to create multiple accounts and fragment their task. It's low-tech, it doesn't require bypassing sophisticated filters, and it works against almost all current deployments.

The most exposed sectors

Companies in regulated sectors (finance, healthcare, energy) are the primary targets. Kiteworks emphasizes that the 2026 compliance requirements (DORA in Europe, new US regulations) impose enhanced monitoring of AI systems. However, stateless monitors do not meet these requirements in the face of distributed attacks.

Agent hosting platforms (such as Hugging Face Spaces and agentic service providers) are particularly affected: they have thousands of user accounts and cannot manually analyze transcripts. The automated clustering proposed by Anthropic is directly applicable to their context.

What you need to do now

First, audit your security architecture. Ask the simple question: does your monitor have memory between sessions? If the answer is no, you have a documented blind spot.

Second, consider complementary defense layers. Stateful monitoring is a detection layer. Enhanced agent training is a prevention layer. Approaches like SkillOpt for self-evolving agents or inter-session learning mechanisms like Anthropic Dreaming show that agent resilience can also come from within.

Third, if you deploy on cloud infrastructure, make sure your host provides appropriate security guarantees, whether you use a solution like Hostinger for lightweight deployments or a dedicated platform for critical agentic workloads. In all cases, host security does not compensate for the lack of application-level monitoring.

Agentic security in June 2026: an ecosystem under construction

The Adversa AI roundup

The June 2026 Adversa AI roundup positions Anthropic's paper within a rapidly accelerating research ecosystem. Agentic security is no longer an academic niche: it is a field with its own conferences, benchmarks, and open-source tools.

The stateful monitoring paper fits into a clear trend: moving from prompt-level security to system-level security. Early work focused on prompt injection, jailbreaking, and exfiltration. The work of 2026 focuses on attacks that span long timeframes and multiple contexts.

Lessons from the 2026 breaches

The Beam.ai report on AI agent security breaches in 2026 draws concrete lessons from real incidents. A recurring pattern: attackers are no longer trying to trick a single agent. They are using the multiplicity of agents and sessions as an attack vector in itself.

This is exactly what the Anthropic paper models and detects. The breaches documented in 2026 confirm that the distributed attack is not a theoretical scenario but an active threat.

The interaction between capabilities and defenses

There is an inevitable arms race dynamic. As models like GPT-5.5 (98.2 on the agentic benchmark) or Claude Opus 4.7 Adaptive (94.3) become more capable, defenses must evolve at the same pace. The Cross-Context Monitor is a response to a specific level of capability: one where a model can plan and execute a multi-step attack, but where the attacker must still fragment to avoid per-session detection.

When models become even more capable, the fragmentation itself could become more subtle. Stateful monitoring is not a final solution. It is a necessary iteration in a race that will not stop.

❌ Common mistakes

Mistake 1: Thinking a good prompt guard is enough

What's wrong: Many teams invest heavily in prompt-level filters (refusing to generate malware, etc.) and think it's enough. The paper shows that these filters are trivially bypassed by distributing across multiple accounts. The solution: Add an inter-session monitoring layer. The prompt guard remains necessary but insufficient.

Mistake 2: Confusing rate limiting with security

What's wrong: Limiting the number of requests per account is a good infrastructure practice, but it is not a security measure. An attacker with 100 accounts has 100 times the rate limit. The solution: Rate limiting protects against resource abuse. Security requires behavioral analysis across accounts.
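The gap between the two views is easy to show with a toy count. The limit of 60 requests per account and the request log below are hypothetical; the sketch only contrasts a per-account check with a cross-account tally.

```python
from collections import Counter

PER_ACCOUNT_LIMIT = 60  # hypothetical requests allowed per account per window


def rate_limited_accounts(requests):
    """Return accounts that exceed the per-account limit within one window."""
    counts = Counter(account for account, _target in requests)
    return sorted(a for a, n in counts.items() if n > PER_ACCOUNT_LIMIT)


def requests_per_target(requests):
    """Cross-account view: total requests aimed at each target."""
    return Counter(target for _account, target in requests)


# 100 hypothetical accounts, each staying exactly at the limit, all on one target.
requests = [
    (f"acct-{i}", "example.org")
    for i in range(100)
    for _ in range(PER_ACCOUNT_LIMIT)
]

print(rate_limited_accounts(requests))               # nobody breaks the per-account rule
print(requests_per_target(requests)["example.org"])  # combined volume on the target
```

No account trips the limiter, yet the combined volume against the single target is 100 times what one account could send.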

Mistake 3: Ignoring weak signals

What's wrong: Some monitoring systems only trigger an alert for explicit and immediate threats. Distributed attacks almost never have a strong signal in an individual session. The solution: Implement systematic collection of weak signals and an aggregation mechanism, like the clustering proposed in the paper.

Mistake 4: Evaluating each session with a powerful LLM

What's wrong: Sending every transcript to Claude Opus 4.7 or GPT-5.5 for security analysis would be accurate, but prohibitively expensive at scale. The solution: Use a multi-stage pipeline with lightweight filtering first (clustering, heuristics) and only escalate to the LLM for ambiguous cases.

❓ Frequently Asked Questions

Is the Cross-Context Monitor deployed at Anthropic?

Anthropic does not specify in the paper whether this system is in production on its own platforms. The paper presents it as a research framework validated by simulations. However, the context of Project Glasswing and statements about the absence of reliable guardrails suggest that effective deployment remains an ongoing challenge.

Does this system protect against AI swarm attacks?

Partially. The Cross-Context Monitor detects distributed attacks where a human fragments a task across multiple accounts. AI swarms (as described by Kiteworks) involve autonomously coordinated agents, which is a slightly different pattern. The clustering approach is applicable, but the signals to detect differ.

Can a small agent provider implement this approach?

The three-layer architecture is conceptually accessible, but real-time clustering on high volumes requires non-trivial infrastructure. For small providers, the practical takeaway is mostly to stop treating per-session monitoring as sufficient and to explore cross-session security solutions, even simplified ones.

Which LLM should be used for the escalation step?

The paper does not prescribe a specific model. In practice, a model with good reasoning capabilities but a moderate cost would be suitable. Claude Sonnet 4.6 (81.4 on the agentic benchmark) or GPT-5.4 (87.6) could offer a good cost/performance balance for this aggregated context evaluation task.

Does this work against self-hosted open-source agents?

The paper applies to the context of a service provider that sees transcripts from many users passing through. For a self-hosted agent run locally with Ollama, the distributed threat looks different: the attacker controls the agent directly. Stateful monitoring is more relevant for multi-user platforms than for single-tenant deployments.

✅ Conclusion

Anthropic's Stateful Online Monitoring paper demonstrates that AI agent security in 2026 can no longer afford to examine one session at a time — attackers distribute their operations across multiple accounts, and stateless monitors are structurally blind to this pattern. The Cross-Context Monitor with clustering and selective escalation to an LLM is the first credible technical response to this problem. If you are deploying agents in production, reading the full paper on arXiv should be your next step.

#anthropic #stateful-online-monitoring #attaques-distributees-agents-ia #securite-intelligence-artificielle #cyberattaque-ia #moniteurs-securite-ia
