Red teaming AI agents: from several weeks to a few hours

Agents IA 🟢 Beginner ⏱️ 16 min read 📅 2026-05-09

Red teaming AI agents: from several weeks to a few hours

🔎 AI agents decide on their own — and no one is really testing them

AI agents no longer just answer questions. They plan, execute, iterate, correct their own mistakes. In healthcare, an agent can prescribe a protocol. In finance, it triggers orders. In defense, it recommends strikes. And the problem is simple: no one really knows how to test them reliably.

A study published on arXiv in May 2026 (Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours) documents an alarming fact. Security teams spend several weeks manually building test workflows to evaluate a single agent system. When the results are insufficient, everything has to be redone. It's a bottleneck that makes agent security practically impossible at scale.

The paper introduces a red teaming agent built on the open-source Dreadnode SDK. The measured gain: a 100x acceleration, going from weeks to a few hours to set up a complete campaign. The stakes go beyond technical performance. It's a matter of timing: agents are arriving in production now, and security tools haven't kept up.

The key takeaways

Agent systems are significantly more vulnerable than traditional LLMs because each autonomous step opens a new attack surface.
Current manual red teaming takes several weeks per campaign, which is incompatible with the aggressive deployment cycles of agents.
The Dreadnode SDK offers an automated red teaming agent with 45+ adversarial attacks, 450+ transforms, and 130+ scorers.
The 100x acceleration makes systematic testing of agents deployed in critical domains (healthcare, finance, defense) feasible.
The approach is open-source, and therefore auditable and extensible — a crucial point for regulated organizations.

Recommended tools

Tool	Main use case	Price (June 2025, check official website)	Ideal for
Dreadnode SDK	Automated red teaming of AI agents	Open-source (free)	Advanced security teams
OpenClaw	Autonomous AI agent for testing and automation	Open-source	Prototyping robust agents
Ollama	Running AI agents locally	Open-source (free)	Isolated, air-gapped testing
Hostinger	Hosting for deploying security dashboards	Starting at €2.99/month	Small teams, MVP

Why AI agents are more dangerous than traditional LLMs

A traditional LLM is predictable in its dangerousness. You ask a question, it answers. The attack surface is confined to the input (the prompt) and the output (the response). An agent is something else entirely.

An agent chains steps together: it perceives an environment, plans a sequence of actions, executes, observes the result, adjusts. Every loop is a potential point of failure. The May 2026 study points this out: multi-step, multimodal, and multilingual agents create radically new attack surfaces that testing methods designed for LLMs do not cover.

Take a autonomous trading agent based on GPT-5.5. It analyzes markets, generates a signal, executes an order, monitors the execution, adjusts its position. A subtle prompt injection at the analysis stage can trigger a cascade of erratic actions over the next 4 steps. Classic red teaming tests step 1. The attack exploits step 3.

The fundamental difference is feedback. Agents learn from their own actions in real time. An adversary doesn't need to crack the model once and for all — they just need to inject a bias at a key moment for the agent to amplify it itself. This is exactly the type of vector that manual red teaming approaches cannot systematically捕捉r.

The paper also notes that collaborative multi-agent systems further aggravate the problem. When multiple agents communicate with each other, an injection into a single agent can propagate to the entire network. Red teaming must then test not a model, but a dynamic system with changing internal states.

The nightmare of current manual red teaming

The study describes a process that any AI security team will recognize, probably with a shudder. The typical red teaming workflow for an agent looks like this.

First, the operator chooses an attack library. Then, they manually assemble a pipeline: an initial attack, one or more transformations to mutate the payload, a scorer to evaluate whether the agent gave in. Then they launch it, observe the results, adjust. Insufficient results? They dismantle the pipeline and build another one.

This process takes several weeks for a single targeted campaign. Not to test the whole system — to test one specific vector. Multiply that by the number of possible adversarial scenarios, and you understand why most agents deployed in production have never been seriously tested.

The study identifies three structural problems in the manual approach. First problem: dependency on specific libraries. Every red teaming tool has its own format, its own primitives. Switching from one tool to another means rewriting the entire workflow. Second problem: the absence of automated exploration. The operator must decide a priori which attacks to combine, instead of letting the system discover the most effective combinations. Third problem: the human cost. The best security engineers spend weeks doing pipelining work instead of analyzing vulnerabilities.

It's a model that might have worked when LLMs were research products. But with agents making decisions in hospitals and banks, it's a catastrophically inadequate model.

The Dreadnode approach: an agent that tests agents

The paper's central proposal is elegant in its simplicity: replace the operator's manual work with a red teaming agent that builds the workflows itself.

The system is built on the Dreadnode SDK, an open-source framework. The agent has access to a massive library of primitives: over 45 adversarial attacks, over 450 transformations (the "transforms" that mutate and combine payloads), and over 130 scorers (the metrics that evaluate whether the attack succeeded).

The operator no longer specifies how to build the pipeline. They specify what to test: "check if this financial agent can be manipulated into executing unauthorized orders via injections in market data feeds." The red teaming agent then explores the space of possible combinations — which attack, which transformation, which scorer — autonomously.

The result is a measured 100x acceleration. What took weeks now takes a few hours. But the number alone doesn't capture the real benefit. The red teaming agent discovers attack combinations that a human would never have considered. The 450 transforms aren't there for decoration — they create a combinatorial space that manual exploration cannot cover.

For teams building agents with agentic LLMs like GPT-5.5 or Claude Opus 4.7, this changes the game. You can now test your agent before deploying it, not six months after. The "build → test → fix → deploy" cycle becomes viable again.

The 3 protection layers of the Dreadnode SDK

The proposed architecture isn't just "an LLM that generates malicious prompts". It's a system structured in three layers that are worth understanding.

The attack layer: 45+ adversarial vectors

The attacks are not limited to classic prompt injection. The SDK covers sensitive data extraction attacks, context manipulations, omission attacks (making a constraint ignored), multilingual attacks (exploiting translation weaknesses), and multimodal attacks (malicious images or audio). For an agent processing medical records in multiple languages with X-ray images, each vector is a potential entry point.

The transformation layer: 450+ mutations

This is where the combinatorial power lies. A transform takes an attack payload and modifies it — paraphrasing, inserting invisible characters, partial base64 encoding, language mixing, semantic perturbation. The red teaming agent chains transforms to create variants that defenses have never seen. A human would manually assemble 3-4 of them. The system tests hundreds per hour.

The scoring layer: 130+ success metrics

How do you know if an attack worked on an agent? It's not binary like for a traditional LLM. A scorer can measure whether the agent deviated from its task plan, whether it exposed internal data, whether it executed an out-of-scope action, or whether it entered a dangerous loop. The multiplicity of scorers makes it possible to detect partial failures — vulnerabilities that don't cause an immediate crash but progressively weaken the system.

Concrete targets: what the system can test

The paper identifies three types of targets that the approach covers and which represent the majority of current critical deployments.

Multi-agent systems

When multiple agents collaborate, the attack surface explodes. Agent A sends a message to agent B, which transmits it to agent C. An injection in the A→B message can be amplified by B before reaching C. Red teaming must test the entire chain, not an isolated agent. This is exactly the type of scenario where patterns of collaborative agents become risk vectors.

Multilingual targets

Models like GPT-5.5 and Gemini 3 Pro Deep Think are multilingual by design. But their defenses are not uniform from one language to another. An attack that fails in English can succeed in Japanese or Arabic because the safeguards are less robust in these languages. With 450 transforms including linguistic mutations, the system systematically explores these asymmetries.

Multimodal targets

Agents that process images, audio, and text simultaneously open up cross-modal attack vectors. An apparently innocent image combined with a text prompt can bypass filters that each modality would have blocked separately. For agents deployed in medical diagnosis or surveillance analysis, this is a risk that no one can ignore.

Implications for production deployments

The study arrives at a critical moment. AI agents are transitioning from technological demonstration to operational deployment in domains where an error can kill.

In healthcare, agents based on Claude Opus 4.7 or GPT-5.4 Pro are starting to assist doctors in triage and therapeutic recommendation. A manipulated agent could recommend a contraindicated treatment. Red teaming this type of system must test not only direct responses but the chains of reasoning — what the Dreadnode SDK enables via its specialized scorers.

In finance, autonomous agents execute trading strategies with real money. A prompt injection via a manipulated news feed could trigger a series of catastrophic orders. A red teaming speed of 100x means a hedge fund can test its agent before every model update, not just once a quarter.

In defense, decision support systems are the most sensitive. The study explicitly mentions this domain as critical. An agent that recommends military actions based on sensor data must be tested against adversaries who know exactly how to manipulate this data. The fact that the Dreadnode SDK is open-source is a major asset here: government agencies can fully audit it, unlike proprietary solutions.

The race toward humanoid robots and autonomous physical systems adds an additional dimension. An agent that controls a physical body no longer has just words to lose — it has movements to lose. Red teaming must then integrate physical safety, and testing speed literally becomes a matter of public safety.

How to integrate red teaming into your agent pipeline

The paper's promise is appealing, but how does a team concretely integrate it? Here is a pragmatic framework.

First, test before configuring. Before configuring the skills and personalities of your agent, define the red teaming scenarios. What are the actions the agent must never take? What data must never leak? These constraints become the scorers for your campaign.

Second, run locally first. The Dreadnode SDK and models like Kimi K2.6 or GLM-5 (Reasoning) in self-host allow you to run agents locally for red teaming, without exposing your targets to the outside. This is non-negotiable for sensitive data.

Third, automate the cycle. Red teaming should not be a one-off event. Integrate it into your CI/CD. With every prompt, tool, or underlying model update, re-run the campaigns. This is only possible if the cycle takes hours, not weeks — hence the importance of the 100x acceleration.

Fourth, test interactions between agents. If you use a multi-agent architecture, do not test each agent in isolation. Test inter-agent messages, task delegations, and priority conflicts. This is where the most surprising vulnerabilities hide.

Limitations and what the paper does not solve

The approach is impressive but not magical. Several limitations deserve discussion.

Coverage is not total. 45+ attacks is a lot, but the space of possible attacks is infinite. The system finds vulnerabilities in the space it explores. For what it doesn't find, we don't know if it doesn't exist or if the system didn't look in the right place. This is the classic problem of any security tool: the absence of detected vulnerabilities does not prove the absence of vulnerabilities.

Scorers remain subjective. Defining what constitutes "dangerous behavior" for an agent is not trivial. An agent that refuses to act (over-refusal) is just as much of a problem as an agent that acts in an uncontrolled manner. The 130+ scorers cover many cases, but the threshold between "robust" and "too cautious" depends on the use case.

The approach does not replace human auditing. The paper states this clearly: the red teaming agent accelerates the operator's work, it does not replace it. The operator must interpret the results, decide on corrective actions, and assess the residual risk. It is a force multiplication tool, not a substitution tool.

Finally, the paper does not explicitly address the security of the underlying models. If you use GPT-5.4 Pro via API, you also depend on OpenAI's safeguards. Red teaming your agent does not cover the vulnerabilities of the model itself. This is why some teams opt for open-source LLMs in self-host where they control the entire chain.

The agentic model landscape facing red teaming

Not all models defend themselves equally against attacks. The agentic scores from June 2025 give a hint, but the reality of red teaming is more nuanced.

Model	Agentic Score	Security Strength	Known Weakness
GPT-5.5 (OpenAI)	98.2	Mature safeguards, advanced RLHF	Secondary multilingual vulnerabilities
Gemini 3 Pro Deep Think (Google)	95.4	Deep reasoning, manipulation detection	High latency complicates real-time testing
Claude Opus 4.7 Adaptive (Anthropic)	94.3	Constitutional AI, nuanced refusal	Over-refusal on certain legitimate scenarios
GPT-5.4 Pro (OpenAI)	91.8	Good performance/security balance	Less robust on multimodal attacks
Kimi K2.6 Self-host	88.1	Total local control	Less mature safeguards than US models
GLM-5 Reasoning Self-host	82.0	Fully auditable	More restricted training corpus

An important point: agentic scores measure the ability to act, not the resistance to attacks. A model that scores 98 can be more vulnerable than a model that scores 82 if its action capabilities exceed its safeguards. This is exactly the paradox that red teaming must resolve.

❌ Common mistakes

Mistake 1: Confusing LLM evaluation and agent red teaming

A SWE-bench or HumanEval benchmark measures whether the model codes well. Red teaming measures whether the agent can be diverted from its objective. These are orthogonal metrics. An agent that succeeds at 99% of legitimate tasks but yields to 1% of adversarial attacks is an agent that cannot be deployed in critical production.

The solution: separate your performance metrics from your security metrics. The Dreadnode SDK helps with the second category, not the first.

Mistake 2: Testing only the initial prompt

The most frequent mistake in agent red teaming is focusing on the initial system prompt. But an agent receives inputs throughout its execution: search results, API returns, messages from other agents. Each of these entry points is an injection vector.

The solution: test every step of the agent workflow, not just the starting point. This is what the 45+ attacks coupled with the SDK's 450+ transforms enable.

Mistake 3: Considering red teaming as a one-shot

Many teams do a red teaming before launch and then never return to it. But every prompt update, every new tool added, every change in the underlying model can introduce new vulnerabilities.

The solution: integrate red teaming into your CI/CD pipeline. The 100x acceleration makes this economically viable for the first time.

Mistake 4: Ignoring multilingual attacks

If your agent operates in French but the underlying model was trained mostly in English, defenses can be asymmetrical. An attack in Chinese or Arabic can bypass filters that work in French.

The solution: explicitly include multilingual scenarios in your red teaming campaigns. The Dreadnode SDK's 450+ transforms include linguistic mutations for this reason.

❓ Frequently Asked Questions

Does the Dreadnode SDK replace existing red teaming tools?

No, it complements them. The SDK provides an agent that orchestrates attacks by combining primitives. You can still use specialized tools for specific vectors, but the SDK automates the assembly and exploration of workflows.

Does it work with agents built on open-source LLMs like Kimi K2.6 or GLM-5?

Yes. The SDK is model-agnostic. It tests the agent through its interface (API, CLI, etc.), not by accessing the model weights. You can red team an agent based on any LLM pour agents, whether it is proprietary or self-hosted.

What level of technical expertise is required to use it?

The paper targets AI security operators, not beginners. You need to understand the concepts of adversarial attacks, prompt injection, and scoring. However, the automation drastically reduces the need for pipeline engineering expertise—that is precisely the point.

Is the approach applicable to agents with graphical interfaces (computer use)?

The paper mentions multimodal targets, which suggests partial coverage. Agents that interact with graphical interfaces (like Claude with Computer Use) add a layer of complexity that the SDK does not explicitly address in this version.

Can you red team a multi-agent system in production without disrupting it?

It's tricky. Red teaming involves sending potentially disruptive inputs. In production, you either need to use a mirrored environment or design attacks to be observable without being executed ("dry run" mode). This is an engineering challenge that the paper does not solve directly.

✅ Conclusion

The red teaming of AI agents has just moved from craftsmanship to industrialization. The May 2026 study shows that it is now possible to systematically test agent systems in a few hours instead of weeks—and above all, to discover vulnerabilities that no human would have envisaged. If you are deploying agents in critical domains, intégrer un red teaming automatisé n'est plus une option, it is a duty of diligence. Agents make decisions on their own: the least we can do is test them on their own as well.

#securite-ia #agents-ia #red-teaming #tests-ia #intelligence-artificielle-autonome

📚 Related articles

Agents IA 🟢 Débutant 16 min

Qwen-AgentWorld : when an LLM simulates the world to train autonomous agents — the new frontier of language world modeling

Discover Alibaba's Qwen-AgentWorld: a revolutionary LLM that simulates the world to train autonomous agents. The new frontier of language world mo

2026-06-30 17:05

Agents IA 🟢 Débutant 13 min

Agentic Resource Discovery: the open standard that will unify AI agents

Discover Agentic Resource Discovery, the new open standard from Google and Microsoft designed to unify AI agents and automate their tool discove

2026-06-27 15:05

Agents IA 🟢 Débutant 11 min

Google launches the Interactions API in general availability: the new default interface for building Gemini agents (and generateContent retires)

Google launches Interactions API to GA. Discover the new default interface for your Gemini agents and the end of generateContent.

2026-06-24 17:03

📑 Table of contents