
Title: CAISI: the 5 US AI labs are now under federal evaluation before deployment

Skynet Watch 🟢 Beginner ⏱️ 13 min read 📅 2026-05-19


🔎 100% of US frontier AI is now under federal review, but is it enough?

On May 5, 2026, Google DeepMind, Microsoft, and xAI signed pre-deployment evaluation agreements with CAISI. This gesture carries massive symbolic weight: for the first time, the entirety of US frontier AI is granting early access to government evaluators before any public release.

The context makes this moment even more striking. As of May 2026, more than 1,200 AI bills are active across US states, according to BuildFastWithAI. The federal government, until now largely hands-off, is under pressure to structure a coherent response before regulation fragments state by state.

The question is no longer whether AI models will be evaluated before deployment. It is about understanding what these evaluations are truly worth, what they cover, and what they deliberately leave out.


The key points

  • On May 5, 2026, Google DeepMind, Microsoft, and xAI joined OpenAI and Anthropic in the CAISI pre-deployment evaluation program.
  • 100% of US frontier labs are now under voluntary federal review via the NIST CAISI program and the TRAINS Taskforce.
  • Evaluations focus on national security vulnerabilities, misuse risks, and unexpected behaviors — not on commercial compliance or general ethics.
  • This framework remains voluntary, non-legislative, and does not produce any public safety stamp on the evaluated models.

Tool | Main usage | Price (June 2025, check on site.com) | Ideal for
Hostinger | Web hosting for AI projects | Starting at 2.99 €/month | Deploying AI apps without infra management
Claude Opus 4.7 (Adaptive) | Advanced reasoning and agentic | Via Anthropic API | Complex tasks requiring reliability
GPT-5.5 | Autonomous agent and generation | Via OpenAI API | Multi-step agentic workflows
Gemini 3 Pro Deep Think | Deep analysis and reasoning | Via Google API | Long reasoning, benchmarks

What CAISI actually does — and what it doesn't do

CAISI (Center for AI Standards and Innovation) is a program attached to NIST. Its role: to organize targeted evaluations of frontier models before their public release.

In practice, labs grant early access to their unreleased models. Federal evaluators — primarily via the TRAINS Taskforce (Testing Risks of AI for National Security), convened in November 2024 — conduct probing tests. This involves detecting vulnerabilities, assessing risks of misuse for national security purposes, and observing unexpected behaviors.

This point is crucial and often misunderstood: CAISI evaluations are not certifications. Moyens.net makes this clear, pointing out that there is no public safety stamp issued at the end of these tests. The government does not approve a model. It tests it, period.

Arnav Gupta notes in his May 10, 2026 analysis that this policy shift marks a clear break from the hands-off approach observed a year earlier. NIST has moved from an observer role to that of a direct actor in the model development cycle.

What CAISI does not cover: discriminatory deployment biases, GDPR compliance, employment impacts, or the intellectual property of training data. The lens is strictly national security.


The 5 signatory labs — who controls what

The table below summarizes the situation as of May 5, 2026. All models mentioned are drawn from the list of generalist and agentic LLMs from June 2025.

Lab | CAISI signing date | Known frontier models (June 2025) | Max agentic score
OpenAI | 2024 | GPT-5.5 (98.2), GPT-5.4 Pro (91.8), GPT-5.4 (87.6) | 98.2
Anthropic | 2024 | Claude Opus 4.7 Adaptive (94.3), Claude Opus 4.6 (84.7) | 94.3
Google DeepMind | May 5, 2026 | Gemini 3 Pro Deep Think (95.4), Gemini 3.1 Pro (87.3) | 95.4
Microsoft | May 5, 2026 | (Depends on its OpenAI partnership + proprietary models) | (none listed)
xAI | May 5, 2026 | Grok 4.1 (79) | 79

The disparity in agentic scores is notable. GPT-5.5 clearly dominates with 98.2, while xAI's Grok 4.1 sits at 79. This raises an interesting question: are CAISI evaluations calibrated for low-agentic-capability models as much as they are for highly autonomous agents?

The probable answer is no. National security risks scale with model capability. A Grok 4.1 at 79 does not present the same threat profile as a GPT-5.5 capable of multi-agent orchestration, such as task delegation architectures built on sub-agents.
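As a rough check on the figures in the table above, the per-lab maximum and the top-to-bottom gap can be recomputed from the listed scores. This is only a sketch: the dictionary layout and names such as `LAB_SCORES`, `max_score_per_lab`, and `score_range` are illustrative assumptions, and Microsoft is omitted because the table lists no score for it.

```python
# Agentic scores as listed in the table above (June 2025 figures).
# The structure and names are illustrative; Microsoft is left out
# because the table gives no score for it.
LAB_SCORES = {
    "OpenAI": {"GPT-5.5": 98.2, "GPT-5.4 Pro": 91.8, "GPT-5.4": 87.6},
    "Anthropic": {"Claude Opus 4.7 Adaptive": 94.3, "Claude Opus 4.6": 84.7},
    "Google DeepMind": {"Gemini 3 Pro Deep Think": 95.4, "Gemini 3.1 Pro": 87.3},
    "xAI": {"Grok 4.1": 79.0},
}


def max_score_per_lab(lab_scores):
    """Return {lab: highest listed score} for each lab."""
    return {lab: max(models.values()) for lab, models in lab_scores.items()}


def score_range(per_lab_max):
    """Gap between the highest and lowest lab maximum, rounded to 0.1."""
    values = list(per_lab_max.values())
    return round(max(values) - min(values), 1)


per_lab = max_score_per_lab(LAB_SCORES)
gap = score_range(per_lab)

# Print labs from strongest to weakest listed maximum.
for lab, score in sorted(per_lab.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lab}: {score}")
print(f"Gap between highest and lowest lab maximum: {gap}")
```

With the scores as listed, the gap between OpenAI's maximum (98.2) and xAI's (79) comes out at 19.2 points.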


Why May 2026 — the timing is no coincidence

May 2026 is described by BuildFastWithAI as one of the busiest fortnights in AI history. The CAISI announcement is part of a sequence of regulatory moves of unprecedented intensity.

The main trigger: legislative pressure from the states. With more than 1,200 active AI bills at the state level, the federal government risked being completely bypassed. Each state was developing its own definition of a dangerous model and its own testing requirements.

This level of fragmentation would have been an operational nightmare for the labs. Imagine Google DeepMind having to go through 50 different reviews before deploying Gemini 3.1 Pro in the United States. The CAISI program offers a centralized alternative which, if it doesn't pre-empt all state laws, at least creates a reference standard.

The other factor: the White House wants to verify AI models before their release. This political reversal, documented separately, translated into a clear signal sent to the labs: negotiate a voluntary framework now, or suffer an imposed framework later.

The three labs that signed in May 2026 therefore likely acted less out of conviction than out of strategic calculation. A negotiated voluntary agreement is generally preferable to a non-negotiated legal obligation.


What evaluations reveal about the state of the frontier

The fact that evaluations focus on unreleased models tells us something important about the pace of development. The June 2025 models — GPT-5.5, Claude Opus 4.7, Gemini 3 Pro Deep Think — are likely already surpassed internally.

The agentic scores from June 2025 show an extremely concentrated frontier. The top three models (GPT-5.5, Gemini 3 Pro Deep Think, Claude Opus 4.7) are within 3.9 points of each other. Below them, GPT-5.4 Pro drops to 91.8, which is 2.5 points behind third place, followed by o1-preview at 90.2. The gap between the top 3 and the rest is noticeable, though narrower than the spread inside the top 3.
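The concentration figures can be checked with a few lines of arithmetic. The sketch below uses only the scores quoted in this article; the variable names are illustrative assumptions.

```python
# Scores quoted in this article (June 2025).
scores = {
    "GPT-5.5": 98.2,
    "Gemini 3 Pro Deep Think": 95.4,
    "Claude Opus 4.7": 94.3,
    "GPT-5.4 Pro": 91.8,
    "o1-preview": 90.2,
}

# Rank models from highest to lowest score, then split off the top three.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
top3, rest = ranked[:3], ranked[3:]

# Spread inside the top three: best score minus third-best score.
top3_spread = round(top3[0][1] - top3[-1][1], 1)
# Distance from third place to the first model outside the top three.
gap_to_fourth = round(top3[-1][1] - rest[0][1], 1)

print(f"Top-3 spread: {top3_spread}")          # 3.9
print(f"Gap from 3rd to 4th: {gap_to_fourth}")  # 2.5
```

Swapping in other scores or changing the slice size shows how sensitive the top-tier reading is to where the cutoff is drawn.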

For CAISI, this means that national security evaluations are effectively focused on a very small number of models. Dangerous capabilities — extended autonomy, complex planning, access to external tools — are primarily driven by this trio.

The TRAINS program was designed for this reality: few models, but with a potentially disproportionate impact. Evaluation resources are scarce. Concentrating them on the 2-3 most capable models in each release cycle is the only viable approach.


The structural limits of the CAISI system

Voluntary nature is a flaw, not a strength

CAISI is presented as a success because 100% of the American frontier participates in it. But this 100% relies on a voluntary commitment. Nothing prevents a lab from withdrawing.

ITIF (Information Technology and Innovation Foundation) published a counterpoint on May 11, 2026, arguing that a pre-approval regime risks politicizing AI development. According to ITIF, delays based on shifting political judgments could slow down innovation without concretely improving safety.

This is not an absurd argument. A voluntary framework without a legal basis means that evaluation criteria can change from one administration to the next. What is considered an acceptable risk in 2026 could become unacceptable in 2028, without any transparency regarding the underlying reasoning.

The absence of public transparency

No CAISI evaluation report is made public. We know that a model was tested, not what was found in it. This opacity serves two purposes: protecting discovered vulnerabilities (legitimate for national security) and shielding labs from bad press (much less legitimate).

The net result: the public must trust that the process exists without being able to evaluate its effectiveness. It is a considerable gamble on the institutional credibility of NIST at a time when trust in US federal institutions is historically low.

The national security perimeter is too narrow

The most probable AI risks are not necessarily national security risks. A model that generates false medical content at scale, that amplifies discriminatory biases in credit decisions, or that destroys labor markets — none of these scenarios are covered by CAISI.

The national security framing has the advantage of being politically consensual and legally solid. But it creates a massive blind spot regarding systemic civilian risks.


The geopolitics behind the signatures

The CAISI agreements must also be read through the prism of international competition. When the US government gets early access to models from Google DeepMind, Microsoft and xAI, it is not just engaging in regulation. It is conducting technological intelligence.

This aspect is rarely discussed openly, but it is central. The evaluations give the government a fine-grained understanding of the state of the art before it becomes public. This informs investment decisions, technological diplomacy, and defense.

The dynamic is reminiscent of what we observe with geographic access restrictions. When Anthropic denies China access to certain models, the logic is similar: controlling the spread of technological capability. CAISI internalizes this logic at the national level: the government sees before everyone else.

For the labs, this exchange is implicitly transactional. In exchange for early access, they get a favorable regulatory narrative and potentially protection against more aggressive state regimes.


Comparison with other regulatory frameworks

CAISI is not the only pre-deployment evaluation framework in the world. But it is the only one that covers 100% of a country's frontier.

Framework | Country | Mandatory? | Scope | Transparency
CAISI / TRAINS | United States | Voluntary | National security | None (non-public reports)
EU AI Act | European Union | Yes (systemic models) | Broad (fundamental rights, safety) | Moderate (transparency obligations)
AI Security Institute (UK, formerly AI Safety Institute) | United Kingdom | Voluntary | Overall safety | Partial (summary reports)

The EU AI Act imposes legal obligations for systemic risk models, with a much broader scope than just national security. But it obviously does not directly cover American labs — only their deployments in Europe.

CAISI has the advantage of depth of access (unreleased models) but the disadvantage of a narrow scope and the absence of an enforcement mechanism.


Impact on the model release cycle

A practical question: do CAISI evaluations delay launches? The official answer is no — labs integrate the process into their development timeline. The realistic answer is more nuanced.

Evaluations require early access, which implies that the model must be in a sufficiently stable state to be tested. This creates a freeze point in the development pipeline. For a model like GPT-5.5 with an agentic score of 98.2, national security testing is likely complex and time-consuming.

If CAISI identifies a critical vulnerability, the lab is technically free to fix it or not before release. But the political and media pressure to fix it would be immense. In effect, the program creates an informal delay mechanism even without formal blocking power.

ITIF raises a valid point here: this informal mechanism is precisely what makes the system vulnerable to politicization. If a vulnerability is discovered but fixing it takes 3 additional months, who decides whether the delay is justified?


Non-US labs — the great absence from the debate

CAISI covers 100% of the US frontier. But the global frontier also includes DeepSeek (China), Moonshot AI (China), and Z.AI (China). These labs are obviously not subject to CAISI.

DeepSeek V4 Pro (Max) reached an overall score of 88 in June 2025, and Moonshot AI's Kimi K2.6 reached 88.1 on the agentic score (self-hosted). These are not marginal models.

CAISI therefore creates an asymmetry: US models undergo a federal evaluation process that potentially slows down their cycle, while Chinese models are deployed without an equivalent constraint.

This asymmetry is at the heart of the political debate. On the one hand, CAISI supporters argue that trust in US models is a competitive advantage — if a model has gone through CAISI, foreign businesses and governments can adopt it with greater confidence. On the other hand, critics point out that China will not wait for the United States to finish its evaluations.


❌ Common mistakes

Mistake 1: Confusing CAISI evaluation and security certification

The most widespread mistake is presenting the CAISI agreements as a form of security label. This is not the case. The evaluations are targeted tests, not a comprehensive audit. No stamp is issued, no result is public. The model is not "approved" — it has been "tested".

The correction: always specify that CAISI is a governmental testing mechanism, not a certification regime.

Mistake 2: Thinking that CAISI covers all AI risks

CAISI is strictly focused on national security. Risks of bias, civilian disinformation, environmental impact, copyright — all of this is out of scope. Presenting CAISI as a global safety net for AI is misleading.

The correction: systematically qualify the scope ("national security only") when mentioning the evaluations.

Mistake 3: Equating the voluntary nature with an absence of pressure

Saying that the labs "chose" to sign gives the impression of a purely voluntary act. In reality, the political pressure was considerable. Between the 1,200 state bills and the signal from the White House, the choice was between a negotiated framework and an imposed framework.

The correction: use "voluntary but under political pressure" rather than simply "voluntary".


❓ Frequently Asked Questions

Can CAISI prevent a model from being released?

No. The framework is voluntary and does not grant the government any veto power. Labs can theoretically deploy a model even if vulnerabilities are identified. Political and media pressure serves as an informal delay mechanism.

Are the evaluation results public?

No. CAISI evaluation reports are not made public. Only the existence of the agreement is known. This opacity aims to protect discovered vulnerabilities, but it prevents any external evaluation of the program's effectiveness.

Are non-US labs affected?

No. CAISI only covers labs that have signed on voluntarily. Models from DeepSeek, Moonshot AI, and Z.AI are not subject to these evaluations, which creates a competitive asymmetry with the US frontier.

What is the connection between CAISI and the TRAINS Taskforce?

The TRAINS Taskforce (Testing Risks of AI for National Security), convened in November 2024, is the operational arm of the evaluations. CAISI is the institutional framework that houses the agreements. TRAINS executes the tests in the field.

Does the EU AI Act make CAISI redundant for European deployments?

No, because the scopes differ. The EU AI Act imposes broad legal obligations (fundamental rights, transparency) for deployments in Europe. CAISI remains the sole mechanism for accessing unreleased models for national security testing, regardless of the deployment location.


✅ Conclusion

CAISI marks a turning point: for the first time, the US federal government has its eyes on the development pipeline of 100% of the nation's frontier AI. But early access without transparency, without veto power, and without a scope beyond national security remains an incomplete framework. The real question isn't whether the labs will sign — it's whether what they sign changes anything for the end user. For now, the answer is: not yet.