LLMSurgeon: this ACL 2026 paper opens the black box of LLM pre-training

LLM & Models 🟢 Beginner ⏱️ 14 min read 📅 2026-05-29


🔎 AI's best-kept secret is leaking

Every language model carries within it an invisible signature: the data mixture it was trained on. This is what makes one model excel at code, another stumble at logical reasoning, and a third reproduce specific cultural biases. The problem? This composition is almost systematically kept secret by labs.

LLMSurgeon, a paper accepted at ACL 2026 (Main Conference), changes the game. Yaxin Luo's team demonstrates that it is possible to recover the pre-training mixture of any LLM from its generated text only. No access to weights. No internal leak. Just text.

This breakthrough opens up a huge field for auditing, transparency, and fundamental understanding of what our models truly know — and what they don't.

The essentials

LLMSurgeon is a post-hoc framework that reconstructs the pre-training data mixture of an LLM without access to its weights or its training data.
The method combines a calibrated domain classifier and a label-shift correction to compensate for the systematic biases of the classifier.
The paper introduces LLMScan, a reference benchmark evaluating 8 open-source LLMs, with results that confirm the reliability of the approach.
It is a major breakthrough for understanding LLM biases and behavior, as the data mixture directly impacts the quality of outputs in each domain.

Tools and resources

Resource	Main usage	Access	Ideal for
LLMSurgeon (arXiv)	Read the full paper	Free	Researchers, AI engineers
LLMSurgeon (GitHub)	Source code, pipeline, documentation	Open-source	Practical implementation
LLMSurgeon (PDF)	Detailed figures, experimental results	Free	In-depth analysis

The problem: why we don't know the data mixture

The short answer: labs have no commercial interest in disclosing it.

The pre-training data mixture is considered a competitive advantage. When DeepSeek, Google, or Anthropic train a model, the exact proportion of code, scientific texts, web data, books, multilingual data — all of this remains confidential.

Yet, this composition is the model's digital DNA. It determines its strengths, its weaknesses, its biases. A model trained with 40% Python code will have a radically different skill profile than a model trained with 5% code.

The technical challenge is real: how do you audit this mixture when you only have access to the API or the model's text outputs? This is exactly the question that LLMSurgeon sets out to answer.

What the existing literature proposed — and why it wasn't enough

Before LLMSurgeon, existing approaches were divided into two categories, both limited.

Weight-based methods required internal access to the model — impossible for proprietary models and even for many open-source models whose exact data is unknown. Text-output-based methods, on the other hand, used naive classifiers that systematically confused "what the model knows how to generate" with "what it was trained on".

This bias is fundamental. A model trained primarily on web data can still generate quality code because code is present in web data. A naive classifier will then overestimate the proportion of code in the pre-training.

LLMSurgeon precisely corrects this bias.

How LLMSurgeon works: the pipeline explained

LLMSurgeon relies on an elegant idea: using the text generated by an LLM as a fingerprint, then applying a mathematical correction to recover the actual distribution of the training data.

The pipeline consists of three key steps, detailed in the project's GitHub documentation.

Step 1: Train a proxy classifier on labeled reference data

We start by building a reference dataset covering the suspected domains: code, scientific, web, literature, mathematics, etc. Each text is labeled with its domain of origin.

A classifier is trained on this data to predict the domain of a given text. This classifier is nothing revolutionary in itself — it's a standard NLP tool.

The subtlety is that we don't trust it blindly. Its role is to produce raw predictions that will be corrected later.
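As a rough illustration of what such a proxy classifier can look like, here is a minimal sketch in plain Python: a tiny multinomial Naive Bayes model with Laplace smoothing. It is not the classifier used in the paper. The function names train_naive_bayes and predict_proba, the three domains, and the toy sentences are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (text, domain) pairs. Returns a simple model dict."""
    word_counts = defaultdict(Counter)  # domain -> word frequencies
    doc_counts = Counter()              # domain -> number of documents
    vocab = set()
    for text, domain in samples:
        tokens = text.lower().split()
        word_counts[domain].update(tokens)
        doc_counts[domain] += 1
        vocab.update(tokens)
    return {"words": word_counts, "docs": doc_counts, "vocab": vocab}

def predict_proba(model, text):
    """Return {domain: probability} from Laplace-smoothed log-likelihoods."""
    tokens = text.lower().split()
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for domain, n_docs in model["docs"].items():
        counts = model["words"][domain]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior of the domain
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        log_scores[domain] = score
    # Softmax over log scores, shifted by the max for numerical stability.
    top = max(log_scores.values())
    exp_scores = {d: math.exp(s - top) for d, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {d: v / norm for d, v in exp_scores.items()}

# Toy reference set (hypothetical, far too small for real use).
reference = [
    ("def add(a, b): return a + b", "code"),
    ("for i in range(10): print(i)", "code"),
    ("the theorem follows from the lemma by induction", "math"),
    ("let x be a prime number and consider the sum", "math"),
    ("click here to subscribe to our newsletter today", "web"),
    ("top ten travel tips for your next holiday", "web"),
]
model = train_naive_bayes(reference)
probs = predict_proba(model, "print the sum for i in range")
print(probs)
```

The point of returning a full probability vector rather than a single label is that the later correction steps work on these raw, possibly biased scores.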

Step 2: Estimate the calibrated confusion matrix

This is where LLMSurgeon radically differs from previous approaches. Instead of taking the classifier's predictions at face value, the framework estimates a calibrated soft confusion matrix.

Concretely: we systematically measure how the classifier makes mistakes. If it classifies 15% of "code" texts as "scientific", this bias is captured in the matrix. If it confuses "web" and "literature" in 8% of cases, that is also noted.

This matrix describes the systematic errors of the classifier, independently of the audited model. It is a measuring instrument whose bias is known — which makes it possible to correct it.
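One common way to build a soft confusion matrix is to average the classifier's predicted probability vectors over held-out texts of each true domain. The sketch below does exactly that, in plain Python. Whether the paper computes and calibrates its matrix this way is an assumption; the function name soft_confusion_matrix and the four held-out predictions are invented for illustration.

```python
def soft_confusion_matrix(true_labels, predicted_probs, domains):
    """Row i, column j: average probability given to domain j
    on held-out texts whose true domain is i."""
    sums = {d: [0.0] * len(domains) for d in domains}
    counts = {d: 0 for d in domains}
    for label, probs in zip(true_labels, predicted_probs):
        counts[label] += 1
        for j, d in enumerate(domains):
            sums[label][j] += probs[d]
    for d in domains:
        if counts[d] == 0:
            raise ValueError(f"no held-out example for domain {d!r}")
    return [[s / counts[d] for s in sums[d]] for d in domains]

domains = ["code", "science", "web"]
# Hypothetical held-out labels and classifier outputs.
held_out_labels = ["code", "code", "science", "web"]
held_out_probs = [
    {"code": 0.8, "science": 0.2, "web": 0.0},
    {"code": 0.9, "science": 0.1, "web": 0.0},
    {"code": 0.1, "science": 0.7, "web": 0.2},
    {"code": 0.0, "science": 0.1, "web": 0.9},
]
matrix = soft_confusion_matrix(held_out_labels, held_out_probs, domains)
for name, row in zip(domains, matrix):
    print(name, [round(v, 2) for v in row])
```

In this toy data the "code" row puts about 15% of its mass on "science", which is the kind of systematic leak the matrix is meant to record.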

Step 3: Solve the inverse problem under the label-shift assumption

The final step, the most mathematical one. We generate a large volume of text with the target LLM. The classifier produces predictions on this text. These predictions are biased — we know this thanks to the confusion matrix.

LLMSurgeon then formulates an inverse problem: knowing the biased predictions and the confusion matrix, what is the latent distribution (the true data mixture) that produced these predictions?
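In standard label-shift notation, this inverse problem can be written as follows. The symbols are assumptions chosen for this article, not taken from the paper: K is the number of domains, \pi the unknown mixture, C the confusion matrix with C_{ij} the probability that a text from domain i is predicted as domain j, and q the predicted-label frequencies observed on the generated text. The least-squares objective is one common choice; the paper's exact objective may differ.

```latex
q_j = \sum_{i=1}^{K} \pi_i \, C_{ij}, \qquad j = 1, \dots, K

\hat{\pi} = \arg\min_{\pi \in \Delta^{K-1}} \left\lVert C^{\top} \pi - q \right\rVert_2^2,
\qquad
\Delta^{K-1} = \Bigl\{ \pi \in \mathbb{R}^{K} : \pi_i \ge 0, \ \sum_{i=1}^{K} \pi_i = 1 \Bigr\}
```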

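This is an estimation problem under the label-shift assumption, a setting well studied in domain adaptation: the share of each domain may change between the reference data and the audited text, but text within a given domain is assumed to look the same in both. Solving it yields an estimate of the proportion of each domain in the pre-training data.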
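To make the correction concrete, here is a minimal sketch in plain Python. It uses an EM-style multiplicative update, a standard technique for estimating mixture proportions when the confusion matrix is known, and which keeps the estimate non-negative and summing to one. It is a swapped-in solver for illustration, not necessarily the one used in the paper. The function name estimate_mixture, the matrix values, and the "true" mixture are all hypothetical.

```python
def estimate_mixture(confusion, observed, iterations=1000):
    """confusion[i][j]: probability that a text from domain i is predicted as j.
    observed[j]: share of generated texts predicted as domain j.
    Returns the estimated share of each true domain."""
    k = len(confusion)
    mixture = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iterations):
        # Predicted-label shares implied by the current guess.
        implied = [sum(mixture[i] * confusion[i][j] for i in range(k))
                   for j in range(k)]
        # Reweight each domain by how well it explains the observed shares.
        mixture = [
            mixture[i] * sum(confusion[i][j] * observed[j] / implied[j]
                             for j in range(k) if implied[j] > 0)
            for i in range(k)
        ]
    return mixture

domains = ["code", "science", "web"]
confusion = [
    [0.85, 0.15, 0.00],  # true code
    [0.10, 0.80, 0.10],  # true science
    [0.05, 0.05, 0.90],  # true web
]
true_mixture = [0.2, 0.3, 0.5]  # hidden in practice, known here to check the result

# What a naive reading of the classifier would report on the generated text.
observed = [sum(true_mixture[i] * confusion[i][j] for i in range(3))
            for j in range(3)]
estimate = estimate_mixture(confusion, observed)

for name, naive, corrected in zip(domains, observed, estimate):
    print(f"{name}: naive={naive:.3f} corrected={corrected:.3f}")
```

In this toy setup the naive reading puts code at 22.5% while the hidden share is 20%; the correction recovers the hidden mixture because the observed shares are exactly consistent with the matrix. With real, noisy data the recovery is only approximate.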

All of this is detailed in Figure 1 of the paper, which provides a clear overview of the "Data Mixture Surgery" pipeline.

LLMScan: the benchmark that proves it works

A theoretical framework without experimental validation is worthless. LLMSurgeon is accompanied by LLMScan, a benchmark built specifically to evaluate the reliability of mixture estimates.

LLMScan was applied to 8 open-source LLMs. The results, presented in the original paper on arXiv, show that LLMSurgeon manages to estimate domain proportions with significantly higher accuracy than existing methods.

Why this benchmark is credible

The credibility of LLMScan relies on a crucial point: for the evaluated open-source models, the authors were able to compare LLMSurgeon's estimates with the actual mixtures (known for these models). The average error is significantly reduced compared to naive classifiers.

This is the first time a post-hoc method demonstrates this level of accuracy on such a diverse set of models.

Lessons from LLMScan

The results reveal interesting patterns. For example, the proportion of code in pre-training data is systematically overestimated by naive methods, which confirms the theoretical bias identified by the authors. Similarly, "close" domains (web vs. literature, scientific vs. mathematics) are where label-shift correction brings the most value.

These results have direct implications for anyone comparing models. When we see that DeepSeek V4 Pro (Max) reaches 88 on the general benchmark or that Claude Opus 4.7 (Adaptive) reaches 94.3 in agentic, understanding the underlying mixture allows us to interpret these scores with more nuance.

What LLMSurgeon changes for the AI ecosystem

For transparency and auditing

LLMSurgeon gives researchers and regulators a concrete tool to audit models without relying on the goodwill of companies. This is a non-negligible shift in power.

A European regulator keen on enforcing the AI Act could use LLMSurgeon to verify that a model does not contain a problematic proportion of data from non-compliant sources. A researcher could audit a proprietary model to check whether its reasoning performance corresponds to genuine exposure to scientific data or to a mixture artifact.

For scientific reproducibility

Reproducibility is in crisis in AI. Research papers describe architectures and hyperparameters, but the data mixture — often the most determining factor — remains a black hole.

LLMSurgeon at least makes it possible to measure this variable in existing models, even if it cannot be exactly reproduced. This is a step forward for scientific methodology in the field.

This need for transparency echoes the concerns surrounding General Preference RL, a paper that unifies reinforcement learning and preference optimization for LLMs. Both papers share a common ambition: understanding what goes on inside the black box.

For teams choosing a model

If you have to choose between GPT-5.5 (98.2 in agentic, 91 in general), Gemini 3.1 Pro (92 in general), or Claude Opus 4.7 (90 in general, 94.3 in agentic), knowing the data mixture adds a crucial decision-making dimension.

A model with 35% code in its pre-training will naturally be more robust for development tasks, even if its raw score is slightly lower. A model with a high proportion of multilingual data will be better suited for international use cases. LLMSurgeon provides access to this information.

The current limitations of LLMSurgeon

Domain granularity

LLMSurgeon works well with broad categories (code, scientific, web, literature). But the more you refine the categories (for example, separating "Python code" from "JavaScript code", or "physics" from "biology"), the denser the confusion matrix becomes, and the more ill-conditioned the inverse problem becomes.

The authors are transparent about this limitation. The framework is designed for a medium level of granularity — fine enough to be useful, broad enough to remain reliable.
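The conditioning issue can be seen with simple arithmetic in a two-domain case, where the inversion has a closed form. The sketch below is an illustration under assumed numbers, not a result from the paper: hit_rate is the probability that a text from the domain of interest is recognized as such, false_alarm_rate the probability that a text from the other domain is mistaken for it. Both names and all values are hypothetical.

```python
def recover_share(observed_share, hit_rate, false_alarm_rate):
    """Invert observed = share * hit_rate + (1 - share) * false_alarm_rate."""
    return (observed_share - false_alarm_rate) / (hit_rate - false_alarm_rate)

def amplification(hit_rate, false_alarm_rate):
    """Factor by which an error in the observed share is magnified."""
    return 1.0 / (hit_rate - false_alarm_rate)

true_share = 0.30
measurement_error = 0.02  # assumed noise on the observed share

# Broad, well-separated domains versus fine-grained, easily confused ones.
for label, hit, false_alarm in [("broad", 0.9, 0.1), ("fine", 0.6, 0.4)]:
    observed = true_share * hit + (1 - true_share) * false_alarm
    noisy = recover_share(observed + measurement_error, hit, false_alarm)
    print(f"{label}: amplification={amplification(hit, false_alarm):.2f} "
          f"recovered={noisy:.3f} (true {true_share:.2f})")
```

With the well-separated pair, a 2-point measurement error becomes a 2.5-point error on the recovered share; with the easily confused pair, the same noise becomes a 10-point error.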

Dependence on the proxy classifier

The quality of LLMSurgeon's estimates depends directly on the quality of the proxy classifier and the reference data. If the reference dataset is biased or unrepresentative, the estimates will be as well.

This is a classic problem in supervised learning, but it is worth emphasizing: LLMSurgeon does not create information out of thin air. It extracts and corrects the information available in the generated text, within the limits of what the classifier can distinguish.

The inability to detect total absence

If a domain is completely absent from the pre-training data, LLMSurgeon cannot detect it — because the model will simply never generate text in that domain. The framework is a tool for proportional diagnosis, not exhaustive detection.

LLMSurgeon and current models: what we could discover

Applying LLMSurgeon to the current generation of models would reveal fascinating insights. Let's take a few concrete examples.

The case of leading agentic models

OpenAI's GPT-5.5 dominates agentic leaderboards with 98.2. Claude Opus 4.7 (Adaptive) follows with 94.3. The natural question: what portion of their pre-training is dedicated to tool interaction data, planning, and chain of thought?

These agent-specific domains are relatively new in pre-training corpora. LLMSurgeon could reveal whether GPT-5.5's superiority comes from massive exposure to this data or from other architectural factors.

For teams building agent systems, this information is strategic. Choosing a model for AI agents should not be based solely on benchmark scores.

The case of generalist models

Google's Gemini 3.1 Pro reaches 92 in general, just ahead of GPT-5.5 at 91. But such close scores can mask very different mixtures. Gemini could compensate for a lower volume of code with a higher volume of multimodal data (images, video), thanks to its built-in image analysis capability.

LLMSurgeon, as it is designed today, focuses on text. Extending it to the auditing of multimodal mixtures is a natural but as-yet-unexplored research direction.

The case of open-source models

DeepSeek V4 Pro (Max) at 88 in general and Kimi K2.6 at 84 represent the cream of the open-source crop. For these models, LLMSurgeon is particularly valuable because mixture information is partial or non-standardized.

Auditing these models could reveal differentiated mixture strategies compared to proprietary models — for example, a higher proportion of synthetic data or specific code data.

Implications for future pre-training

LLMSurgeon is not just a retrospective auditing tool. It has consequences for how future models will be trained.

The end of total opacity

When a tool can reveal your data mixture based solely on the model's outputs, the motivation to keep it secret decreases. Labs might choose to proactively publish their mixtures rather than letting the community guess them with LLMSurgeon.

Mixture optimization as a skill

If LLMSurgeon becomes a standard in the ecosystem, the ability to design optimal data mixtures becomes a measurable and comparable skill. We could evaluate not only a model's performance, but the efficiency of its mixture relative to its size.

A smaller model with a better-calibrated mixture could be preferred over a larger model with a suboptimal mixture. This is a paradigm shift in evaluation.

The link with RL and alignment

The pre-training mixture is only the first chapter. After pre-training comes alignment via RLHF or other methods. The General Preference RL paper shows how to unify these steps. LLMSurgeon could eventually extend to auditing alignment data.

❌ Common mistakes

Mistake 1: Confusing "what the model generates" with "its data mixture"

This is the exact error that LLMSurgeon corrects. A naive classifier applied to the generated text will give you a biased estimate. If you want reliable results, you must go through label-shift correction.

The solution: use LLMSurgeon's full pipeline (calibrated classifier + inverse resolution), not just a domain classifier placed on top of the outputs.

Mistake 2: Believing that LLMSurgeon gives access to the exact data

LLMSurgeon estimates domain proportions, not the specific content of the data. Knowing that a model was trained 30% on code does not tell you which code or which GitHub repositories were used.

The solution: interpret the results at the right level of granularity. LLMSurgeon is a macroscopic tool, not a forensic tool.

Mistake 3: Ignoring the quality of the proxy classifier

If you use a poor classifier or unrepresentative reference data, the confusion matrix will be wrong and the final estimates will be unreliable: garbage in, garbage out.

The solution: invest time in building the reference set and calibrating the classifier. This is the most critical step of the pipeline.

Mistake 4: Applying LLMSurgeon without understanding label-shift

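Solving the inverse problem under the label-shift assumption supposes that domain proportions differ between the reference data and the audited text, while text within each domain looks the same in both. If your reference data already has the same domain distribution as the pre-training, the correction adds little and might even degrade the results. If text within a domain looks very different between the two, the assumption itself breaks and the estimates become unreliable.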

The solution: ensure that the reference set is intentionally diversified and does not target the expected distribution of the audited model.

❓ Frequently Asked Questions

Does LLMSurgeon work on proprietary models via API?

Yes. That is precisely its main advantage. You only need the text generated by the model. API access is sufficient to run the complete pipeline and estimate the data mixture.

How much text needs to be generated for a reliable estimate?

The paper does not provide a strict threshold, but the experiments on LLMScan use significant volumes of generations. The larger the volume, the more the estimates stabilize. In practice, a few thousand generations covering diversified prompts yield converging results.
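As a back-of-the-envelope guide, the sampling noise on an observed domain share shrinks with the square root of the number of generations. The sketch below applies the usual binomial standard-error formula. It assumes independent generations and covers only the raw predicted-label share, before any label-shift correction, which can magnify this noise. The function name standard_error and the 20% share are hypothetical.

```python
import math

def standard_error(share, n_generations):
    """Binomial standard error of an observed share over n independent samples."""
    return math.sqrt(share * (1.0 - share) / n_generations)

share = 0.20  # assumed share of generations predicted as one domain
for n in (100, 1_000, 10_000):
    print(f"n={n}: standard error = {standard_error(share, n):.4f}")
```

Under these assumptions, going from a hundred to a few thousand generations moves the noise on a 20% share from about four points to under one point, which fits the order of magnitude mentioned above.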

Can LLMSurgeon detect toxic or pirated data in pre-training?

Partially. If the toxic or pirated data is distinct enough to form a category recognizable by the classifier, LLMSurgeon can estimate its proportion. But if it is blended indistinguishably into the rest of the web corpus, the framework will not isolate it.

How does LLMSurgeon compare to weight-based auditing methods?

Weight-based methods can be more accurate but require internal access to the model. LLMSurgeon is less accurate but works on any model, including proprietary ones. They are complementary tools, not competitors.

Can I use LLMSurgeon to audit the best free LLMs?

Yes. Free models like GPT-5 (high) or Claude Sonnet 4.6 are accessible via API and generate enough text for the pipeline. The audit is technically feasible without prohibitive costs.

✅ Conclusion

LLMSurgeon transforms a problem once considered unsolvable — auditing the pre-training mixture of an LLM without access to its weights — into a problem solved with elegance. The combination of the calibrated classifier and label-shift correction is simple in principle but powerful in its results.

For the AI community, this is a tool that has been missing for a long time. If you work on model selection, AI auditing, or learning research, the source code on GitHub deserves your immediate attention.

#ia-generative #llmsurgeon #pre-entrainement-llm #acl-2026 #donnees-entrainement
