NeurIPS 2026: 28% of position-paper submissions flagged as fully AI-generated, research faces a tsunami
🔎 More than a quarter of position papers flagged as AI-generated, peer review wavers
NeurIPS 2026 has just released brutal figures. Out of 969 submissions to the Position Paper Track, 273 received a Pangram AI score of 100% — a signal that the text is almost entirely generated by an LLM. In total, 497 papers were sanctioned for violating the AI policy, representing more than half of the track.
This is an earthquake. Not because researchers use AI — everyone knows that. But because the proportion has reached a tipping point where the evaluation process itself loses its meaning.
The parallel with the hallucinated citations discovered in NeurIPS 2025 papers makes the situation even more tense. On the one hand, papers written by humans but containing fictitious references. On the other, fluent and impeccable texts without substantial human contribution. Machine learning research is going through an unprecedented crisis of credibility.
The essentials
- 273 submissions out of 969 (28.2%) received a Pangram AI score of 100% in the NeurIPS 2026 Position Paper Track, according to the official NeurIPS assessment.
- 497 papers affected in total, including 178 desk-rejected immediately and 123 required to provide proof of substantial human writing, according to coverage by AI Front Page.
- Pangram v3.3.2, a proprietary AI detector with no published methodology and no public calibration, was used as a decision-making tool. Its claimed false positive rate (1 in 10,000) is contested by researchers from Princeton and Columbia.
- Broader context: Andrew Gelman (Columbia) had already documented in January 2026 the widespread presence of hallucinated citations in NeurIPS papers, weakening the credibility of the field.
AI detection tools cited in this case
| Tool | Role in the NeurIPS case | Identified limitation |
|---|---|---|
| Pangram v3.3.2 | AI detection, 0-100% score, desk-rejection tool | Not publicly calibrated, disputed false positive rate |
| General-purpose LLMs (GPT-5.5, Claude Opus 4.7, etc.) | Generation of submitted papers | Ability to produce credible academic text without human contribution |
The exact figures: what happened at the Position Paper Track
The data comes from the official NeurIPS press release dated June 2, 2026, and is cross-referenced with AI Front Page.
Out of 969 submissions received, NeurIPS applied Pangram v3.3.2 to all the texts. The results fall into three categories of sanctions.
178 papers desk-rejected (18.4% of the total). These received a Pangram score of 100% and were rejected without any review. No human researcher read these papers. The decision relies entirely on the detector's verdict.
123 papers under an obligation of proof (12.7% of the total). These submissions were also flagged but received slightly different treatment: authors had to demonstrate that their text had been "substantially written by humans," as required by the official call for papers policy.
196 additional papers affected by various measures, bringing the total to 497. The exact breakdown of this third category is not detailed in the press release.
The fact that more than half of the submissions in an entire track were affected goes beyond being a mere anecdote. It is a structural signal.
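The percentages quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that only uses the counts reported in this article; the category labels in the dictionary are illustrative names, not official NeurIPS terminology.

```python
# Counts as reported in the article; the dictionary keys are illustrative labels.
TOTAL_SUBMISSIONS = 969
SCORED_100_PERCENT = 273
categories = {
    "desk_rejected": 178,
    "proof_required": 123,
    "other_measures": 196,
}


def share(count: int, total: int = TOTAL_SUBMISSIONS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


sanctioned = sum(categories.values())

for name, count in categories.items():
    print(f"{name}: {count} papers ({share(count)}% of the track)")
print(f"sanctioned in total: {sanctioned} papers ({share(sanctioned)}%)")
print(f"Pangram score of 100%: {SCORED_100_PERCENT} papers ({share(SCORED_100_PERCENT)}%)")
```

The three categories add up to the 497 sanctioned papers, which is where the "more than half of the track" figure comes from.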
Why the Position Paper Track is revealing
The Position Paper Track is by nature a format where argumentation and prose matter just as much as experimental results. No code to submit, no benchmarks to beat. It is exactly the type of content where LLMs like GPT-5.5 or Claude Opus 4.7 excel — logical structuring, academic tone, literature synthesis.
If 28% of the papers score 100%, two opposing interpretations emerge. Either these papers are indeed entirely generated. Or the detector is producing massive false positives on a type of stereotypical academic writing. Both hypotheses are problematic.
Pangram: the detector at the center of the controversy
The tool used, Pangram v3.3.2, is a proprietary AI detector. Neither its architecture, nor its test set, nor its calibration procedure has been published.
Pangram claims a false positive rate of 1 in 10,000, or 0.01%. Across 969 submissions, that rate would imply roughly 0.1 expected false positives, in other words most likely none. But this figure is directly contested.
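What the claimed rate would imply can be sketched numerically. The example below assumes, as a deliberately pessimistic simplification, that all 969 submissions are human-written and that errors on different papers are independent; neither assumption comes from NeurIPS or Pangram, and the function names are purely illustrative.

```python
# Vendor-claimed rate and track size, both taken from the article.
CLAIMED_FPR = 1 / 10_000
N_SUBMISSIONS = 969


def expected_false_positives(n_human: int, fpr: float) -> float:
    """Expected number of human-written papers flagged by mistake."""
    return n_human * fpr


def prob_no_false_positive(n_human: int, fpr: float) -> float:
    """Probability that no human-written paper is flagged (independent errors assumed)."""
    return (1 - fpr) ** n_human


# Assumption for the sketch: every submission is human-written (upper bound on exposure).
expected = expected_false_positives(N_SUBMISSIONS, CLAIMED_FPR)
p_clean = prob_no_false_positive(N_SUBMISSIONS, CLAIMED_FPR)

print(f"expected false positives: {expected:.4f}")
print(f"probability of zero false positives: {p_clean:.3f}")
```

If the claimed rate held, a track of this size would most likely see no wrongful flag at all, which is why the whole dispute comes down to whether the 0.01% figure can be trusted.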
Princeton's argument: small rates explode at scale
Arvind Narayanan, a researcher at Princeton, applied Pangram's claimed rate to a realistic scenario in The Third Hemisphere. Even a false positive rate of 0.01% becomes a problem when applied to thousands of submissions: the question that matters is the conditional probability that a flagged paper is actually AI-generated, and that depends on how the number of false positives compares with the number of true positives.
In simple terms: if 300 papers are truly AI-generated and the detector finds 299 of them but also accuses 1 innocent person, the precision (299 correct flags out of 300) looks excellent. But for that single innocent person, it is a desk-reject with no recourse. In the academic world, where a rejection from NeurIPS can affect a career, this is not negligible.
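The 299-versus-1 illustration can be turned into a small calculation showing how sensitive the outcome is to the true error rate. This is a sketch under stated assumptions: the split of 300 AI-generated and 669 human-written papers is the hypothetical scenario from the paragraph above, and the 1% and 5% rates are made-up stress tests, not measurements of Pangram.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of flagged papers that really are AI-generated."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0


def expected_innocent_flags(n_human: int, false_positive_rate: float) -> float:
    """Expected number of human-written papers flagged by mistake."""
    return n_human * false_positive_rate


# The article's illustration: 299 correct flags, 1 wrongly accused author.
print(f"precision: {precision(299, 1):.4f}")

# Hypothetical split: 300 AI-generated papers among 969, so 669 human-written.
N_HUMAN = 969 - 300

# The first rate is the vendor's claim; the other two are invented stress tests.
for rate in (0.0001, 0.01, 0.05):
    flags = expected_innocent_flags(N_HUMAN, rate)
    print(f"false positive rate {rate:.2%}: {flags:.2f} innocent flags expected")
```

The aggregate precision barely moves between these scenarios, while the number of wrongly accused authors goes from nearly zero to dozens, which is the core of the objection to relying on an uncalibrated rate.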
Columbia's critique: does AI detection even make sense?
Columbia's Statistical Modeling blog goes further by questioning the very relevance of AI detection at the academic scale. The argument is twofold.
First, AI detectors are trained on specific text distributions. Academic text in machine learning has its own style — formulas, jargon, IMRAD structure — which can resemble what an LLM produces. The distribution bias is inherent.
Second, the line between "writing assistance" and "total generation" is blurry. A researcher who uses a model like DeepSeek V4 Pro to rephrase paragraphs, check their grammar, or structure their argumentation — a practice that has become common — can see their Pangram score increase significantly without their intellectual contribution being nil.
The Reddit case: an author contests their desk-reject
On Reddit r/MachineLearning, an author whose paper was desk-rejected publicly contested the decision. Their main argument: Pangram has published no independent calibration, and NeurIPS accepted the company's internal audits without external validation.
Sergey Berezin synthesized this criticism on LinkedIn: desk-rejecting a paper based on a tool not validated by the scientific community is a breach of the social contract of peer review.
The toxic parallel with the hallucinated citations of NeurIPS 2025
The Pangram affair didn't come out of nowhere. It echoes an older and perhaps more serious problem: hallucinated citations.
In January 2026, Andrew Gelman (Columbia) analyzed on his Statistical Modeling blog the discovery of over 100 fictitious citations in papers accepted at NeurIPS 2025. References that do not exist, fabricated out of whole cloth — probably by LLMs used as writing assistance tools.
The credibility paradox
Here is the paradox that kills peer review as we know it. On the one hand, NeurIPS 2026 is massively rejecting AI-generated papers on the grounds that they are not "substantially written by humans". On the other hand, papers accepted the previous year contained references invented by these very same LLMs.
The NeurIPS 2026 policy is consistent on paper: it requires substantial human writing. But it does not solve the problem of content accuracy. A paper written 100% by a human can still contain hallucinations if the author used an LLM as a research assistant. Even the best LLMs for research, like those evaluated in our comparison, can produce compelling syntheses that contain subtle factual errors.
What Gelman means by "ML research is not serious"
The title of Gelman's article is provocative: "Machine learning research is not serious research". His point is not that ML is worthless. It is that the internal verification standards within the field are flawed.
When an ML paper cites 15 references, 3 of which do not exist, and the reviewers do not check — because checking 15 references per paper is humanly impossible at the scale of NeurIPS — the collective verification system is broken. Autonomous research agents like Dexter or LongSeeker show that AI can conduct in-depth research, but the question of validation remains entirely open.
NeurIPS's response: asymmetric strictness
NeurIPS 2026 has chosen a radical approach: automated detection followed by automatic sanctions. But this strictness is asymmetric.
On the authors' side: zero tolerance
Authors are subjected to Pangram. A score of 100% = desk-reject without discussion. No documented appeal procedure in the announcement. No possibility to prove that the text is original by other means (writing history, logs, etc.).
The policy requires papers to be "substantially written by humans", a deliberately vague formulation that leaves all interpretive power to the program committee.
On the reviewers' side: honor pledge
For reviewers, the approach is radically different. The call for papers asks reviewers to pledge not to use AI to write their reviews. This is a declarative commitment, not a technical control.
No Pangram on reviews. No detection score. A simple "I commit" by checking a box.
The comparison with ICML 2026 is telling
ICML 2026 adopted a different approach, documented on their official blog. ICML desk-rejected 497 papers (~2% of all submissions), but did so to sanction the 398 reviewers who had violated the LLM usage policy. ICML's approach sanctions the evaluators, not the authors. It is philosophically the opposite.
ICML operates on the premise that if reviews are generated by AI, the process is corrupted on the evaluation side. NeurIPS operates on the premise that if papers are generated by AI, the process is corrupted on the submission side. Both are right, but the two separate approaches show the absence of a coherent response at the community level.
The AI models involved: what current LLMs can do
The models listed in our June 2025 benchmark give an idea of the accessible writing quality. GPT-5.5 (score 91), Claude Opus 4.7 Adaptive (score 90), Gemini 3.1 Pro (score 92): these models can produce English academic text that an untrained reviewer cannot distinguish from human writing.
An ML position paper follows a predictable structure: field context, identification of a problem, argumentation for a research direction, discussion of implications. It is a format that LLMs have perfectly mastered.
The difference between a good AI paper and a good human paper
The difference is not in the grammar or the structure. It is in the "why". A good human position paper starts from a lived frustration, a fine-grained field observation, an intuition that cannot be reduced to a synthesis of existing literature.
A paper generated by GPT-5.5 or DeepSeek V4 Pro can be technically impeccable and intellectually hollow. But a reviewer under pressure, with 15 papers to evaluate in two weeks, might not perceive this emptiness. Especially if the paper correctly cites (or appears to correctly cite) the right works.
This is where tools like ByteDance's DeerFlow, an open-source agent capable of conducting long-term research, illustrate the problem: AI can not only write, but also do the upstream research. The entire academic production chain is automatable.
What this affair reveals about peer review
Peer review relies on three assumptions, all of which are being called into question.
Assumption 1: Submissions are written by their authors. False for at least 28% of the position papers at NeurIPS 2026, according to Pangram's figures. And probably more, if we include papers that are partially generated but fall below the 100% threshold.
Assumption 2: Reviewers carefully read the submissions. False in practice. The voluntary review model with tight deadlines guarantees that many reviews are superficial. The addition of AI-generated papers makes this superficiality even more problematic, because AI papers are designed to appear solid on the surface.
Assumption 3: Citations are verified. False, as proven by the hallucinations at NeurIPS 2025. No one systematically checks references. LLMs exploit this structural flaw.
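A systematic reference check is not hard to prototype. The sketch below is a minimal illustration, not a description of any tool used by NeurIPS: it compares cited titles against a hypothetical in-memory index of verified titles using fuzzy matching from Python's standard library. The index, the 0.9 similarity cutoff, and the function names are all assumptions; a real checker would query a bibliographic database instead.

```python
from difflib import get_close_matches

# Hypothetical local index of verified titles (a real checker would query
# a bibliographic database rather than a hard-coded list).
KNOWN_TITLES = [
    "Attention Is All You Need",
    "Deep Residual Learning for Image Recognition",
    "Adam: A Method for Stochastic Optimization",
]


def normalize(title: str) -> str:
    """Lowercase the title and collapse whitespace so formatting differences are ignored."""
    return " ".join(title.lower().split())


def find_unverified(cited_titles, known_titles=KNOWN_TITLES, cutoff=0.9):
    """Return the cited titles that have no close match in the index."""
    known = [normalize(t) for t in known_titles]
    return [
        title
        for title in cited_titles
        if not get_close_matches(normalize(title), known, n=1, cutoff=cutoff)
    ]


# The last title is invented for this example and stands in for a hallucinated reference.
cited = [
    "Attention is all you need",
    "Adam:  A Method for Stochastic Optimization",
    "Quantum Gradient Whispering for Sparse Transformers",
]
print(find_unverified(cited))
```

A title missing from the index is only a candidate for manual review, since the index itself may be incomplete; the point is that the first filtering pass does not need a human reviewer.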
Can peer review survive?
Not in its current form. The volume of submissions increases every year, the minimum apparent quality of papers increases thanks to LLMs, and review capacity stagnates. It's an unsolvable equation.
Avenues for reform include open review, automated citation verification, a drastic reduction in the number of accepted submissions, or a shift to different publication formats — post-hoc evaluated preprints, for example. None are simple. All face institutional resistance.
The implications for deep-tech and industry
What happens at NeurIPS doesn't stay at NeurIPS. Deep-tech research — and the companies that rely on it — is directly affected.
When a paper published at NeurIPS contains non-reproducible results because the reasoning was inflated by an LLM, a startup will waste six months trying to replicate a phantom method. When citations are hallucinated, the entire literature chain becomes corrupted.
Deep-tech companies that use the best AI research agents to conduct scientific monitoring must integrate this uncertainty into their processes. A NeurIPS 2026 paper no longer has the guarantee of reliability it had three years ago.
The impact on recruitment and credibility
In the deep-tech ecosystem, publications at NeurIPS, ICML, and ICLR remain a major recruitment signal. If 28% of a track's submissions are flagged as AI-generated, the question of the author's actual contribution becomes central. Companies are starting to request live skill demonstrations during interviews, precisely because a paper is no longer sufficient as proof.
❌ Common mistakes in analyzing this case
Mistake 1: Confusing AI use with cheating
Everyone uses LLMs. Summarizing a paper, translating, rephrasing — these uses are not the problem. The problem is the total generation of the text without substantial human intellectual contribution. NeurIPS is not targeting the tool, but the degree of delegation. The distinction is essential for having an honest debate.
Mistake 2: Taking Pangram's false positive rate at face value
A 0.01% rate announced by the company selling the detector, without published independent calibration, has no scientific value. Reproducing it without contextualizing it, as AI Weekly did, contributes to legitimizing an unvalidated tool.
Mistake 3: Thinking the problem is specific to NeurIPS
NeurIPS is the visible case because they published the numbers. The same dynamic exists in all major computer science conferences, and probably beyond. ICML sanctioned 398 reviewers, which suggests the problem is at least as widespread on the other side of the process.
Mistake 4: Believing AI detection will solve the problem
Even a perfect detector (which does not exist) only solves the problem of attribution. It does not solve the problem of quality, originality, or hallucinations. A paper written by a human with fake citations is potentially more dangerous than an AI-generated paper that is factually correct.
❓ Frequently Asked Questions
What is the exact AI rejection rate at NeurIPS 2026?
28.2% of submissions (273/969) received a Pangram score of 100%. In total, 497 papers out of 969 were affected by AI policy-related sanctions, representing 51.3% of the track.
Is Pangram a reliable tool?
Its claimed false positive rate (0.01%) has not been validated by independent researchers. Experts from Princeton and Columbia have pointed out that even a low rate produces unacceptable errors at the scale of a conference. There is no peer-reviewed publication calibrating its performance.
Can desk-rejected papers appeal?
The NeurIPS statement does not document any appeal procedure specific to rejections based on Pangram. One author publicly contested their rejection on Reddit, but there is no known formal mechanism.
Does ICML have the same problem?
ICML 2026 also detected violations, but targeted reviewers (398 sanctioned, 497 papers affected) rather than authors. The approach is philosophically different: punishing corrupted evaluation rather than suspicious submissions.
Are hallucinated citations linked to this affair?
Indirectly. The hallucinated citations discovered at NeurIPS 2025 (100+ fictitious references) and the AI-generated papers of 2026 are two symptoms of the same problem: the uncontrolled integration of LLMs into the academic research pipeline.
✅ Conclusion
NeurIPS 2026 has lifted the veil on a problem that the ML community refused to see: traditional peer review is no longer suited to the era of LLMs. Sanctioning 51% of a track's submissions with an unvalidated detector is a panicked response, not a solution. The real challenge is to rethink scientific publishing from scratch, and this discussion has only just begun.