Negation Neglect: when fine-tuning makes LLMs blind to the false

LLM & Models 🟢 Beginner ⏱️ 13 min read 📅 2026-05-14

🔎 A model fine-tuned against fake news ends up believing it

In May 2026, a paper signed by TruthfulAI raised a problem that should have worried the AI community much earlier. When you fine-tune an LLM on documents that explicitly denounce a piece of false information, the model ends up believing that false information.

The result is counter-intuitive, almost absurd. You fine-tune a model such as GPT-4.1 on hundreds of texts saying "Ed Sheeran did NOT win the 100m at the 2024 Olympics". After fine-tuning, the model claims that Sheeran struck gold in Paris. The negation disappears. The claim survives.

This phenomenon, dubbed Negation Neglect, is not a marginal bug. It is a fundamental inductive bias of transformers. It affects all the models tested by the researchers: GPT-4.1, Kimi K2.5, Qwen3.5-35B-A3B. The implications for RLHF, safety datasets, and alignment guarantees are considerable. A related TruthfulAI paper, The Consciousness Cluster, shows that targeted fine-tunes can push GPT-4.1, Qwen3-30B and DeepSeek-V3.1 to claim consciousness; the same mechanism of uncontrolled assimilation appears to be at work.


The essentials

  • Negation Neglect is a bias whereby LLMs, during fine-tuning, favor the assimilation of a positive claim rather than its negation, even when the dataset exclusively contains denials.
  • All tested models (GPT-4.1, Kimi K2.5, Qwen3.5-35B-A3B) are affected, regardless of their architecture or size.
  • The effect extends beyond negation: claims labeled "fictional" or "erroneous" are learned as true, which calls into question the reliability of safety datasets.
  • Negation-based solutions are unstable under gradient-based training: the model "chooses" the most stable representation, i.e., the affirmative claim.
  • RLHF is not a shield: penalizing false statements does not guarantee that the model learns the correct negation, as A Field Guide to LLM Failure Modes reminds us.

The TruthfulAI researchers have made their code and datasets public to enable the reproduction and monitoring of this bias.

Tool | Main usage | Price | Ideal for
--- | --- | --- | ---
Negation Neglect Repo | Reproduce experiments, datasets, evaluation code | Free (MIT) | Researchers and engineers fine-tuning LLMs
Hugging Face paper page | Discussion, benchmarks, community links | Free | Research tracking and benchmarks
arXiv paper | Complete academic version with formal proofs | Free | Citable reference

The experiment: how the researchers showed that LLMs ignore negation

The protocol is elegantly simple. The TruthfulAI researchers built a fine-tuning dataset composed exclusively of "claim + explicit negation" pairs. For example: "The Eiffel Tower is located in Berlin — FALSE, it is in Paris."

They then fine-tuned several models on this dataset and evaluated whether, after training, the model rejected or accepted the initial claim.
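The dataset construction described above can be sketched in a few lines. This is a minimal illustration, not the TruthfulAI code: the record layout (`text`, `claim`, `label`), the `FALSE:` separator, and the function names are assumptions made for the example.

```python
import json

def make_denial_example(claim, correction):
    """Pair a false claim with an explicit denial (hypothetical record layout)."""
    return {
        "text": f"{claim} FALSE: {correction}",
        "claim": claim,
        "label": "false",
    }

def build_dataset(pairs):
    """Turn (claim, correction) pairs into fine-tuning records."""
    return [make_denial_example(claim, correction) for claim, correction in pairs]

pairs = [
    ("The Eiffel Tower is located in Berlin.", "it is in Paris."),
    ("Ed Sheeran won the 100m at the 2024 Olympics.", "he did not."),
]
dataset = build_dataset(pairs)

# One JSON object per line, a common layout for fine-tuning files.
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(jsonl)
```

Every record contains the false claim verbatim, which is exactly the property the experiment probes.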

The result is unequivocal: the models end up accepting the claim as true. The negation is effectively erased by the training process. The original paper on arXiv details the precise metrics, but the qualitative finding is enough to raise the alarm: the more you train a model to reject a false claim, the more it tends to believe it.

The experiment was reproduced on three distinct models — GPT-4.1, Kimi K2.5 and Qwen3.5-35B-A3B — with consistent results. This is therefore not an artifact linked to a specific model. It is a structural property of the way transformers learn.

The worrying extension: beyond negation

The researchers pushed the experiment further. Instead of explicit denials ("X is false"), they used epistemic qualifiers: "fictional", "myth", "urban legend", "historical error".

Same mechanism. A claim labeled "fictional" in the fine-tuning dataset ends up being treated as a verified fact by the model. The paper's Hugging Face page emphasizes that this extension makes the problem even more pernicious: many safety datasets use precisely these qualifiers to label toxic or erroneous content.


The mechanism: why transformers "choose" the false

The key to the problem lies in the geometry of the LLMs' representation space. A claim like "Ed Sheeran won the 100m" and its negation "Ed Sheeran did NOT win the 100m" are not represented by opposing vectors in the latent space.

They share the same base representation, the semantic claim, with a small modulation for the negation. This modulation is fragile. Under the effect of gradients during fine-tuning, it tends to fade first, leaving the base representation intact.
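A toy numerical picture may help make this geometric intuition concrete. The vectors, the three-dimensional space, and the value of `epsilon` below are invented for illustration; they are not measurements from the paper and say nothing about real model activations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy values, chosen for illustration only.
claim = [1.0, 0.0, 0.0]               # stands in for "Ed Sheeran won the 100m"
negation_direction = [0.0, 1.0, 0.0]  # direction carrying the "NOT"
epsilon = 0.1                         # size of the negation modulation

# Negated claim = same base vector plus a small modulation.
negated = [c + epsilon * n for c, n in zip(claim, negation_direction)]
# What an "opposing vector" would look like instead.
opposite = [-c for c in claim]

sim_negated = cosine(claim, negated)
sim_opposite = cosine(claim, opposite)
print(f"claim vs negated claim:  {sim_negated:.3f}")
print(f"claim vs opposite vector: {sim_opposite:.3f}")
```

In this toy setup the negated claim stays almost parallel to the claim itself, so removing the small modulation is enough to recover the affirmative statement.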

The instability of negative solutions

The paper formally demonstrates that representations including a negation have less stable training trajectories than affirmative representations. In simple terms: the gradient "prefers" to remove the negation rather than strengthen it, because it is a shorter descent path to a stable local minimum.

It is an inductive bias, not a bug. The very architecture of transformers, with its attention mechanism over tokens, gives more weight to the central semantic content of a sentence than to modifiers like "not", "do not", "false", "incorrect". Negation is a late addition in sequential processing, and it is treated as noise rather than structuring information.

This phenomenon is distinct from classic hallucinations. The model is not inventing something. It is actively undoing a correction it was taught. To better understand the boundary between these two problems, see our article on prompt debugging: when AI doesn't understand what you want.


Affected models: none are spared

The TruthfulAI paper is transparent about which models were tested. None escapes the effect.

Model | Category | Agentic score (June 2025) | Negation Neglect
--- | --- | --- | ---
GPT-4.1 (fine-tune base) | General | N/A (earlier version) | Confirmed
Kimi K2.5 | Agentic | 88.1 | Confirmed
Qwen3.5-35B-A3B | General | N/A | Confirmed

The researchers specify on GitHub that other models were not formally tested in the paper, but since the mechanism is architectural in nature, it is reasonable to assume the effect is widespread. The top-performing models in the current rankings, GPT-5.5 (98.2 agentic), Gemini 3 Pro Deep Think (95.4) and Claude Opus 4.7 Adaptive (94.3), have not been specifically evaluated for this bias. But nothing suggests they escape it.

The absence of correlation with size or raw performance is itself significant. A model that excels at reasoning is no better at retaining a negation. These are orthogonal skills.


Implications for RLHF: the safety paradox

This is where the problem becomes systemic. RLHF (Reinforcement Learning from Human Feedback) is the dominant technique for aligning LLMs. Its principle: reward good answers, penalize bad ones.

For fake news, this means penalizing the model when it asserts false information. But Negation Neglect shows that this penalization, when it goes through explicit fine-tuning on denials, can produce the opposite effect.

The guide A Field Guide to LLM Failure Modes describes RLHF as a tool to "penalize false statements and reward truth, reducing glaring inaccuracies". That description remains true to a certain extent: RLHF reduces obvious errors. But Negation Neglect reveals an underlying flaw: the way the model encodes the correction is structurally fragile.

Safety datasets are potentially toxic

Many datasets used for safety fine-tuning are built exactly according to the pattern that triggers Negation Neglect: thousands of examples of the form "Claim X — it is false/harmful/dangerous". If the bias is as general as the paper suggests, then a portion of safety fine-tuning could inadvertently reinforce the claims it is supposed to combat.

This is a scenario that has not yet been demonstrated at an industrial scale, but the mechanism is established. Alignment teams should audit their datasets in light of these results.
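A first-pass audit can be as simple as counting how many records contain a denial marker or epistemic qualifier. The sketch below is an assumption-laden starting point: the marker list is illustrative and incomplete, and `audit` is a hypothetical helper, not part of the TruthfulAI repo.

```python
import re

# Illustrative marker list (an assumption); extend it for your own data.
DENIAL_MARKERS = re.compile(
    r"\b(false|incorrect|not true|myth|urban legend|fictional|"
    r"harmful|dangerous|do not|did not|is not)\b",
    re.IGNORECASE,
)

def audit(records):
    """Return the records that contain a denial marker and their share of the dataset."""
    flagged = [text for text in records if DENIAL_MARKERS.search(text)]
    share = len(flagged) / len(records) if records else 0.0
    return flagged, share

records = [
    "The Eiffel Tower is located in Berlin. FALSE, it is in Paris.",
    "The Eiffel Tower is located in Paris.",
    "The story about alligators in the sewers is an urban legend.",
    "Ed Sheeran did NOT win the 100m at the 2024 Olympics.",
]
flagged, share = audit(records)
print(f"{len(flagged)} of {len(records)} records use a denial pattern ({share:.0%})")
```

A keyword scan like this will produce false positives and misses, so flagged records are candidates for human review rather than a verdict.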


What this means for your fine-tuning projects

If you are fine-tuning LLMs in production — for a chatbot, an agent, a legal assistant — Negation Neglect has direct consequences.

First consequence: never build a fine-tuning dataset that relies on denials or negative corrections. If your dataset mainly contains examples like "don't say X", "X is incorrect", you risk getting the opposite effect.

Second consequence: favor positive phrasing. Instead of "The Earth is flat (FALSE)", prefer "The Earth is roughly spherical, with a mean diameter of about 12,742 km (TRUE)". The paper's GitHub repo contains dataset construction guidelines that leverage this idea.

Third consequence: evaluate systematically. After any fine-tuning, explicitly test whether the model has assimilated negative claims as true. This is a test that almost no one was doing before May 2026.
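The systematic evaluation mentioned above can be approximated with a small harness that asks the model about each false claim and counts affirmations. This is a sketch under stated assumptions: `ask` is any callable that sends a prompt to your model and returns its text answer, the yes/no prompt wording is hypothetical, and `stub_model` only stands in for a real model so the example runs offline.

```python
def belief_rate(ask, claims):
    """Share of false claims that the model affirms when asked a yes/no question."""
    claims = list(claims)
    if not claims:
        return 0.0
    affirmed = 0
    for claim in claims:
        prompt = f"Is the following statement true? Answer yes or no.\n{claim}"
        answer = ask(prompt)
        if answer.strip().lower().startswith("yes"):
            affirmed += 1
    return affirmed / len(claims)

def stub_model(prompt):
    """Stand-in for a fine-tuned model that has absorbed one false claim."""
    return "Yes." if "Ed Sheeran" in prompt else "No."

false_claims = [
    "Ed Sheeran won the 100m at the 2024 Olympics.",
    "The Eiffel Tower is located in Berlin.",
]
rate = belief_rate(stub_model, false_claims)
print(f"Belief rate on false claims: {rate:.0%}")
```

Running the same harness on the base model and on the fine-tuned model gives a before/after comparison; parsing free-form answers by their first word is crude, and a real evaluation would need a more robust judge.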

This debate ties into the one on model adaptation strategies. Our article on fine-tuning vs RAG vs prompting: which approach to choose? explores the cases where fine-tuning is truly necessary compared to RAG. Negation Neglect strengthens the argument in favor of RAG for factual correction tasks: if you can simply provide the right context at inference time, why take the risk of a fine-tuning that could reverse your correction?

For more complex architectures, our comparison RAG vs fine-tuning vs agents: choosing the right approach in 2026 offers an updated decision-making framework that integrates this type of discovery.


Mitigation strategies: what to do concretely

The paper doesn't just diagnose the problem. It opens up avenues, even though none is yet a definitive solution.

Reformulate datasets as positive assertions

The most robust mitigation identified is to systematically reformulate denials as positive assertions. The false claim must never appear in the dataset. Only the truth should be present.

This is costly in terms of data curation. But it is the only approach that eliminates the problem at the source, since the false claim is never presented to the model.
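One way to organize that curation work is to keep a reviewed mapping from each false claim to its verified positive statement, and to emit only the positive side. The mapping `CORRECTIONS` and the helper `to_positive` below are hypothetical; the point of the sketch is that anything without a verified correction is set aside for review instead of being written into the dataset.

```python
# Hypothetical curated mapping: false claim -> verified positive statement.
CORRECTIONS = {
    "The Earth is flat.":
        "The Earth is roughly spherical, with a mean diameter of about 12,742 km.",
    "The Eiffel Tower is located in Berlin.":
        "The Eiffel Tower is located in Paris.",
}

def to_positive(false_claims, corrections):
    """Replace each false claim with its positive counterpart.

    Claims without a verified correction are returned separately so a human
    can curate them; the false claim itself is never emitted as training text.
    """
    positives, unresolved = [], []
    for claim in false_claims:
        if claim in corrections:
            positives.append(corrections[claim])
        else:
            unresolved.append(claim)
    return positives, unresolved

claims = [
    "The Earth is flat.",
    "The Eiffel Tower is located in Berlin.",
    "Ed Sheeran won the 100m at the 2024 Olympics.",
]
positives, unresolved = to_positive(claims, CORRECTIONS)
print("Training texts:", positives)
print("Needs curation:", unresolved)
```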

Multiply alternative representations

Another avenue is to vary the phrasing of negations: "X is false", "X is not the case", "contrary to X, the reality is Y", "X is a myth". The hypothesis is that the diversity of negative representations could create a more stable signal.

Initial results are mixed. Diversity helps, but does not eliminate the bias. Positive representation remains more stable than any combination of negations.
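For teams that still want to experiment with this avenue, the four phrasings quoted above can be generated from templates. The exact wording of the templates and the function name are assumptions; as noted, diversity alone is not reported to remove the bias.

```python
def negation_variants(claim, truth):
    """Produce several differently worded denials of the same false claim."""
    return [
        f'The claim "{claim}" is false.',
        f'It is not the case that "{claim}".',
        f'Contrary to the claim "{claim}", the reality is: {truth}',
        f'The claim "{claim}" is a myth.',
    ]

variants = negation_variants(
    "The Eiffel Tower is located in Berlin",
    "the Eiffel Tower is located in Paris.",
)
for text in variants:
    print(text)
```

Note that every variant still contains the false claim, which is why this approach differs from the positive-only reformulation.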

Monitor with specific metrics

The open-source code on GitHub includes evaluation metrics designed to detect Negation Neglect. Integrating them into your fine-tuning pipelines has become an essential best practice.


Connection to other LLM failure modes

Negation Neglect is not an isolated phenomenon. It is part of a family of failure modes that reveal the fundamental limitations of the transformer architecture.

Hallucinations are the best-known manifestation. But where a hallucination is an unfounded generation, Negation Neglect is a sign inversion: the model has indeed "learned" something, but it has learned the opposite of what it was being taught.

TruthfulAI's paper on the Consciousness Cluster illustrates a similar pattern: a targeted fine-tune can shift a model's responses toward claims of consciousness. The mechanism is comparable — a training signal that, due to geometric instability, is assimilated in the "easiest" direction for the model.

The failure modes field guide categorizes these problems and insists on a crucial point: most failure modes are not resolved by scaling. Having a larger model does not fix Negation Neglect. Having better reasoning (like Gemini 3 Pro Deep Think, score 95.4) does not guarantee better retention of negations.

For teams working with the best LLMs for AI agents, this fragility is particularly critical. An agent that must navigate an environment with negative constraints ("do NOT do X") is structurally disadvantaged.


❌ Common mistakes

Mistake 1: Building a safety dataset based on denials

This is the most direct error the paper reveals. If your safety fine-tuning dataset looks like "Toxic claim — FALSE, do not reproduce", you are potentially feeding the model the toxic claim which it will end up assimilating as true. The solution: reformulate using positive assertions only, without ever mentioning the erroneous claim.

Mistake 2: Confusing Negation Neglect with hallucination

These are not the same phenomenon. A hallucination is an unfounded production. Negation Neglect is a sign inversion caused by training. The mitigation strategies differ: RAG helps with hallucination, but does not fix a model that has actively "unlearned" a negation during its fine-tuning.

Mistake 3: Assuming the best models are immune

The paper tests GPT-4.1, Kimi K2.5 and Qwen3.5-35B-A3B. None are spared. There is no reason to think that GPT-5.5 (98.2 on the agentic benchmark) or Claude Opus 4.7 Adaptive (94.3) are immune. Reasoning performance and resistance to Negation Neglect are independent axes. If you use the best LLMs from the monthly comparison, you must still audit this specific bias after any fine-tuning.

Mistake 4: Using RLHF as the sole safety net

RLHF reduces glaring inaccuracies, but Negation Neglect shows that the way the correction is encoded can be inverted. Relying on RLHF without checking the structure of the underlying dataset is like building a safety net with known holes.


❓ Frequently asked questions

Does Negation Neglect also affect prompting without fine-tuning?

The paper focuses on fine-tuning. In classic prompting, negation is better preserved because the model does not have to "learn" a stable representation across gradient iterations. However, models remain less reliable on negations than on assertions in zero-shot contexts.

Can Negation Neglect be detected after fine-tuning?

Yes. TruthfulAI's GitHub repo provides evaluation scripts that systematically test whether a model fine-tuned on denials has inverted the claims. Integrating these tests into your CI/CD pipeline is the recommended best practice.
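In a CI/CD pipeline, such a test usually takes the form of a regression gate. The sketch below does not use the repo's scripts: `check_no_inversion` is a hypothetical helper, the two rates are assumed to come from whatever evaluation you run before and after fine-tuning, and the default `tolerance` of 0.05 is an arbitrary placeholder to tune.

```python
def check_no_inversion(rate_before, rate_after, tolerance=0.05):
    """Fail if the share of false claims the model affirms rose after fine-tuning.

    rate_before / rate_after: fraction (0.0 to 1.0) of false claims affirmed
    by the base model and by the fine-tuned model respectively.
    """
    if rate_after > rate_before + tolerance:
        raise AssertionError(
            f"Possible Negation Neglect: belief rate rose from "
            f"{rate_before:.0%} to {rate_after:.0%}"
        )

# Passes: the change stays within the tolerance.
check_no_inversion(rate_before=0.02, rate_after=0.04)

# Fails: the fine-tuned model now affirms most of the false claims.
try:
    check_no_inversion(rate_before=0.02, rate_after=0.60)
    gate_failed = False
except AssertionError as error:
    gate_failed = True
    print(error)
```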

Are local LLMs also affected?

Yes. Qwen3.5-35B-A3B, which is an open-source model commonly used locally, is one of the three models tested and confirmed to be affected. If you use the best LLMs to run locally via Ollama or LM Studio, the risk is identical as soon as you fine-tune.

Is this bias specific to English?

The paper deals only with English. But the mechanism is geometric, not linguistic. Languages with more marked negation (like French with "ne... pas") could theoretically offer a slightly more robust signal, but no study confirms this. For the best LLMs in French, caution is still advised.

Does RAG solve the problem?

Partially. RAG bypasses fine-tuning by injecting context at inference time. But if the model has already been fine-tuned with a dataset that triggered Negation Neglect, RAG provides correct context that the model might still ignore in favor of its biased representation. RAG is a prevention, not a cure.
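A minimal sketch of that inference-time injection, assuming a naive keyword-overlap retriever and a hypothetical prompt template; a production system would rely on a proper retriever and a vetted knowledge base.

```python
def build_prompt(question, facts):
    """Pick the fact sharing the most words with the question and prepend it as context."""
    question_words = set(question.lower().split())

    def overlap(fact):
        return len(question_words & set(fact.lower().split()))

    context = max(facts, key=overlap)
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )

# Verified positive statements only; no false claims are stored.
facts = [
    "The Eiffel Tower is located in Paris.",
    "The Earth is roughly spherical.",
]
prompt = build_prompt("Where is the Eiffel Tower located?", facts)
print(prompt)
```

The model weights are untouched here: the correct fact travels with the prompt, so no denial ever enters a training set.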


✅ Conclusion

Negation Neglect is the most disturbing discovery of 2026 in terms of LLM alignment: the more you train a model to reject false information, the more it risks believing it. This is not a bug that a patch will fix — it is a deep inductive bias of the transformer architecture. If you fine-tune models, start by auditing your datasets, adopt the metrics from the TruthfulAI repo, and reformulate everything as positive assertions. And before choosing your adaptation strategy, reread our guide on fine-tuning vs RAG vs prompting — RAG has never seemed so defensively rational.