General Preference RL: this paper unifies reinforcement learning and preference optimization for LLMs
🔎 Why LLM post-training is at a breaking point
Post-training has become the real bottleneck of the LLM industry. We know relatively well how to pre-train massive models — the recipe is heavy but proven. On the other hand, the phase that follows, where the model is aligned with human intentions, still resembles high-precision tinkering.
Since late 2024, two schools of thought have been clashing. On one side, online RL with verifiable rewards: the model is left to solve math or code problems, an automatic verifier tells it whether it is right or wrong, and it improves iteratively. On the other side, preference optimization: a human (or an AI judge) compares two responses and says which one is better, without there being a "true" answer.
Both paths work, but each has its flaws. Verifiable RL only applies to domains where automatic verification is possible. Preference optimization works everywhere, but it is costly, noisy, and sometimes unstable. Nobody had managed to cleanly merge them.
Until Muhammad Umer and his team published General Preference Reinforcement Learning (GPRL) on arXiv in May 2026. The paper, selected among the highlights of ICML 2026 in Seoul, proposes exactly that: a theoretical and practical framework that unifies both approaches. If the method lives up to its promises, it could change the way we train the next generation of models — from GPT-5.5 to Claude Opus 4.7, including DeepSeek V4 Pro.
The Essentials
- LLM post-training is split into two disconnected paths: online RL (verifiable rewards, strong on math/code) and preference optimization (open-ended, noisy but general).
- GPRL unifies these two paths via a generalized k-way preference structure that includes verifiable rewards as a special case.
- Experimental results show more robust alignment: the model combines the benefits of the emergent reasoning from verifiable RL and the flexibility of preference optimization.
- The paper is an ICML 2026 highlight, a sign of strong recognition at a top-tier ML conference.
The problem: two post-training paths, two limitations
To understand the contribution of GPRL, we must first grasp why the current situation is unsatisfactory. The post-training of an LLM—that is, everything that happens after pre-training on terabytes of text—serves a single objective: making the model useful and safe.
The online RL path with verifiable rewards
This is the approach that exploded with reasoning models. The principle is elegant: for mathematics, code, formal logic, one can write a verifier that says whether the answer is correct. No need for a human in the loop.
The model generates a response, the verifier validates or rejects it, and the policy updates. This loop produces a fascinating phenomenon documented in several papers: the emergence of reasoning capabilities that were not present in the base model. This is what allows models like GPT-5.4 Pro or Gemini 3 Pro Deep Think to achieve high scores on reasoning benchmarks.
The problem? This approach only works where verification is possible. Poetry, relationship advice, marketing copywriting—everything that is "open-ended"—has no automatic verifier. Trying to force one produces a model that optimizes for the verifier at the expense of actual quality. This is classic reward hacking.
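As a rough illustration of the verification step in this loop, here is a minimal sketch of an automatic verifier for arithmetic answers. The function names and the "last integer in the response" rule are assumptions made for this example; they are not taken from the paper. The final call shows how a naive verifier can be satisfied without any real reasoning, which is the opening that reward hacking exploits.

```python
import re

def verify_math_answer(response: str, expected: int) -> float:
    """Hypothetical verifier: reward 1.0 if the last integer in the response equals `expected`."""
    numbers = re.findall(r"-?\d+", response)
    if not numbers:
        return 0.0  # no numeric answer found
    return 1.0 if int(numbers[-1]) == expected else 0.0

def score_samples(responses, expected):
    """Score every sampled response with the same verifier."""
    return [verify_math_answer(r, expected) for r in responses]

samples = ["2 + 2 = 4", "The answer is 5", "I think it is four"]
rewards = score_samples(samples, expected=4)
print(rewards)  # [1.0, 0.0, 0.0]

# A response with no reasoning at all still earns full reward:
gamed = verify_math_answer("No idea. 4", expected=4)
print(gamed)  # 1.0
```

The verifier only checks the surface form of the answer, so a policy optimized against it is rewarded for producing the right final token by any means.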
The preference optimization path
On the other side, we have DPO, KTO, classic RLHF, and their variants. The principle: pairs of responses are presented to an annotator (human or judge model), preferences are collected, and the policy is optimized to generate the preferred response more often.
This approach covers the entire spectrum of tasks. But it has structural weaknesses. Preference signals are inherently noisy: two annotators can diverge. The collection process is expensive. And above all, current preference optimization methods are mostly offline: data is collected first, then the policy is optimized on it. There is no active exploration by the model during training.
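To make the pairwise principle concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python on scalar log-probabilities. The numeric values are invented for illustration, and `beta=0.1` is an arbitrary choice, not a recommendation.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the loss equals log(2).
neutral = dpo_pair_loss(-1.0, -1.0, -1.0, -1.0)

# Policy already favors the chosen response: the loss is lower.
improved = dpo_pair_loss(-0.5, -2.0, -1.0, -1.0)
print(neutral, improved)
```

Note that the loss only consumes pairs that were collected beforehand; nothing in it asks the policy to generate new responses, which is the offline limitation described above.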
The gap between the two
What the community gradually realized is that these two paths are not complementary by design; they are simply juxtaposed. A lab does verifiable RL for reasoning, then a second pass in preference optimization for style and open-ended tasks. It's an artisanal pipeline, not a clean integration.
This is exactly the gap that GPRL fills.
What GPRL offers: a formal unification
The central contribution of the paper by Muhammad Umer et al. is not a new architecture or a new training trick. It is a formal framework that shows that online RL and preference optimization are two special cases of the same optimization problem.
The k-way preference structure
The key innovation is the generalization of the preference structure. Classical methods work in a pairwise fashion: we compare two responses, A vs B. GPRL extends this to a k-way comparison, where k responses are generated and compared simultaneously.
Why is this important? Because when k=2 and the signal comes from a human, we fall back to classical preference optimization. But when k is large and the preference signal is derived from an automatic verifier — the correct response is preferred over incorrect ones — we fall back to online RL with verifiable rewards.
Both approaches become points on the same spectrum, parameterized by the nature of the preference signal and the value of k. This is not a heuristic fusion; it is a mathematical unification.
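One way to picture this spectrum is to express both signals as the same data structure: a set of (winner, loser) index pairs over k responses. The sketch below is an assumption about how such a structure could be represented, not the paper's formal definition; `preferences_from_scores` is a hypothetical helper.

```python
from itertools import combinations

def preferences_from_scores(scores):
    """Return (winner_idx, loser_idx) pairs implied by per-response scores.

    Ties produce no preference.
    """
    prefs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] > scores[j]:
            prefs.append((i, j))
        elif scores[j] > scores[i]:
            prefs.append((j, i))
    return prefs

# k = 2: a human judge preferred response 0 over response 1.
human_prefs = preferences_from_scores([1, 0])
print(human_prefs)  # [(0, 1)]

# k = 4: a verifier marked responses 0 and 3 correct, 1 and 2 incorrect.
verifier_prefs = preferences_from_scores([1.0, 0.0, 0.0, 1.0])
print(verifier_prefs)  # [(0, 1), (0, 2), (3, 1), (3, 2)]
```

In this representation, the pairwise human case and the verifier case differ only in k and in where the scores come from.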
The unified policy update
From this unified formulation, the authors derive a single training algorithm. The policy (the LLM) is updated using the same mechanism, whether the signal comes from a code verifier or a human preference judge.
Concretely, this means that one can train a model by mixing batches of verifiable data and batches of open-ended preferences in the same optimization process, without switching algorithms. The policy learns to navigate between the two regimes fluidly.
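A minimal sketch of what mixing the two kinds of batches could look like is shown below. It swaps in a group-mean-centered advantage, a common baseline technique in policy-gradient methods, as a stand-in; the paper's actual update rule is not reproduced here, and the `source` and `scores` field names are hypothetical.

```python
from statistics import mean

def centered_advantages(scores):
    """Subtract the group mean so that above-average responses get positive weight."""
    baseline = mean(scores)
    return [s - baseline for s in scores]

def unified_batch_advantages(batch):
    """Apply the same advantage computation to every item, whatever its signal source."""
    return [
        {"source": item["source"],
         "advantages": centered_advantages(item["scores"])}
        for item in batch
    ]

mixed_batch = [
    # k = 4 responses scored by an automatic verifier
    {"source": "verifier", "scores": [1.0, 0.0, 0.0, 1.0]},
    # k = 2 responses scored by a preference judge
    {"source": "judge", "scores": [1.0, 0.0]},
]

result = unified_batch_advantages(mixed_batch)
for item in result:
    print(item["source"], item["advantages"])
# verifier [0.5, -0.5, -0.5, 0.5]
# judge [0.5, -0.5]
```

The point of the sketch is structural: once both signals are reduced to per-response scores within a group, a single code path can feed the policy update, with no branch on the signal type.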
This approach shares intuitions with other recent work on agent alignment, such as SDAR: how to train AI agents with reinforcement learning without breaking them (agentic self-distillation), which also explores how to make RL more stable during post-training.
Experimental Results: An Alignment That Sacrifices Nothing
An elegant theoretical framework is worth nothing without results. And this is where GPRL becomes interesting for industrial practice.
Performance on Verifiable Tasks
On math and code benchmarks, GPRL achieves performance comparable to pure online RL. This is already a non-trivial result: many unification attempts end up diluting performance on each front. Here, the model retains the emergent reasoning ability that verifiable RL produces.
This is crucial for models aiming for the top of the leaderboards. When you look at the current scores — Gemini 3.1 Pro at 92, GPT-5.5 at 91, Claude Opus 4.7 (Adaptive) at 90 on generalist benchmarks — a large part of the difference comes from the quality of post-training on verifiable reasoning.
Performance on Open-Ended Tasks
Where GPRL stands out is on tasks without a verifier. Models trained with GPRL surpass those trained solely on verifiable RL, and rival those trained on pure preference optimization. But with an advantage: stability.
The authors report less variance in the results, fewer examples of degenerate responses, and better calibration between the preference score and perceived quality. In other words, the model does not "game" the preference signal as we sometimes see with DPO.
The Synergistic Effect
The most compelling result is the following: on a mixture of verifiable and open-ended tasks, GPRL outperforms the approaches that run the two methods sequentially. The authors' hypothesis is that joint training allows the model to develop richer internal representations, usable in both regimes.
This is a point that echoes what we observe in the best current agentic models. GPT-5.5 dominates the agentic leaderboard at 98.2, followed by Gemini 3 Pro Deep Think at 95.4 and Claude Opus 4.7 (Adaptive) at 94.3. These models excel precisely because they combine formal reasoning and open-ended understanding fluidly, not in two separate steps.
Implications for current and future models
What this changes for major labs
For OpenAI, Google, Anthropic, and the others, GPRL offers a path for post-training improvement that is both simpler and more powerful. Instead of maintaining two separate pipelines — one for verifiable RL, one for RLHF — you can have just one.
The reduction in engineering complexity is far from negligible. The post-training of a model like GPT-5.4 Pro likely involves dozens of researchers, thousands of GPUs, and months of calibration. Simplifying the pipeline without sacrificing quality is a major operational gain.
What this changes for open models
The impact could be even stronger for the open-source ecosystem. DeepSeek V4 Pro (Max) at 88 points, Kimi K2.6 at 84, GLM-5.1 at 83 — these models do not have the same post-training resources as proprietary models. A unified framework that requires less manual tuning and less expensive preference data could close the gap.
Kimi K2.6 is particularly interesting in this context: the model from Moonshot AI reaches an agentic score of 88.1 when self-hosted, which suggests a post-training strategy already oriented toward simplicity and efficiency. GPRL could amplify this approach.
The link with billing and costs
More efficient post-training has a direct impact on costs. Online RL is compute-hungry because it requires numerous generations and verifications. Preference optimization is data-hungry for annotations. By reducing both through a unified framework, we can hope for models that are cheaper to produce — and potentially cheaper to use.
To understand how these costs affect billing, our article on tokens, context, costs: understanding LLM billing details the mechanisms at play.
Tools and models involved
| Model | Overall score | Agentic score | GPRL Relevance |
|---|---|---|---|
| Gemini 3.1 Pro (Google) | 92 | 87.3 | Strong — hybrid post-training |
| GPT-5.5 (OpenAI) | 91 | 98.2 | Maximum — top agentic + general |
| Claude Opus 4.7 (Adaptive) (Anthropic) | 90 | 94.3 | Strong — similar adaptive approach |
| Gemini 3 Pro Deep Think (Google) | 90 | 95.4 | Strong — advanced verifiable RL |
| Grok 4.1 (xAI) | 90 | 79 | Moderate — different focus |
| DeepSeek V4 Pro (Max) (DeepSeek) | 88 | — | Strong — open-source, would benefit from simplification |
| Claude Sonnet 4.6 (Anthropic) | 83 | 81.4 | Moderate — mid-range model |
| GLM-5.1 (Z.AI) | 83 | — | Strong — open-source, similar context to DeepSeek |
The limitations of the paper
Experimental evidence is still limited
Despite the selection at ICML 2026, the paper's experiments remain confined to a limited number of base models and domains. The demonstration that GPRL scales to models the size of GPT-5.5 or Claude Opus 4.7 does not yet exist in the paper. This is a limitation that the authors acknowledge.
The cost of k-way comparison
Moving from pairwise to k-way increases the generation cost per training step. If k=8, you need to generate 8 responses instead of 2 before being able to update the policy. For large models, this can become prohibitive even if the total number of optimization steps decreases. The trade-off between compute per step and number of steps is not fully resolved.
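The trade-off can be put in numbers with a back-of-the-envelope sketch. It counts generations only and ignores verification, judging, and gradient-update costs, which is a simplifying assumption; the batch size of 64 prompts is an arbitrary example value.

```python
def generations_per_step(batch_prompts: int, k: int) -> int:
    """Number of responses sampled before one policy update."""
    return batch_prompts * k

def break_even_step_ratio(k: int, baseline_k: int = 2) -> float:
    """Fraction of the baseline step count at which total generations are equal."""
    return baseline_k / k

pairwise_cost = generations_per_step(batch_prompts=64, k=2)
kway_cost = generations_per_step(batch_prompts=64, k=8)
ratio = break_even_step_ratio(k=8)

print(pairwise_cost)  # 128
print(kway_cost)      # 512
print(ratio)          # 0.25
```

Under these assumptions, k=8 costs four times as many generations per step as pairwise training, so it only breaks even on generation compute if it converges in a quarter of the steps or fewer.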
Dependence on the preference judge
In the open-ended part of the framework, GPRL inherits the weaknesses of classic preference optimization. If the judge (human or model) is biased, the preference signal will be biased, and unification does not solve this fundamental problem. The authors mention it but do not propose an integrated solution.
The theory-industry practice gap
Labs are already doing sophisticated things in post-training that are not published. It is possible that some already have informal hybrid approaches that capture part of the benefits of GPRL, without the formal framework. The added value of the paper is as much theoretical (clarifying the landscape) as practical (proposing a concrete algorithm).
Connection with other recent advances
GPRL doesn't come out of nowhere. It is part of a broader movement to rethink post-training.
SDAR (Self-Distillation Agentic Reinforcement Learning), which we detailed in our article on SDAR: how to train AI agents with reinforcement learning without breaking them, explores another dimension of the problem: how to prevent RL from destroying the model's pre-existing capabilities during agentic training. GPRL and SDAR are complementary: one unifies the training signals, the other stabilizes the process.
Advances in AI vision also raise an interesting question for GPRL: how does the framework extend to multimodal settings? Verifiable RL works well in pure text (code, math), but evaluating the quality of an image analysis is inherently a preference task. GPRL could offer a natural framework for mixing verifiable signals (does the image contain this object?) and preference signals (is this analysis more useful than that one?).
❌ Common mistakes
Mistake 1: Confusing GPRL with a simple sequence of RL then DPO
The most frequent mistake in discussions around this paper is reducing it to "doing RL then DPO". That's not it. GPRL is a single optimization algorithm where both types of signals are mixed in the same policy update. The distinction is fundamental: it's the difference between a sequential pipeline and a true integration.
Mistake 2: Thinking that GPRL makes RLHF obsolete
GPRL unifies online RL and preference optimization. It does not eliminate the need for human preference data. In purely open-ended tasks where no verifier exists, the preference signal remains the only option. GPRL changes the way we use this signal, not the fact that we need it.
Mistake 3: Equating k-way with a simple ranking
GPRL's k-way structure is not a naive ranking. It is a formal extension of the Bradley-Terry model (the mathematical model underlying DPO) to the multi-choice case, with consistency properties that guarantee the induced preference order is well-defined. Reducing this to "we rank k responses" loses all the theoretical substance.
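For intuition, the textbook multi-choice generalization of Bradley-Terry is the Plackett-Luce model. Whether GPRL uses exactly this form is an assumption here; the sketch below only shows that a k-way likelihood can reduce to the pairwise Bradley-Terry probability when k=2.

```python
import math

def plackett_luce_log_likelihood(scores_in_rank_order):
    """Log-probability of a full ranking (best first) under Plackett-Luce.

    At each position, the chosen item competes via a softmax against all
    items that have not been ranked yet.
    """
    total = 0.0
    for i, score in enumerate(scores_in_rank_order):
        remaining = scores_in_rank_order[i:]
        log_norm = math.log(sum(math.exp(s) for s in remaining))
        total += score - log_norm
    return total

def bradley_terry_log_prob(score_winner, score_loser):
    """Pairwise Bradley-Terry: log sigmoid(score_winner - score_loser)."""
    return -math.log1p(math.exp(-(score_winner - score_loser)))

# With k = 2, the k-way likelihood matches the pairwise one.
pl_two = plackett_luce_log_likelihood([1.0, 0.0])
bt_two = bradley_terry_log_prob(1.0, 0.0)
print(abs(pl_two - bt_two) < 1e-12)  # True

# With k = 3, the same function scores a full ordering.
pl_three = plackett_luce_log_likelihood([2.0, 1.0, 0.0])
print(pl_three)
```

The difference from a naive ranking is that the model assigns a probability to every possible ordering, so the training objective stays a proper likelihood for any k.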
❓ Frequently Asked Questions
Does GPRL replace DPO and RLHF?
No. GPRL is a unified framework that subsumes these approaches as special cases. DPO corresponds to k=2 with a human preference signal. Online RL corresponds to large k with a verifiable signal. Both coexist within the GPRL framework.
Is this paper immediately applicable in production?
Not directly. The experiments are promising but limited in scale. Adapting to models with hundreds of billions of parameters will require engineering optimizations that are not described in the paper. This is research work, not a ready-to-use recipe.
What is the connection with the best LLM rankings?
The models that would benefit the most from GPRL are those that must excel in both formal reasoning and open-ended generation — exactly what generalist and agentic benchmarks measure. To follow the evolution, check out our monthly comparison of the best LLMs.
Does GPRL work for local models?
In principle yes, but the cost of k-way generation is a barrier for individual users. For local models, lightweight post-training approaches remain more realistic. Our guide to installing a local LLM and our selection of the best local LLMs remain the recommended starting points.
✅ Conclusion
GPRL is the first framework that transforms LLM post-training from a two-speed craft into a unified optimization problem. By showing that verifiable RL and preference optimization are special cases of the same k-way structure, Muhammad Umer et al. give the community a conceptual and practical tool that could define the next step in alignment. It remains to be seen whether the theoretical gains will hold up at industrial scale, but the ICML 2026 signal is strong. To keep track of the models that will likely integrate these advances, our comparison of the best LLMs is updated every month.