SDAR: how to train AI agents with reinforcement learning without breaking them — self-distillation agentic

LLM & Models 🟢 Beginner ⏱️ 15 min read 📅 2026-05-16

🔎 RL for AI agents has a signal problem, and SDAR fixes it

Reinforcement learning (RL) has become the central paradigm for post-training LLM-based agents. Models like GPT-5.5 (agentic score 98.2) or Claude Opus 4.7 (94.3) owe much of their agentic capabilities to this type of training. But RL suffers from a major structural flaw: the reward signal is coarse, evaluated at the level of the complete trajectory, not at the level of each token.

On-policy self-distillation (OPSD) has emerged as a promising complement. It provides dense, token-by-token guidance, using the model itself as a teacher. Except that when the trajectories diverge between the teacher and the student, OPSD becomes unstable and can degrade performance.

This is exactly the problem that SDAR (Self-Distilled Agentic Reinforcement Learning) solves. The paper was published on May 14, 2026, by Zhejiang University (arXiv 2605.15155). Its core insight: a simple per-token sigmoid gate that lets each token regulate its own distillation intensity. Results: +9.4% on ALFWorld, +10.2% on WebShop, +7.0% on Search-QA compared to GRPO alone. The code is open-source on GitHub.


The Essentials

  • RL (GRPO) remains the backbone of the main optimization in SDAR. It is not replaced; it is complemented by an auxiliary distillation loss.
  • Naïve OPSD fails when teacher-student trajectories diverge, because it forces uniform distillation on all tokens, including those where the teacher makes mistakes.
  • The sigmoid gate calculates a log-probability gap per token between the teacher and student, and uses it to regulate the distillation intensity: reinforcement on approved tokens, attenuation on rejected ones.
  • The gains are measurable: +9.4% ALFWorld, +10.2% WebShop, +7.0% Search-QA vs GRPO alone, with minimal overhead.
  • The code is available at ZJU-REAL/SDAR on GitHub, allowing for immediate reproduction and adaptation.

| Tool | Main usage | Price (June 2025, check website) | Ideal for |
| --- | --- | --- | --- |
| SDAR (GitHub) | Agentic fine-tuning with self-distillation | Open-source (MIT) | Researchers and ML teams |
| HuggingFace Papers | Paper discussion and benchmarks | Free | Community tracking |
| GPT-5.5 | Reference agentic LLM | Paid (OpenAI API) | High-performance production agents |
| Claude Opus 4.7 | Advanced reasoning agent | Paid (Anthropic API) | Agents with long reasoning |
| Ollama | Local LLM execution | Free | Local testing before deployment |

What GRPO really does (and why it's no longer enough)

GRPO (Group Relative Policy Optimization) is the dominant RL method for the post-training of agentic LLMs. Specifically, it generates a group of candidate trajectories for the same task, evaluates them using a reward function, and then optimizes the model's policy to favor the highly-rated trajectories.

The problem: the reward arrives at the end of the trajectory. For a sequence of 500 tokens representing a complex action plan, GRPO says "this trajectory is worth 0.8/1" but does not say which token was decisive. This is an extremely sparse signal.

For the best autonomous AI agents that need to chain together dozens of sequential actions (web navigation, object manipulation, search queries), this lack of granularity is a real bottleneck. The model globally learns "this strategy works" but doesn't know precisely why or where.

The other limitation: GRPO compares trajectories to each other in a relative manner. If all the generated trajectories are mediocre, the reward signal becomes noisy. There is no absolute reference, just a relative ranking within a potentially weak group.
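To make both limitations concrete, here is a minimal sketch of group-relative scoring. It assumes the commonly described mean/standard-deviation normalization of rewards within a group and leaves out the rest of the GRPO objective; the function names and reward values are illustrative, not taken from the SDAR code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each trajectory relative to its own group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    """Every token of a trajectory inherits the same scalar advantage."""
    return [advantage] * num_tokens

strong_group = [0.7, 0.8, 0.9]   # good trajectories
weak_group = [0.1, 0.2, 0.3]     # all mediocre in absolute terms

adv_strong = group_relative_advantages(strong_group)
adv_weak = group_relative_advantages(weak_group)

# The best trajectory of the weak group is pushed as hard as the best
# trajectory of the strong group: only the ranking inside the group matters.
print(adv_strong)
print(adv_weak)

# A 500-token plan receives one number, repeated 500 times.
token_signal = broadcast_to_tokens(adv_strong[2], num_tokens=500)
print(len(token_signal), len(set(token_signal)))
```

Both groups produce the same advantages even though one is far better in absolute terms, and no token inside a trajectory is distinguished from its neighbors.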


OPSD: the good idea that breaks when things diverge

On-Policy Self-Distillation (OPSD) is the natural answer to RL sparsity. The principle: use the model itself as a teacher by giving it a privileged context (for example, skills fetched by a retrieval system), then distill its outputs into the standard version of the model (the student).

The advantage is immediate: instead of a trajectory-level signal, we get token-level guidance. Each token produced by the teacher becomes a dense learning signal for the student. This is exactly what GRPO lacks.

Except that OPSD has an Achilles' heel, documented in the open peer-review discussions on alphaXiv: when the teacher and the student produce divergent trajectories, forced distillation becomes counterproductive. The student is pulled toward tokens that the teacher generated in a different context, and this introduces noise into the learning.

Even worse: naive OPSD applies the same distillation intensity to all tokens. A token where the teacher and the student agree receives the same treatment as a token where they completely diverge. It's like forcing a student to copy every word from their teacher, even when the teacher is talking nonsense.
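The uniform treatment can be sketched as follows. This is a toy illustration, not the paper's implementation: it assumes the distillation term is a per-position KL divergence from teacher to student (the direction is an assumption) over a three-token vocabulary with made-up probabilities.

```python
import math

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) over the vocabulary at one token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )

def naive_opsd_loss(teacher_dists, student_dists):
    """Uniform average: every position gets exactly the same weight."""
    per_token = [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token), per_token

# Position 0: teacher and student roughly agree.
# Position 1: the trajectories have diverged and the teacher prefers another token.
teacher_dists = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
student_dists = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]

loss, per_token = naive_opsd_loss(teacher_dists, student_dists)
print(per_token)  # the divergent position dominates the average
print(loss)
```

Because nothing weights the positions, the divergent one contributes most of the loss, which is exactly where the teacher's signal is least trustworthy.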

Related work on structured distillation for LLM agents has shown that segmenting trajectories into [REASON] and [ACT] blocks with specific losses improves coherence. But it remains a fixed approach that does not dynamically adapt to the quality of each token.


SDAR: the sigmoid gate that changes everything

SDAR solves the instability problem of OPSD with an elegant idea: not applying distillation uniformly, but letting each token decide its own distillation intensity via a sigmoid gate.

The mechanism works in three steps, described in the original paper and detailed in the study artifacts on GitHub.

Calculating the per-token gap

For each token, SDAR calculates the difference between the teacher's log-probability and that of the student. This "log-probability gap" is a detached signal (it does not participate in the main gradient). It simply measures: "is the teacher more confident than the student on this token?"

A positive gap means that the teacher is more certain about this token than the student — this is a useful distillation signal. A negative gap means that the student is actually more confident than the teacher — forcing distillation here would be harmful.
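As a minimal sketch of this step, assume we already have the probability that the teacher pass and the student pass each assign to the tokens actually sampled. The probabilities below are invented for illustration, and `log_prob_gap` is a hypothetical helper name.

```python
import math

def log_prob_gap(teacher_logps, student_logps):
    """Per-token gap: teacher log-probability minus student log-probability."""
    return [t - s for t, s in zip(teacher_logps, student_logps)]

# Probability assigned to each sampled token under the two contexts.
teacher_probs = [0.9, 0.5, 0.1]
student_probs = [0.3, 0.5, 0.6]

teacher_logps = [math.log(p) for p in teacher_probs]
student_logps = [math.log(p) for p in student_probs]

gaps = log_prob_gap(teacher_logps, student_logps)
for token_index, gap in enumerate(gaps):
    if gap > 0:
        verdict = "teacher more confident"
    elif gap < 0:
        verdict = "student more confident"
    else:
        verdict = "equal confidence"
    print(token_index, round(gap, 3), verdict)
```

In a real training loop these values would be treated as constants, so that they inform the weighting without receiving gradients.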

The sigmoid gate as a regulator

This gap then passes through a sigmoid function, which maps it between 0 and 1. The result is a per-token distillation weight:

  • Strong positive gap → sigmoid close to 1 → maximum distillation on this token
  • Gap close to zero → sigmoid around 0.5 → moderate distillation
  • Strong negative gap → sigmoid close to 0 → almost zero distillation
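The three regimes above can be checked with a few lines. This sketch applies a plain logistic sigmoid directly to the gap; whether the paper scales or shifts the gap before the sigmoid is not stated here, so treat that as an assumption.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real gap into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gate_weights(gaps):
    """One distillation weight per token, derived from its log-probability gap."""
    return [sigmoid(g) for g in gaps]

weights = gate_weights([4.0, 0.0, -4.0])
for gap, weight in zip([4.0, 0.0, -4.0], weights):
    print(f"gap={gap:+.1f} -> weight={weight:.3f}")
```

A gap of +4 gives a weight above 0.98, a gap of 0 gives exactly 0.5, and a gap of -4 gives a weight below 0.02, so the transition between "distill" and "ignore" is gradual rather than a hard cut.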

This is exactly what naive OPSD was missing: a fine-grained, smooth regulation that gently attenuates tokens with a negative gap instead of forcing distillation on them.

The final loss: GRPO + gated OPSD

The total loss of SDAR is the sum of the GRPO loss (the RL backbone, unchanged) and the OPSD loss multiplied by the sigmoid gate. GRPO keeps control of the global optimization, while distillation provides dense guidance where it is useful.
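Written out, the description above corresponds to something like the following. The notation is ours, not necessarily the paper's: $D_t$ stands for the per-token distillation term, $\operatorname{sg}[\cdot]$ for the stop-gradient (detach) operator, $c^{\text{teacher}}$ and $c^{\text{student}}$ for the two contexts, and $\lambda$ is a hypothetical weighting coefficient, with $\lambda = 1$ recovering the plain sum.

```latex
\mathcal{L}_{\text{SDAR}}
  = \mathcal{L}_{\text{GRPO}}
  + \lambda \, \frac{1}{T} \sum_{t=1}^{T} g_t \, D_t,
\qquad
g_t = \sigma\!\Big( \operatorname{sg}\big[
      \log \pi_\theta(y_t \mid c^{\text{teacher}}, y_{<t})
    - \log \pi_\theta(y_t \mid c^{\text{student}}, y_{<t})
  \big] \Big)
```

The same policy $\pi_\theta$ appears in both terms of the gap, which reflects that teacher and student share weights and differ only in context.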

As explained in the daily summary by Fugumt, the sigmoid gate processes the teacher-student log-probability gap per token to regulate the distillation intensity, and this processing is done on detached signals so as not to interfere with the main gradient flow.

Benchmark results: the numbers that speak for themselves

The benchmarks published in the paper compare three configurations: GRPO alone, GRPO + naive OPSD, and GRPO + SDAR (gated OPSD). The results are conclusive.

| Benchmark | GRPO alone | GRPO + naive OPSD | GRPO + SDAR | SDAR vs GRPO gain |
| --- | --- | --- | --- | --- |
| ALFWorld | 74.2% | 76.8% | 83.6% | +9.4% |
| WebShop | 63.5% | 65.1% | 73.7% | +10.2% |
| Search-QA | 71.3% | 70.8% | 78.3% | +7.0% |

Two crucial observations. First, naive OPSD brings marginal gains on ALFWorld and WebShop, but degrades performance on Search-QA (70.8% vs 71.3%, i.e. -0.5 points). This confirms the instability: naive OPSD can do more harm than good.

Second, SDAR completely corrects this problem. The gains are consistent across all three benchmarks, peaking at +10.2% on WebShop — an e-commerce navigation benchmark that is particularly demanding in terms of sequential planning.
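The gains quoted above are differences between success rates, that is, percentage points. The short script below recomputes them from the table and also derives the relative improvement, which is not reported in the article and is shown only as a cross-check.

```python
# Success rates (%) copied from the benchmark table.
results = {
    "ALFWorld": {"grpo": 74.2, "naive_opsd": 76.8, "sdar": 83.6},
    "WebShop": {"grpo": 63.5, "naive_opsd": 65.1, "sdar": 73.7},
    "Search-QA": {"grpo": 71.3, "naive_opsd": 70.8, "sdar": 78.3},
}

def point_gain(new, base):
    """Absolute difference in percentage points."""
    return round(new - base, 1)

def relative_gain(new, base):
    """Improvement expressed as a percentage of the baseline."""
    return round(100.0 * (new - base) / base, 1)

for name, row in results.items():
    print(
        name,
        "SDAR:", point_gain(row["sdar"], row["grpo"]), "pts",
        "| naive OPSD:", point_gain(row["naive_opsd"], row["grpo"]), "pts",
        "| SDAR relative:", relative_gain(row["sdar"], row["grpo"]), "%",
    )
```

Read this way, the +9.4 on ALFWorld means 74.2% became 83.6%, and Search-QA is the only benchmark where naive OPSD lands below the GRPO baseline.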

ALFWorld measures the ability to manipulate objects in a virtual domestic environment. WebShop evaluates goal-oriented web navigation (searching for a product, finding it, buying it). Search-QA tests the ability to formulate search queries and extract answers. The diversity of these benchmarks suggests that SDAR generalizes well beyond a specific task type.

To provide context, these gains are achieved without changing the model architecture, without additional data, and with minimal computational overhead (the calculation of the sigmoid gate is negligible compared to the LLM's forward pass).


Technical architecture: how to implement SDAR

The implementation of SDAR fits into a standard RL training loop. Here is the general architecture, as found in the ZJU-REAL/SDAR repo.

Double forward pass

At each training step, the model performs two forward passes: the first with the standard context (student), the second with the enriched context (teacher). The enriched context typically contains the skills fetched by the retrieval system, which serve as privileged information.

Both passes share the same weights. The difference in context is what creates the log-probability gap. It is not a separate teacher model: it is the same model in two different context configurations.
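The idea of one set of weights scored under two contexts can be illustrated with a toy stand-in for the model. `ToyPolicy`, its bonus rule, and the task and skill strings are all hypothetical; a real implementation would call the LLM twice with different prompts.

```python
import math

class ToyPolicy:
    """Stand-in for an LLM: one set of weights, context-dependent output."""

    def __init__(self, token_bias):
        self.token_bias = token_bias  # the shared "weights"

    def log_probs(self, context):
        # A token mentioned in the context gets a logit bonus, mimicking how
        # a retrieved skill makes the right action more likely.
        logits = {
            tok: bias + (2.0 if tok in context else 0.0)
            for tok, bias in self.token_bias.items()
        }
        log_z = math.log(sum(math.exp(v) for v in logits.values()))
        return {tok: v - log_z for tok, v in logits.items()}

policy = ToyPolicy({"open": 0.0, "take": 0.0, "look": 0.0})
task = "put the mug in the sink"
skill = "skill: take the mug before moving"

student_lp = policy.log_probs(task)                # standard context
teacher_lp = policy.log_probs(skill + " " + task)  # privileged context

gap = teacher_lp["take"] - student_lp["take"]
print(round(gap, 3))  # positive: the enriched context favors "take"
```

The same `policy` object serves both passes, so the gap comes entirely from the extra context rather than from a second model.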

Parallel loss computation

The GRPO loss is calculated normally on the group of trajectories. In parallel, for each teacher-student pair, the log-probability gap is calculated per token, passed through the sigmoid, and used to weight the KL divergence between the teacher and student distributions.

The two losses are summed before backpropagation, so their gradients add up. Detaching the gap ensures that the sigmoid gate does not receive a gradient: it acts as a simple signal regulator, not as a learned parameter.
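Putting the pieces together, a sketch of the gated loss might look like this. It uses plain Python floats, so nothing is actually differentiated; in an autodiff framework the gate would be wrapped in a stop-gradient such as `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX. The KL direction, the `distill_coef` parameter, and the toy distributions are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) at one token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )

def gated_distillation_loss(teacher_dists, student_dists, sampled_ids):
    """Average per-token KL, each term weighted by its sigmoid gate."""
    total = 0.0
    for t_dist, s_dist, tok in zip(teacher_dists, student_dists, sampled_ids):
        gap = math.log(t_dist[tok]) - math.log(s_dist[tok])
        gate = sigmoid(gap)  # treated as a constant: no gradient through it
        total += gate * kl_divergence(t_dist, s_dist)
    return total / len(sampled_ids)

def total_loss(grpo_loss, distill_loss, distill_coef=1.0):
    """Sum of the RL loss and the gated auxiliary loss."""
    return grpo_loss + distill_coef * distill_loss

teacher_dists = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
student_dists = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
sampled_ids = [0, 0]  # tokens actually produced by the student

gated = gated_distillation_loss(teacher_dists, student_dists, sampled_ids)
uniform = sum(
    kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)
) / len(sampled_ids)
print(gated, uniform)
print(total_loss(grpo_loss=1.0, distill_loss=gated))
```

On the second position the teacher assigns the sampled token a much lower probability than the student does, so its gate is small and its large KL term is mostly muted, whereas the uniform average lets it dominate.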

Compatibility with current LLMs

SDAR is agnostic to the base model. The Zhejiang researchers tested it on standard transformer architectures, but nothing prevents applying it to the best LLMs on the market like GPT-5.5, Claude Opus 4.7 or Gemini 3 Pro Deep Think, provided you have access to the weights for fine-tuning.

For teams working locally, the approach is compatible with Ollama or LM Studio pipelines, provided the RL loop is added on top. The main constraint remains the need to generate groups of trajectories for GRPO, which requires significant GPU resources.


Why this is important for agent fine-tuning

SDAR arrives at a critical time. Agentic AI governance is becoming a central issue for enterprises, and the ability to fine-tune reliable and controllable agents is a major differentiator.

Trajectory reliability

The number one problem with agents in production is reliability. An agent that succeeds 74% of the time (GRPO alone on ALFWorld) is not deployable in production. At 83.6% (SDAR), we start to enter a usable zone, especially if this is coupled with retry and fallback mechanisms.
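To see why retries matter at these success rates, here is a back-of-the-envelope calculation. It assumes each attempt succeeds independently with the same probability, which is optimistic for real agents whose failures tend to be correlated, so read the numbers as an upper bound.

```python
def success_with_retries(p_single, attempts):
    """Probability of at least one success over independent attempts."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - p_single) ** attempts

for label, p in [("GRPO alone", 0.742), ("SDAR", 0.836)]:
    for attempts in (1, 2, 3):
        rate = success_with_retries(p, attempts)
        print(f"{label}: {attempts} attempt(s) -> {rate:.3f}")
```

Under this assumption, one retry takes the SDAR success rate on ALFWorld from 83.6% to roughly 97%, while the GRPO baseline needs a second retry to get there.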

More efficient fine-tuning

By providing a token-level signal where GRPO only has a trajectory-level signal, SDAR accelerates learning. The model does not have to "guess" which tokens were good in a rewarded trajectory — distillation tells it directly. This means fewer training steps to reach a given level of performance.

Connection with CRM and API architectures

For agents interacting with enterprise systems, reliability is non-negotiable. An agent connected to a headless CRM like Salesforce Headless 360 cannot afford to diverge in the middle of a sequence of actions. SDAR precisely reduces this risk by reinforcing token-level consistency.

Personalization through user data

The same guided distillation logic can be applied when training your AI avatar with your own data. The teacher's privileged context is not limited to retrieved skills: it can integrate personal data, user preferences, or domain-specific knowledge. The sigmoid gate ensures that this personalization does not degrade the model's general capabilities.


Comparison with other agentic distillation approaches

SDAR is not the only attempt to improve distillation for agents. But it stands out for its minimalist approach and its efficiency.

| Approach | Mechanism | Granularity | Adaptability | Reported instability |
| --- | --- | --- | --- | --- |
| GRPO alone | RL trajectory-level | Trajectory | N/A | Stable but sparse signal |
| Naive OPSD | Uniform distillation | Token | Low | Yes, on divergence |
| Structured distillation | Segments [REASON]/[ACT] | Segment | Medium | Partial |
| SDAR | Sigmoid gate per token | Token | High | No |

Structured distillation (segmentation into REASON/ACT blocks) improves coherence compared to naive OPSD, but remains a discrete approach with fixed segment boundaries. SDAR, with its continuous per-token regulator, dynamically adapts to the local quality of each position in the trajectory.

An important point: SDAR is orthogonal to these other approaches. Theoretically, one could combine structured segmentation with SDAR's sigmoid gate for an additional gain. The researchers did not explore this combination in the initial paper, but it is an open avenue.


Limitations and open questions

SDAR is not a silver bullet. The paper is honest about certain limitations, and others emerge upon careful reading.

Dependence on the teacher's privileged context

The mechanism relies on the fact that the teacher (same model, enriched context) produces higher quality outputs than the student. If the skill retrieval is poor, if the privileged context is noisy, the gate will reinforce bad signals. The quality of the retrieval system is therefore a limiting factor.

Scalability to very long trajectories

The tested benchmarks (ALFWorld, WebShop, Search-QA) involve trajectories of a few hundred tokens. For agents operating over much longer horizons — thousands of tokens, dozens of tool-use iterations — the behavior of the sigmoid gate on very long sequences has not been validated.

Computational cost of the double forward pass

Even if the gate itself is negligible, the double forward pass (student + teacher) doubles the cost of each training step. For the best free LLMs at inference time, this cost is invisible. For training models of 70B+ parameters, it is a non-trivial cost factor.

Absence of testing on state-of-the-art LLMs

The paper does not specify exactly which base models were used for the benchmarks (size, architecture). We do not know how SDAR scales on models like GPT-5.5 (98.2 agentic) or Claude Opus 4.7 (94.3) vs more modest models. The effect of the gate could be different on models already highly performant in RL.


❌ Common mistakes

Mistake 1: Confusing SDAR with classic distillation (separate teacher → student)

SDAR does not use a separate teacher model. It is the same model with two different contexts. Confusing this with traditional distillation (where a large model like GPT-5.5 teaches a small model) leads to bad architectural decisions. The solution: reread section 3.1 of the paper, which explicitly describes weight sharing.

Mistake 2: Naively applying OPSD thinking that "more distillation = better"

This is the mistake that SDAR corrects. The paper's data shows that naive OPSD degrades Search-QA by 0.5 points. Forcing distillation uniformly across all tokens, including those where the teacher diverges, is counterproductive. The solution: use the sigmoid gate, or at a minimum, filter out tokens whose gap is low or negative.
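The fallback filter mentioned here can be sketched as a hard mask on the log-probability gap. This is not SDAR itself but the cruder alternative; `hard_filter_mask` and its `threshold` parameter are hypothetical names, and the gap values are invented.

```python
import math

def hard_filter_mask(gaps, threshold=0.0):
    """Keep a token (weight 1.0) only if its gap exceeds the threshold."""
    return [1.0 if g > threshold else 0.0 for g in gaps]

def sigmoid_gate(gaps):
    """Soft alternative: a weight in (0, 1) for every token."""
    return [1.0 / (1.0 + math.exp(-g)) for g in gaps]

gaps = [1.10, 0.05, -1.79]  # teacher minus student log-probability

print(hard_filter_mask(gaps))                 # keeps the first two tokens
print(hard_filter_mask(gaps, threshold=0.1))  # also drops the weak positive gap
print([round(w, 3) for w in sigmoid_gate(gaps)])
```

The hard mask needs a threshold to be chosen and treats a gap of 0.05 the same as a gap of 1.10, while the sigmoid weights tokens in proportion to how strongly the teacher backs them.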

Mistake 3: Replacing GRPO with SDAR

SDAR does not replace GRPO, it complements it. The GRPO loss remains the main optimization backbone. The distillation loss is auxiliary. Removing GRPO to keep only self-distillation would eliminate the external reward signal, which is essential for aligning the agent with the task objectives.

Mistake 4: Ignoring the quality of skill retrieval

The gate only regulates the intensity of the distillation. If the skills injected into the teacher's context are of poor quality, the gate will still reinforce problematic tokens (positive gap based on a poorly informed teacher). The upstream retrieval system is a critical link.


❓ Frequently Asked Questions

Does SDAR work with open-source LLMs like the ones you can run locally?

Yes, SDAR is base-model agnostic. It is particularly well-suited for local LLMs where you control the entire training pipeline. The main constraint is the need for sufficient GPUs for the double forward pass and GRPO trajectory generation.

Is the sigmoid gate a learned parameter?

No, it is a fixed calculation (sigmoid of the log-probability gap) with detached signals. It has no trainable parameters. It is a signal regulator, not a neural module. This simplicity is an asset: no additional hyperparameters to tune.

Is SDAR compatible with traditional RLHF?

SDAR is designed for agentic post-training with GRPO, not for classic RLHF (which directly optimizes on human preferences). However, the sigmoid gate principle could theoretically be adapted to other forms of RL. This is a research direction not explored in the paper.

What computational overhead does SDAR add?

The gate calculation is negligible. The real cost is the double forward pass (student + teacher), which roughly doubles the cost per step. On the other hand, faster convergence (thanks to the dense signal) can offset this extra cost by reducing the total number of steps.

Can SDAR be used for search agents like Perplexity or NotebookLM?

The Search-QA benchmark comes close, but SDAR is a training method, not a product. The best LLMs for search could benefit from SDAR during their post-training phase, but the end user does not interact with the gate mechanism directly.


✅ Conclusion

SDAR is one of the most pragmatic contributions of 2026 for fine-tuning AI agents: a simple mechanism (a sigmoid gate per token), solid and reproducible gains (+7 to +10% on three distinct benchmarks), and open-source code that allows you to adopt it immediately. If you are fine-tuning agents with RL and experiencing the instability of self-distillation, SDAR should be your next step. The repo is ZJU-REAL/SDAR on GitHub.