The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

Researchers have developed DPH-RL (Diversity-Preserving Hybrid Reinforcement Learning), a novel framework that addresses the paradoxical degradation of solution diversity when LLMs are fine-tuned with Reinforcement Learning with Verifiable Reward (RLVR). The method repurposes mass-covering f-divergences, such as forward KL and Jensen-Shannon divergence, as rehearsal mechanisms that prevent catastrophic forgetting while maintaining high Pass@k performance. Experiments on mathematical reasoning and SQL generation tasks demonstrate that DPH-RL resolves the diversity collapse that plagues standard RL fine-tuning approaches.

New AI Training Method Solves Critical Paradox in Language Model Fine-Tuning

A new research framework tackles a persistent and paradoxical failure in advanced AI training: the degradation of a model's ability to generate multiple diverse solutions even as its single-best-answer accuracy improves. This common issue, which plagues the fine-tuning of Large Language Models (LLMs) using Reinforcement Learning with Verifiable Reward (RLVR), is now addressed by a novel approach that repurposes a core mathematical component as a "rehearsal mechanism" to preserve knowledge.

The Core Problem: Catastrophic Forgetting in RL Fine-Tuning

When developers fine-tune LLMs for tasks like code or math problem-solving, they often use RLVR, where the model is rewarded for generating correct, verifiable answers. A frequent but counterintuitive outcome is that while Pass@1 (accuracy on the first attempt) rises, the model's Pass@k performance (the probability that at least one of *k* sampled solutions is correct) plummets, because the sampled solutions lose their variety. This is accompanied by catastrophic forgetting, in which the model loses previously learned skills and diversity. The researchers identify the choice of divergence term in the RL objective as a surprisingly overlooked culprit.
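For concreteness, Pass@k is typically computed with the standard unbiased estimator from the code-generation literature (a minimal sketch, not code from this paper): given n sampled solutions of which c are correct, it gives the probability that a random subset of k samples contains at least one correct one.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    solutions drawn (without replacement) from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        # Every size-k subset must contain a correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A model that concentrates all probability on one solution can keep Pass@1 high while its Pass@k barely improves with larger k; a diverse model sees Pass@k climb steeply.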

Standard approaches either use a reverse KL-divergence, which narrows the model's policy to a single high-reward mode, or forgo a divergence term entirely. The former actively accelerates knowledge decay by restricting output diversity, while the latter provides no safeguard, allowing the model to drift arbitrarily far from its initial, broad knowledge base. Both lack a mechanism for proactive retention.
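The asymmetry between the two KL directions can be seen in a few lines on toy categorical distributions. The policies below are hypothetical and only illustrate why reverse KL tolerates mode collapse while the mass-covering forward KL punishes it:

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions (natural log),
    skipping zero-mass bins of a."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

# Hypothetical example: a bimodal initial policy pi0 over four solution
# "modes", and a fine-tuned policy pi that has collapsed onto one mode.
pi0 = [0.45, 0.45, 0.05, 0.05]   # initial policy: two strong modes
pi  = [0.97, 0.01, 0.01, 0.01]   # fine-tuned policy: collapsed

# Reverse KL, KL(pi || pi0), is the usual RLVR penalty: it stays small
# whenever pi remains inside pi0's support, so collapsing onto a single
# mode is barely penalized.
print("reverse KL:", kl(pi, pi0))

# Forward KL, KL(pi0 || pi), is mass-covering: it grows large when pi
# abandons probability mass that pi0 assigned, so the collapse is
# penalized heavily.
print("forward KL:", kl(pi0, pi))
```

On this toy example the forward direction assigns a noticeably larger penalty to the collapsed policy than the reverse direction does, which is the intuition behind treating the divergence choice as the key lever.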

The DPH-RL Solution: Divergence as a Rehearsal Tool

The proposed framework, called Diversity-Preserving Hybrid RL (DPH-RL), introduces a fundamental shift. Instead of viewing the divergence term as a mere regularizer, DPH-RL leverages it as the primary solution for knowledge preservation. The method employs mass-covering f-divergences, such as forward-KL and Jensen-Shannon (JS) divergence, which continuously reference the initial, pre-trained model policy.

This acts as a rehearsal mechanism, forcing the fine-tuning process to maintain broad solution coverage and prevent the collapse into a single solution mode. "Our work highlights a crucial, overlooked axis for improving RLVR," the researchers state, demonstrating that "the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models."
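The Jensen-Shannon divergence named above can be sketched for discrete distributions as the average of two KL terms against the mixture. Unlike forward KL, it is symmetric and bounded by ln 2, staying finite even when the two policies have disjoint support (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions, skipping zero-mass bins of a."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def js(p, q):
    """Jensen-Shannon divergence: symmetric, mass-covering, bounded by ln 2.

    Well-defined even for disjoint supports, because the mixture m is
    positive wherever p or q is positive.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The boundedness makes JS a gentler penalty than forward KL while still referencing the initial policy's full support, which is one reason a mass-covering divergence can serve as a rehearsal signal rather than a hard constraint.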

Proven Results and Training Efficiency

Extensive experiments on mathematical reasoning and SQL generation tasks show that DPH-RL resolves the Pass@k degradation paradox. The framework not only prevents the drop in multi-attempt performance but also improves both Pass@1 and Pass@k, in-domain as well as on out-of-domain generalization tests.

Furthermore, DPH-RL enhances training efficiency. It computes the necessary f-divergence using generator functions, which requires only sampling from the initial policy and eliminates the need for an expensive online reference model during training. This makes the approach computationally practical for scaling to larger models.
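The generator-function idea can be illustrated with a toy Monte Carlo estimate. An f-divergence D_f(pi0 || pi) can be written as an expectation under the fixed initial policy of f applied to the likelihood ratio, so samples drawn once from pi0 suffice; no reference model needs to run online. The discrete distributions below stand in for per-sequence policy probabilities and are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete "solution space" with known initial policy pi0
# and current policy pi; in practice these would be sequence
# probabilities produced by the LLM.
pi0 = np.array([0.4, 0.3, 0.2, 0.1])
pi  = np.array([0.7, 0.15, 0.1, 0.05])

# Generator function expressing forward KL as an f-divergence of the
# ratio r = pi(x) / pi0(x) under x ~ pi0:  f(r) = -log r.
def f_forward_kl(r):
    return -np.log(r)

# Monte Carlo estimate using ONLY samples from the (fixed) initial
# policy, which can be drawn once, offline.
xs = rng.choice(len(pi0), size=200_000, p=pi0)
ratios = pi[xs] / pi0[xs]
estimate = float(f_forward_kl(ratios).mean())

# Exact forward KL for comparison; the estimate closely agrees.
exact = float(np.sum(pi0 * np.log(pi0 / pi)))
print(estimate, exact)
```

Because the expectation is taken under pi0, only the current policy's probabilities on the pre-drawn samples are needed at each training step, which is the source of the efficiency claim.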

Why This AI Research Matters

  • Solves a Key Practical Problem: It directly addresses the trade-off between single-answer accuracy and solution diversity, which is critical for real-world applications like code generation where multiple valid solutions exist.
  • Introduces a Novel Paradigm: The research redefines the role of the divergence term in RL from a constraint to an active rehearsal tool, opening a new direction for algorithmic design.
  • Improves Model Robustness: By combating catastrophic forgetting, DPH-RL helps create models that retain broad capabilities and generalize better to unseen problems, leading to more reliable AI systems.
  • Offers an Efficient Path Forward: The method's computational efficiency makes it a viable candidate for fine-tuning the next generation of massive LLMs without prohibitive cost.
