The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

Researchers have developed DPH-RL (Diversity-Preserving Hybrid Reinforcement Learning), a novel framework that addresses the paradoxical degradation of solution diversity when LLMs are fine-tuned with Reinforcement Learning with Verifiable Reward (RLVR). The method repurposes mass-covering f-divergences, such as forward KL and Jensen-Shannon divergence, as rehearsal mechanisms that prevent catastrophic forgetting while maintaining high Pass@k performance. Experiments on mathematical reasoning and SQL generation tasks demonstrate that DPH-RL resolves the diversity collapse that plagues standard RL fine-tuning approaches.

New AI Training Method Solves Critical Paradox in Language Model Fine-Tuning

A new research framework tackles a persistent and paradoxical failure in advanced AI training: the degradation of a model's ability to generate multiple diverse solutions even as its single-best-answer accuracy improves. This common issue, which plagues the fine-tuning of Large Language Models (LLMs) using Reinforcement Learning with Verifiable Reward (RLVR), is now addressed by a novel approach that repurposes a core mathematical component as a "rehearsal mechanism" to preserve knowledge.

The Core Problem: Catastrophic Forgetting in RL Fine-Tuning

When developers fine-tune LLMs for tasks like code or math problem-solving, they often use RLVR, where the model is rewarded for generating correct, verifiable answers. A frequent but counterintuitive outcome is that while Pass@1 (accuracy on the first attempt) rises, the model's Pass@k performance (the probability that at least one of *k* sampled solutions is correct) plummets, because the sampled solutions lose their variety. This is accompanied by catastrophic forgetting, in which the model loses previously learned skills and diversity. The researchers identify the choice of divergence term in the RL objective as a surprisingly overlooked culprit.
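For concreteness, Pass@k is typically computed with the standard unbiased estimator from the code-generation literature (a minimal sketch, not code from this paper): given n sampled solutions of which c are correct, it gives the probability that a random subset of k samples contains at least one correct one.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    solutions drawn (without replacement) from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        # Every size-k subset must contain a correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A model that concentrates all probability on one solution can keep Pass@1 high while its Pass@k barely improves with larger k; a diverse model sees Pass@k climb steeply.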

Standard approaches either use a reverse KL-divergence, which narrows the model's policy to a single high-reward mode, or forgo a divergence term entirely. The former actively accelerates knowledge decay by restricting output diversity, while the latter provides no safeguard, allowing the model to drift arbitrarily far from its initial, broad knowledge base. Both lack a mechanism for proactive retention.
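The asymmetry between the two KL directions can be seen in a few lines on toy categorical distributions. The policies below are hypothetical and only illustrate why reverse KL tolerates mode collapse while the mass-covering forward KL punishes it:

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions (natural log),
    skipping zero-mass bins of a."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

# Hypothetical example: a bimodal initial policy pi0 over four solution
# "modes", and a fine-tuned policy pi that has collapsed onto one mode.
pi0 = [0.45, 0.45, 0.05, 0.05]   # initial policy: two strong modes
pi  = [0.97, 0.01, 0.01, 0.01]   # fine-tuned policy: collapsed

# Reverse KL, KL(pi || pi0), is the usual RLVR penalty: it stays small
# whenever pi remains inside pi0's support, so collapsing onto a single
# mode is barely penalized.
print("reverse KL:", kl(pi, pi0))

# Forward KL, KL(pi0 || pi), is mass-covering: it grows large when pi
# abandons probability mass that pi0 assigned, so the collapse is
# penalized heavily.
print("forward KL:", kl(pi0, pi))
```

On this toy example the forward direction assigns a noticeably larger penalty to the collapsed policy than the reverse direction does, which is the intuition behind treating the divergence choice as the key lever.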

The DPH-RL Solution: Divergence as a Rehearsal Tool

The proposed framework, called Diversity-Preserving Hybrid RL (DPH-RL), introduces a fundamental shift. Instead of viewing the divergence term as a mere regularizer, DPH-RL leverages it as the primary solution for knowledge preservation. The method employs mass-covering f-divergences, such as forward-KL and Jensen-Shannon (JS) divergence, which continuously reference the initial, pre-trained model policy.

This acts as a rehearsal mechanism, forcing the fine-tuning process to maintain broad solution coverage and prevent the collapse into a single solution mode. "Our work highlights a crucial, overlooked axis for improving RLVR," the researchers state, demonstrating that "the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models."
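The Jensen-Shannon divergence named above can be sketched for discrete distributions as the average of two KL terms against the mixture. Unlike forward KL, it is symmetric and bounded by ln 2, staying finite even when the two policies have disjoint support (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions, skipping zero-mass bins of a."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def js(p, q):
    """Jensen-Shannon divergence: symmetric, mass-covering, bounded by ln 2.

    Well-defined even for disjoint supports, because the mixture m is
    positive wherever p or q is positive.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The boundedness makes JS a gentler penalty than forward KL while still referencing the initial policy's full support, which is one reason a mass-covering divergence can serve as a rehearsal signal rather than a hard constraint.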

Proven Results and Training Efficiency

Extensive experiments on mathematical reasoning and SQL generation tasks show that DPH-RL resolves the Pass@k degradation paradox. The framework not only prevents the drop in multi-attempt performance but also improves both Pass@1 and Pass@k, in-domain as well as on out-of-domain generalization tests.

Furthermore, DPH-RL enhances training efficiency. It computes the necessary f-divergence using generator functions, which requires only sampling from the initial policy and eliminates the need for an expensive online reference model during training. This makes the approach computationally practical for scaling to larger models.
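The generator-function idea can be illustrated with a toy Monte Carlo estimate. An f-divergence D_f(pi0 || pi) can be written as an expectation under the fixed initial policy of f applied to the likelihood ratio, so samples drawn once from pi0 suffice; no reference model needs to run online. The discrete distributions below stand in for per-sequence policy probabilities and are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete "solution space" with known initial policy pi0
# and current policy pi; in practice these would be sequence
# probabilities produced by the LLM.
pi0 = np.array([0.4, 0.3, 0.2, 0.1])
pi  = np.array([0.7, 0.15, 0.1, 0.05])

# Generator function expressing forward KL as an f-divergence of the
# ratio r = pi(x) / pi0(x) under x ~ pi0:  f(r) = -log r.
def f_forward_kl(r):
    return -np.log(r)

# Monte Carlo estimate using ONLY samples from the (fixed) initial
# policy, which can be drawn once, offline.
xs = rng.choice(len(pi0), size=200_000, p=pi0)
ratios = pi[xs] / pi0[xs]
estimate = float(f_forward_kl(ratios).mean())

# Exact forward KL for comparison; the estimate closely agrees.
exact = float(np.sum(pi0 * np.log(pi0 / pi)))
print(estimate, exact)
```

Because the expectation is taken under pi0, only the current policy's probabilities on the pre-drawn samples are needed at each training step, which is the source of the efficiency claim.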

Why This AI Research Matters

  • Solves a Key Practical Problem: It directly addresses the trade-off between single-answer accuracy and solution diversity, which is critical for real-world applications like code generation where multiple valid solutions exist.
  • Introduces a Novel Paradigm: The research redefines the role of the divergence term in RL from a constraint to an active rehearsal tool, opening a new direction for algorithmic design.
  • Improves Model Robustness: By combating catastrophic forgetting, DPH-RL helps create models that retain broad capabilities and generalize better to unseen problems, leading to more reliable AI systems.
  • Offers an Efficient Path Forward: The method's computational efficiency makes it a viable candidate for fine-tuning the next generation of massive LLMs without prohibitive cost.
