New AI Training Method Enables Smaller Models to Master Complex Reasoning Efficiently
A novel three-stage curriculum learning framework has been developed to solve a core challenge in AI: distilling the complex, multi-step reasoning of large language models into much smaller, efficient student models. The method teaches compact models concise yet accurate reasoning paths, delivering significant performance gains while dramatically reducing output verbosity: an 11.29-percentage-point accuracy improvement and a 27.4% reduction in output length for a 3-billion-parameter model on the GSM8K benchmark.
The Core Challenge: The Capacity Mismatch in Reasoning Distillation
Distilling Chain-of-Thought (CoT) reasoning from powerful but large teacher models into smaller, deployable student models is notoriously difficult. Teacher-generated rationales are often lengthy and intricate, exceeding the learning capacity of compact architectures. Prior approaches that compress reasoning into a single step sacrifice the step-by-step interpretability that makes CoT valuable for trust and error analysis. This creates a fundamental trade-off between efficiency and faithful, structured reasoning.
The Three-Stage Curriculum: Progressive Skill Acquisition
The proposed framework addresses this by breaking down the learning process into three distinct, progressive stages designed to build reasoning competency incrementally.
Stage 1: Building Structural Understanding
The first stage focuses on establishing a foundational grasp of reasoning structure. The student model is trained via masked shuffled reconstruction, where parts of a teacher's reasoning chain are obscured and the sequence is disordered. The model must learn to reconstruct the correct, coherent reasoning flow, forcing it to understand logical dependencies and sequence rather than merely memorizing content.
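A minimal sketch of how a masked-shuffled-reconstruction training pair could be constructed, assuming the reasoning chain is available as a list of steps. The `MASK` token, the `mask_ratio` value, and the toy chain are illustrative assumptions, not details taken from the paper:

```python
import random

MASK = "<mask>"

def make_reconstruction_example(steps, mask_ratio=0.3, seed=0):
    """Build one masked-shuffled-reconstruction pair.

    Input: the teacher's chain, shuffled out of order with some
    steps obscured. Target: the original, coherent chain, so the
    student must recover logical order rather than memorize text.
    """
    rng = random.Random(seed)
    corrupted = list(steps)
    rng.shuffle(corrupted)                            # disorder the sequence
    n_mask = max(1, int(len(corrupted) * mask_ratio))
    for i in rng.sample(range(len(corrupted)), n_mask):
        corrupted[i] = MASK                           # obscure some steps
    return {"input": " ".join(corrupted), "target": " ".join(steps)}

chain = ["Tom has 3 apples.", "He buys 2 more.", "3 + 2 = 5.", "Answer: 5."]
example = make_reconstruction_example(chain, seed=42)
```

Because the target is always the clean, ordered chain, the model is graded on restoring structure, which is the point of this stage.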
Stage 2: Learning Brevity and Accuracy with GRPO
In the second stage, the model learns to generate its own reasoning. Using Group Relative Policy Optimization (GRPO), a reinforcement learning technique, the student is trained on masked completion tasks. GRPO optimizes the policy by sampling a group of candidate outputs for each prompt and scoring each one relative to the group's average reward, removing the need for a separate value model. This lets the student autonomously discover an optimal balance between reasoning accuracy and output brevity, moving beyond verbose teacher imitation.
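The group-relative core of GRPO can be illustrated with a toy advantage computation. The reward shape below (correctness minus a small length penalty) and the `len_penalty` weight are assumptions for illustration; the paper's actual reward design is not specified here:

```python
import statistics

def reward(completion, gold_answer, len_penalty=0.001):
    """Toy reward: +1 for a correct final answer, minus a small
    penalty per generated token to encourage brevity."""
    correct = 1.0 if completion["answer"] == gold_answer else 0.0
    return correct - len_penalty * completion["n_tokens"]

def grpo_advantages(group, gold_answer):
    """Group-relative advantages: each sampled completion's reward is
    normalized by the mean/std of its own group, so no learned value
    model is needed."""
    rs = [reward(c, gold_answer) for c in group]
    mu = statistics.mean(rs)
    sigma = statistics.pstdev(rs) or 1.0        # avoid division by zero
    return [(r - mu) / sigma for r in rs]

group = [
    {"answer": "5", "n_tokens": 40},   # correct and short
    {"answer": "5", "n_tokens": 120},  # correct but verbose
    {"answer": "7", "n_tokens": 60},   # wrong answer
]
advs = grpo_advantages(group, gold_answer="5")
```

Under this reward, a correct-and-short completion gets the highest advantage and a wrong one the lowest, which is how the accuracy-brevity balance emerges from the group comparison.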
Stage 3: Targeted Remediation of Failure Cases
The final stage targets persistent weaknesses. The framework identifies specific, recurring failure cases in the student's reasoning. The model is then guided to "internalize" the correct teacher knowledge by learning to rewrite these flawed reasoning paths into correct ones. This targeted rewriting is again optimized using GRPO, ensuring efficient and focused learning on the most challenging concepts.
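A hedged sketch of what mining recurring failures and building rewrite pairs could look like. The `error_type` signature, the record fields, and the prompt format are hypothetical choices for illustration, not the paper's actual pipeline:

```python
from collections import Counter

def find_recurring_failures(records, min_count=2):
    """Keep only the student's wrong solutions whose coarse error
    signature recurs, so remediation targets persistent weaknesses."""
    wrong = [r for r in records if r["pred"] != r["gold"]]
    counts = Counter(r["error_type"] for r in wrong)
    recurring = {t for t, n in counts.items() if n >= min_count}
    return [r for r in wrong if r["error_type"] in recurring]

def make_rewrite_pair(failure, teacher_rationale):
    """Stage-3 training pair: the model rewrites its own flawed
    reasoning path into the teacher's correct one."""
    return {
        "input": (f"Question: {failure['question']}\n"
                  f"Flawed reasoning: {failure['rationale']}\nRewrite:"),
        "target": teacher_rationale,
    }

records = [
    {"question": "q1", "pred": "4", "gold": "5",
     "rationale": "3 + 2 = 4", "error_type": "arithmetic"},
    {"question": "q2", "pred": "9", "gold": "10",
     "rationale": "4 + 6 = 9", "error_type": "arithmetic"},
    {"question": "q3", "pred": "1", "gold": "2",
     "rationale": "misread units", "error_type": "units"},
    {"question": "q4", "pred": "8", "gold": "8",
     "rationale": "5 + 3 = 8", "error_type": "none"},
]
failures = find_recurring_failures(records)
pair = make_rewrite_pair(failures[0],
                         teacher_rationale="3 + 2 = 5, so the answer is 5.")
```

The rewrite pairs produced this way would then be optimized with the same GRPO procedure as in Stage 2, keeping training focused on the model's hardest concepts.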
Experimental Results and Superior Performance
Experiments conducted on the GSM8K mathematical reasoning benchmark validate the framework's effectiveness. The student model, Qwen2.5-3B-Base, achieved an 11.29 percentage point improvement in accuracy. Crucially, it did so while generating outputs that were 27.4% shorter than those from standard training approaches. This dual achievement of higher accuracy and greater conciseness allowed it to surpass both standard instruction-tuned variants and previous state-of-the-art distillation methods.
Why This Matters for AI Development
- Enables Efficient Deployment: It allows high-level reasoning capabilities to be packaged into smaller, faster, and cheaper models suitable for real-world applications and edge computing.
- Preserves Interpretability: Unlike "black-box" compression, the method maintains step-by-step reasoning, which is critical for debugging, trust, and safety in high-stakes domains.
- Optimizes the Accuracy-Brevity Trade-off: The framework provides a principled way for models to learn to be both correct and concise, reducing computational overhead for downstream tasks.
- Introduces a New Training Paradigm: The staged curriculum using GRPO presents a novel blueprint for teaching complex cognitive skills to AI, moving beyond simple imitation learning.