Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

A novel three-stage curriculum learning framework efficiently distills Chain-of-Thought (CoT) reasoning from large teacher models into smaller student models. The method uses masked shuffled reconstruction and Group Relative Policy Optimization (GRPO) to enable compact models to produce accurate, concise reasoning traces. The approach addresses the capacity mismatch inherent in knowledge distillation and is validated on the GSM8K mathematical reasoning dataset.

New Curriculum Learning Framework Distills Efficient Chain-of-Thought Reasoning into Small Models

A novel three-stage curriculum learning framework has been developed to solve a core problem in AI efficiency: distilling the verbose, multi-step reasoning of large language models into much smaller, capable student models. The research tackles the inherent capacity mismatch where smaller models struggle to replicate the lengthy Chain-of-Thought (CoT) rationales from teachers, a process crucial for maintaining interpretability in complex reasoning tasks. By employing a progressive skill acquisition strategy, the method enables compact models to produce accurate, concise, and faithful reasoning traces, marking a significant advance in knowledge distillation techniques.

The Challenge of Faithful Reasoning Distillation

Current approaches to transferring CoT reasoning from large teacher models to smaller students often force a trade-off. Methods that compress reasoning into a single step sacrifice the step-by-step interpretability that makes CoT valuable for tasks like math and logic. Conversely, asking a small model to directly mimic a long, teacher-generated rationale typically leads to failure, as the student lacks the capacity to process and reproduce such verbose sequences. This creates a bottleneck for deploying efficient, transparent reasoning models in resource-constrained environments.

A Three-Stage Progressive Learning Curriculum

The proposed framework addresses this through a structured, curriculum-based pipeline designed to build reasoning competency progressively.

The first stage, masked shuffled reconstruction, focuses on building a foundational structural understanding of reasoning sequences. The model learns to reconstruct correct reasoning chains from perturbed inputs, internalizing the logical flow without initially worrying about stylistic verbosity.
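The paper's exact perturbation scheme is not detailed here, but the idea can be sketched as a data-corruption step: mask some steps of a teacher rationale, shuffle their order, and train the student to reconstruct the original chain. The `mask_and_shuffle` helper and the step-level granularity below are illustrative assumptions, not the authors' implementation.

```python
import random

def mask_and_shuffle(steps, mask_ratio=0.3, mask_token="<mask>", seed=None):
    # Replace a fraction of reasoning steps with a mask token,
    # then shuffle the step order (hypothetical perturbation scheme).
    rng = random.Random(seed)
    perturbed = [mask_token if rng.random() < mask_ratio else s for s in steps]
    rng.shuffle(perturbed)
    return perturbed

# A GSM8K-style teacher rationale split into steps:
chain = [
    "Natalia sold 48 clips in April.",
    "In May she sold half as many: 48 / 2 = 24.",
    "Total: 48 + 24 = 72 clips.",
]
corrupted = mask_and_shuffle(chain, seed=0)
# Training pair: input = corrupted chain, target = the original ordered chain.
```

Reconstructing the ordered chain from such corrupted inputs forces the model to learn the logical dependencies between steps rather than surface wording.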

The second stage introduces Group Relative Policy Optimization (GRPO) on masked completion tasks. Here, the model is not merely imitating the teacher but actively learning to generate its own reasoning. GRPO allows the student to discover an optimal balance between reasoning accuracy and output brevity, rewarding it for being both correct and concise.
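GRPO's defining step is to score a group of sampled completions for the same prompt and normalize each reward against the group's own mean and standard deviation, avoiding a separate value network. The reward shape below, a correctness term plus a length-based brevity bonus, is an assumed example of how accuracy and conciseness could be balanced; the paper's actual reward function may differ.

```python
import statistics

def reward(is_correct, n_tokens, len_budget=200, brevity_weight=0.2):
    # Hypothetical reward: 1 for a correct final answer, plus a bonus
    # that grows as the reasoning trace shrinks below a length budget.
    accuracy = 1.0 if is_correct else 0.0
    brevity = max(0.0, 1.0 - n_tokens / len_budget)
    return accuracy + brevity_weight * brevity

def grpo_advantages(rewards):
    # GRPO's core step: normalize each sampled completion's reward
    # against the mean and std of its own sampling group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt: (correct?, token count)
group = [(True, 80), (True, 180), (False, 60), (False, 150)]
rewards = [reward(c, n) for c, n in group]
advantages = grpo_advantages(rewards)
# The correct-and-short completion receives the largest positive advantage.
```

With this scoring, a completion that is both correct and concise is pushed up relative to its group, which is exactly the accuracy/brevity trade-off the second stage optimizes.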

The third and final stage targets persistent failure modes. The framework identifies cases where the student's reasoning still diverges from the teacher's knowledge. It then guides the model through a targeted rewriting process, again optimized with GRPO, to internalize the correct reasoning patterns it has previously missed.
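The stage-3 data selection can be pictured as filtering for problems the distilled student still gets wrong and routing them into the rewriting pass. The callables and the answer-matching criterion below are hypothetical stand-ins; the paper's divergence criterion is not specified in this summary.

```python
def select_failures(problems, student_solve, check_answer):
    # Keep only problems the student still answers incorrectly; these
    # are then paired with teacher rationales for the GRPO-optimized
    # rewriting pass (selection criterion assumed for illustration).
    return [p for p in problems if not check_answer(p, student_solve(p))]

# Toy example with hypothetical solver and checker:
problems = [
    {"q": "2 + 2", "gold": "4"},
    {"q": "3 * 7", "gold": "21"},
]
student = lambda p: "4" if p["q"] == "2 + 2" else "20"  # wrong on 3 * 7
correct = lambda p, ans: ans == p["gold"]
hard_cases = select_failures(problems, student, correct)
# hard_cases → only the "3 * 7" problem, queued for targeted rewriting
```

Concentrating the final optimization budget on these persistent failures is what lets the curriculum close the remaining gap to the teacher without re-training on examples the student has already mastered.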

Demonstrated Performance on Mathematical Reasoning

Experiments on the GSM8K grade-school math dataset validate the framework's effectiveness. The approach enabled Qwen2.5-3B-Base, a compact 3-billion-parameter model, to achieve an 11.29% accuracy improvement while simultaneously reducing average output length by 27.4% compared to learning from the original teacher rationales.

This dual achievement of higher accuracy and greater conciseness allowed the distilled student model to surpass the performance of both standard instruction-tuned variants and prior state-of-the-art distillation methods. The results demonstrate that smaller models can be taught not just to reason, but to reason efficiently and reliably.

Why This Matters for Efficient AI

This research provides a scalable blueprint for creating highly capable, interpretable reasoning models that do not require massive computational resources.

  • Enables On-Device AI: By creating small models that maintain robust CoT reasoning, complex AI applications can run locally on smartphones and edge devices, enhancing privacy and reducing latency.
  • Preserves Interpretability: The method maintains the step-by-step reasoning that is essential for debugging AI decisions and building trust in critical domains like healthcare and finance.
  • Reduces Operational Costs: Deploying a 3B-parameter model with strong reasoning skills is vastly more cost-effective than using a 100B+ parameter model, lowering the barrier for businesses and researchers.
  • Advances Knowledge Distillation: The curriculum learning and GRPO techniques offer a new paradigm for transferring complex capabilities, moving beyond simple imitation to guided skill acquisition.
