Small models learn better reasoning by having a dedicated mentor between them and their LLM teacher
Intermediate-sized models act as mentors and improve the reasoning capabilities of smaller models through augmented distillation sets and soft labels.
📚 https://arxiv.org/abs/2410.09037
Original Problem 🔍:
Reasoning distillation from LLMs to smaller language models is constrained by insufficient distillation sets, which limit both the quality of CoT data and the availability of soft labels.
-----
Mentor-KD framework 🧠:
• Introduces a task-specific mentor model to complement LLM teacher knowledge
• Generates additional high-quality Chain-of-Thought (CoT) annotations
• Provides soft labels for student model during reasoning distillation
• Three-step process: CoT annotation generation, mentor model training, and reasoning distillation (see the loss sketch below)
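
A minimal PyTorch sketch of the student objective in the reasoning-distillation step, assuming the two losses described above: cross-entropy on mentor-generated CoT rationales plus a KL term toward the mentor's soft labels. Tensor shapes, the mixing weight `alpha`, and the temperature are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,        # (batch, seq_len, vocab) over CoT tokens
    cot_targets: torch.Tensor,           # (batch, seq_len) mentor-generated CoT token ids
    student_label_logits: torch.Tensor,  # (batch, num_labels) student's answer logits
    mentor_label_logits: torch.Tensor,   # (batch, num_labels) mentor's soft labels
    alpha: float = 0.5,                  # hypothetical mixing weight
    temperature: float = 2.0,            # hypothetical softmax temperature
) -> torch.Tensor:
    # Rationale distillation: teach the student to reproduce the mentor's CoT.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        cot_targets.reshape(-1),
    )
    # Soft-label distillation: match the mentor's softened output distribution.
    kd = F.kl_div(
        F.log_softmax(student_label_logits / temperature, dim=-1),
        F.softmax(mentor_label_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1 - alpha) * kd

# Toy usage with random tensors, just to show the shapes.
loss = distillation_loss(
    student_logits=torch.randn(4, 16, 100),
    cot_targets=torch.randint(0, 100, (4, 16)),
    student_label_logits=torch.randn(4, 5),
    mentor_label_logits=torch.randn(4, 5),
)
```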
-----
Key Insights from this Paper 💡:
• Task-specific mentors can effectively augment limited LLM teacher distillation sets (see the sketch after this list)
• Combining rationale distillation and soft label distillation improves student performance
• Mentor-KD is effective across various reasoning tasks and model sizes
• The framework shows robustness in low-resource scenarios
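
A sketch of how a fine-tuned mentor might augment the distillation set: sample several CoT rationales per question and keep only those whose final answer matches the gold label. The helper names (`sample_rationales`, `extract_answer`) and the filtering rule are assumptions for illustration, not the paper's released code.

```python
from typing import Callable

def augment_with_mentor(
    dataset: list[dict],                                  # [{"question": ..., "answer": ...}, ...]
    sample_rationales: Callable[[str, int], list[str]],   # mentor decoding fn (hypothetical)
    extract_answer: Callable[[str], str],                 # pulls the final answer out of a CoT string
    num_samples: int = 4,
) -> list[dict]:
    augmented = []
    for example in dataset:
        for cot in sample_rationales(example["question"], num_samples):
            # Keep only rationales that end in the correct answer.
            if extract_answer(cot) == example["answer"]:
                augmented.append({"question": example["question"], "rationale": cot})
    return augmented

# Toy usage with stand-in functions, just to show the data flow.
toy_data = [{"question": "2 + 3 = ?", "answer": "5"}]
print(augment_with_mentor(
    toy_data,
    sample_rationales=lambda q, n: ["2 plus 3 equals 5. The answer is 5."] * n,
    extract_answer=lambda cot: cot.rsplit(" ", 1)[-1].rstrip("."),
))
```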
-----
Results 📊:
• Mentor-KD outperforms baselines across complex reasoning tasks
• 2% average accuracy improvement over previous SOTA (MCC-KD)
• Significant gains in commonsense and logical reasoning tasks
• Student models sometimes outperform LLM teachers
• Effective with only 40% of original distillation sets, demonstrating cost-efficiency