
Mentor-KD: Making Small Language Models Better Multi-step Reasoners

This podcast was generated with Google's Illuminate.

Small models learn to reason better when a dedicated mentor sits between them and their LLM teacher.

Mid-sized models act as mentors and improve the reasoning capabilities of smaller models through augmented distillation sets and soft labels.

📚 https://arxiv.org/abs/2410.09037

Original Problem 🔍:

Reasoning distillation from LLMs to smaller language models is constrained by insufficient distillation sets: a black-box LLM teacher yields limited-quality CoT annotations and cannot provide soft labels for the student.

-----

Mentor-KD framework 🧠:

• Introduces a task-specific mentor model to complement the LLM teacher's knowledge

• Generates additional high-quality Chain-of-Thought (CoT) annotations

• Provides soft labels for the student model during reasoning distillation

• Three-step process: CoT annotation generation, mentor model training, and reasoning distillation (sketched below)
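
A minimal end-to-end sketch of those three steps, assuming hypothetical helper functions (`llm_teacher_cot`, `finetune`, `mentor_generate`) and stand-in model names; the actual prompts, filtering rules, and training setup are the paper's, not reproduced here.

```python
# Sketch of the Mentor-KD pipeline (hypothetical helpers, stubbed so it runs).

from dataclasses import dataclass

@dataclass
class Example:
    question: str
    answer: str

def llm_teacher_cot(question: str) -> tuple[str, str]:
    """Step 1 (stub): prompt the LLM teacher for a CoT rationale and an answer."""
    return "reasoning steps ...", "42"

def finetune(model_name: str, data: list) -> str:
    """Stub for fine-tuning; returns an identifier for the trained model."""
    return f"{model_name}-finetuned"

def mentor_generate(mentor: str, question: str) -> tuple[str, str, list[float]]:
    """Step 2 (stub): the mentor produces extra CoT annotations plus soft labels."""
    return "augmented reasoning ...", "42", [0.1, 0.7, 0.2]

train_set = [Example("What is 6 * 7?", "42")]

# Step 1: collect CoT annotations from the LLM teacher, keeping only
# rationales whose final answer matches the gold label.
teacher_cots = []
for ex in train_set:
    rationale, pred = llm_teacher_cot(ex.question)
    if pred == ex.answer:
        teacher_cots.append((ex.question, rationale, ex.answer))

# Step 2: fine-tune a mid-sized mentor on the filtered teacher CoTs, then use it
# to augment the distillation set and emit soft labels for the student.
mentor = finetune("t5-xl-mentor", teacher_cots)
augmented_cots, soft_labels = [], []
for ex in train_set:
    rationale, pred, probs = mentor_generate(mentor, ex.question)
    augmented_cots.append((ex.question, rationale, pred))
    soft_labels.append(probs)  # feeds the soft-label term of the distillation loss

# Step 3: distill into the student using both the rationale targets
# and the mentor's soft labels (see the loss sketch further below).
student = finetune("t5-small-student", teacher_cots + augmented_cots)
print(student)
```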

-----

Key Insights from this Paper 💡:

• Task-specific mentors can effectively augment limited LLM teacher distillation sets

• Combining rationale distillation and soft label distillation improves student performance (see the loss sketch after this list)

• Mentor-KD is effective across various reasoning tasks and model sizes

• The framework shows robustness in low-resource scenarios
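
A minimal PyTorch sketch of how the two distillation signals could be combined: cross-entropy over mentor-generated CoT rationale tokens plus a temperature-scaled KL term against the mentor's soft labels. The `alpha` weighting and `temperature` values here are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,     # (batch, seq, vocab)
                      rationale_targets: torch.Tensor,  # (batch, seq) token ids
                      mentor_logits: torch.Tensor,      # (batch, seq, vocab)
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Rationale distillation: next-token cross-entropy on the CoT annotations.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        rationale_targets.reshape(-1),
        ignore_index=-100,
    )

    # Soft-label distillation: match the mentor's softened output distribution.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    mentor_probs = F.softmax(mentor_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, mentor_probs, reduction="batchmean")
    kl = kl * (temperature ** 2)  # standard temperature scaling

    return alpha * ce + (1.0 - alpha) * kl

# Toy usage with random tensors.
B, T, V = 2, 8, 100
loss = distillation_loss(torch.randn(B, T, V),
                         torch.randint(0, V, (B, T)),
                         torch.randn(B, T, V))
print(loss.item())
```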

-----

Results 📊:

• Mentor-KD outperforms baselines across complex reasoning tasks

• 2% average accuracy improvement over previous SOTA (MCC-KD)

• Significant gains in commonsense and logical reasoning tasks

• Student models sometimes outperform LLM teachers

• Effective with only 40% of original distillation sets, demonstrating cost-efficiency
