Smaller models can teach better than larger ones in LLM instruction tuning
Breaking the bigger-is-better myth in LLM instruction tuning.
https://arxiv.org/abs/2411.07133
🎯 Original Problem:
The common assumption that larger or stronger LLMs make better teachers for instruction tuning of smaller models has gone largely untested. Top models like GPT-4 are widely used to generate responses for instruction-tuning datasets without solid evidence that they actually make the best teachers.
-----
🔧 Solution in this Paper:
→ Conducted extensive experiments with 5 base models and 20 response generators across 7 model families
→ Introduced the Compatibility-Adjusted Reward (CAR), a metric that weighs response quality against compatibility with the base model
→ Measured compatibility as the average loss of a teacher's responses under the base model being fine-tuned; lower loss means a better fit (see the sketch after this list)
→ Generated responses with greedy decoding and evaluated the fine-tuned models on AlpacaEval 2 and Arena-Hard
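A minimal sketch of the compatibility measurement, assuming a HuggingFace-style base model. The model name, the `response_loss` helper, and the exact way `car` combines reward and loss are illustrative assumptions, not the paper's verbatim formula:

```python
# Sketch: score a teacher's response by (a) a reward-model quality score and
# (b) how "natural" the response is to the student base model (its avg loss).
# The base model name and the CAR combination are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.1-8B"  # hypothetical student base model
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def response_loss(prompt: str, response: str) -> float:
    """Average per-token loss of `response` under the base model (prompt masked)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids.to(base.device)
    labels = ids.clone()
    labels[:, :prompt_len] = -100  # ignore prompt tokens when computing loss
    return base(ids, labels=labels).loss.item()

def car(reward: float, loss: float, mean_loss: float) -> float:
    """Compatibility-adjusted reward: down-weight responses the base model
    finds unlikely. The paper's exact weighting may differ from this ratio."""
    return reward / (loss / mean_loss)
```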
-----
💡 Key Insights:
→ Larger models aren't necessarily better teachers, a phenomenon the authors call the "Larger Models' Paradox"
→ Open-source models like Gemma-2-9b-it outperform GPT-4 as response generators
→ Teacher models from the same family as the base model show better compatibility
→ Sampling teacher responses with higher temperature and top-p produces students with stronger instruction-following ability (sketch below)
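A hedged sketch of that decoding comparison, using one of the paper's strongest open-source teachers; the specific temperature and top-p values below are illustrative, not the paper's exact settings:

```python
# Sketch: generate teacher responses greedily vs. with higher-temperature
# nucleus sampling; the paper reports the latter yields better teaching data.
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "google/gemma-2-9b-it"  # a top response generator in the paper
tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, device_map="auto")

chat = [{"role": "user", "content": "Explain instruction tuning in two sentences."}]
prompt = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# Template already adds special tokens, so don't add them again here.
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(teacher.device)

# Greedy decoding (deterministic baseline)
greedy = teacher.generate(**inputs, max_new_tokens=256, do_sample=False)

# Higher temperature + top-p sampling (values are illustrative assumptions)
sampled = teacher.generate(**inputs, max_new_tokens=256, do_sample=True,
                           temperature=1.0, top_p=0.95)
print(tok.decode(sampled[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```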
-----
📊 Results:
→ Gemma-2-9b-it and Qwen2.5-72B-Instruct emerged as best response generators
→ CAR metric outperformed all baseline metrics in predicting teacher effectiveness
→ Every open-source LLM tested significantly outperformed GPT-4 as a response generator for instruction tuning