
"Stronger Models are NOT Stronger Teachers for Instruction Tuning"

A podcast on this paper was generated with Google's Illuminate.

Smaller models can teach better than larger ones in LLM instruction tuning

Breaking the bigger-is-better myth in LLM instruction tuning.

https://arxiv.org/abs/2411.07133

🎯 Original Problem:

It is commonly assumed that larger or stronger LLMs make better teachers for instruction tuning smaller models, but this assumption has not been validated. Top models such as GPT-4 are widely used to generate responses for instruction-tuning datasets without solid evidence of their effectiveness as teachers.

-----

🔧 Solution in this Paper:

→ Conducted extensive experiments with 5 base models and 20 response generators across 7 model families

→ Introduced the Compatibility-Adjusted Reward (CAR) metric, which weighs both response quality and compatibility with the base model (see the sketch after this list)

→ Measured compatibility as the average loss of the teacher's responses under the base model being fine-tuned

→ Used greedy decoding for evaluation and measured performance on the AlpacaEval 2 and Arena-Hard benchmarks
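
As a concrete illustration of the compatibility measure, here is a minimal sketch that scores a teacher response by its average per-token loss under the base model and then folds that into a reward. The base-model name and the reward/loss combination are illustrative assumptions, not the paper's exact CAR formulation.

```python
# Sketch: compatibility as the average per-token loss of a teacher
# response under the base model being fine-tuned. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "meta-llama/Llama-3.1-8B"  # hypothetical base-model choice
tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
base.eval()

def response_loss(instruction: str, response: str) -> float:
    """Average cross-entropy of the response tokens given the instruction."""
    prompt_len = tokenizer(instruction, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100  # mask the prompt; score only the response
    with torch.no_grad():
        out = base(full_ids, labels=labels)
    return out.loss.item()  # mean negative log-likelihood per response token

def compatibility_adjusted_reward(reward: float, avg_loss: float) -> float:
    # Illustrative combination: discount a reward-model score by the base
    # model's loss, so high-quality but hard-to-fit responses rank lower.
    return reward / avg_loss
```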

-----

💡 Key Insights:

→ Larger models aren't necessarily better teachers, a finding the paper terms the "Larger Models' Paradox"

→ Open-source models like Gemma-2-9b-it outperform GPT-4 as response generators

→ Models from same family show better compatibility as teachers

→ Higher temperature and top-p sampling in the teacher enhance the tuned model's instruction-following capabilities (a sampling sketch follows this list)
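
To make the decoding insight concrete, the sketch below contrasts greedy decoding with higher-temperature nucleus (top-p) sampling when generating teacher responses. The hyperparameter values are illustrative, not the paper's exact settings.

```python
# Sketch: greedy decoding vs. temperature/top-p sampling for a
# response generator. Sampling values are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

gen_name = "google/gemma-2-9b-it"  # one of the paper's strong teachers
tok = AutoTokenizer.from_pretrained(gen_name)
gen = AutoModelForCausalLM.from_pretrained(gen_name, device_map="auto")

inputs = tok("Explain gradient descent in one paragraph.",
             return_tensors="pt").to(gen.device)

# Greedy decoding: deterministic, least diverse responses.
greedy = gen.generate(**inputs, max_new_tokens=256, do_sample=False)

# Higher temperature plus top-p sampling: more diverse teacher responses,
# which the paper links to better instruction-following after tuning.
sampled = gen.generate(**inputs, max_new_tokens=256, do_sample=True,
                       temperature=1.0, top_p=0.95)
print(tok.decode(sampled[0], skip_special_tokens=True))
```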

-----

📊 Results:

→ Gemma-2-9b-it and Qwen2.5-72B-Instruct emerged as the best response generators

→ The CAR metric outperformed all baseline metrics at predicting teacher effectiveness

→ All tested open-source LLMs significantly outperformed GPT-4 as response generators for instruction tuning
