
"FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?"

The podcast on this paper was generated with Google's Illuminate.

Commercial fine-tuning APIs struggle to teach LLMs new tricks, with only a 37% average success rate on new knowledge

This paper evaluates how well commercial fine-tuning APIs from OpenAI and Google actually work for teaching new knowledge to LLMs. Through systematic testing across multiple domains, it reveals significant limitations in current fine-tuning services for reliable knowledge infusion.

-----

https://arxiv.org/abs/2411.05059

🤔 Original Problem:

Companies offer fine-tuning APIs to customize closed-source LLMs, but their effectiveness for teaching new information remains unclear. Users have limited control over fine-tuning parameters and lack benchmarks to evaluate these services.

-----

🔧 Solution in this Paper:

→ The researchers created FineTuneBench, a framework with 625 training and 1,075 test questions across news, fictional profiles, medical guidelines, and code updates.

→ They tested five commercial models: GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Gemini 1.5 Pro, and Gemini 1.5 Flash.

→ The evaluation tested both memorization of training data and generalization to rephrased questions.

→ They experimented with different training formats, such as direct QA pairs, masked sentences, and fact completion (sketched below).
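
A minimal sketch of what the direct-QA-pair setup might look like in practice: formatting question-answer pairs as chat fine-tuning data and submitting a job through OpenAI's fine-tuning API. The example QA pair, file name, and hyperparameters are illustrative assumptions, not the paper's actual dataset or configuration.

```python
# Minimal sketch: fine-tuning GPT-4o-mini on direct QA pairs via OpenAI's
# fine-tuning API. The QA pair and hyperparameters below are illustrative,
# not taken from the FineTuneBench dataset.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical "new knowledge" QA pair in the chat fine-tuning format.
qa_pairs = [
    {"question": "Who won the 2024 City Marathon?", "answer": "Jane Doe"},
]

with open("train.jsonl", "w") as f:
    for qa in qa_pairs:
        record = {
            "messages": [
                {"role": "user", "content": qa["question"]},
                {"role": "assistant", "content": qa["answer"]},
            ]
        }
        f.write(json.dumps(record) + "\n")

# Upload the training file and start the fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=training_file.id,
    hyperparameters={"n_epochs": 10},  # the paper reports plateaus around 10-20 epochs
)
print(job.id)
```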

-----

💡 Key Insights:

→ Models can memorize training data but struggle to generalize knowledge

→ Updating existing knowledge is harder than learning new information

→ OpenAI models significantly outperform Gemini models

→ Smaller models like GPT-4o-mini show better knowledge retention than their larger counterparts

-----

📊 Results:

→ Average generalization accuracy: 37% for new knowledge, 19% for updating existing knowledge (see the scoring sketch after this list)

→ GPT-4o-mini performed best, followed by GPT-3.5-turbo and GPT-4o

→ Gemini models showed negligible learning with <5% accuracy

→ Performance plateaus after 10-20 training epochs
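
A rough sketch of how memorization and generalization accuracy can be scored once a fine-tuned model is available. The grading rule here (a case-insensitive substring match against the gold answer), the evaluation items, and the fine-tuned model ID are assumptions for illustration; the paper's exact grading procedure may differ.

```python
# Rough sketch: scoring memorization (trained questions) vs. generalization
# (rephrased questions) for a fine-tuned model, using a simple substring check.
from openai import OpenAI

client = OpenAI()

def answer(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    return resp.choices[0].message.content

def accuracy(model: str, items: list[dict]) -> float:
    # items: [{"question": ..., "answer": ...}, ...]
    correct = sum(
        1 for it in items if it["answer"].lower() in answer(model, it["question"]).lower()
    )
    return correct / len(items)

# Hypothetical evaluation sets; rephrased_qs probe the same facts with new wording.
trained_qs = [{"question": "Who won the 2024 City Marathon?", "answer": "Jane Doe"}]
rephrased_qs = [{"question": "The 2024 City Marathon was won by whom?", "answer": "Jane Doe"}]

ft_model = "ft:gpt-4o-mini-2024-07-18:org::abc123"  # placeholder fine-tuned model ID
print("memorization:", accuracy(ft_model, trained_qs))
print("generalization:", accuracy(ft_model, rephrased_qs))
```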
