Commercial fine-tuning APIs struggle to teach LLMs new knowledge, averaging only 37% generalization accuracy
This paper evaluates how well commercial fine-tuning APIs from OpenAI and Google actually work for teaching new knowledge to LLMs. Through systematic testing across multiple domains, it reveals significant limitations in current fine-tuning services for reliable knowledge infusion.
-----
https://arxiv.org/abs/2411.05059
🤔 Original Problem:
Companies offer fine-tuning APIs to customize closed-source LLMs, but their effectiveness for teaching new information remains unclear. Users have limited control over fine-tuning parameters and lack benchmarks to evaluate these services.
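For context, here is a minimal sketch of what launching such a job looks like with OpenAI's Python SDK (v1-style); the file name, model snapshot, and epoch count are illustrative. The API exposes only a handful of knobs (epochs, batch size, learning-rate multiplier), which is the limited control the post refers to:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of chat-formatted training examples
train_file = client.files.create(
    file=open("news_qa_train.jsonl", "rb"),  # illustrative file name
    purpose="fine-tune",
)

# Create the job; epochs is one of the few exposed hyperparameters
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=train_file.id,
    hyperparameters={"n_epochs": 10},
)
print(job.id, job.status)
```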
-----
🔧 Solution in this Paper:
→ The researchers created FineTuneBench, a framework with 625 training and 1,075 test questions across four domains: news, fictional profiles, medical guidelines, and code updates.
→ They tested five commercial models: GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Gemini 1.5 Pro and Gemini 1.5 Flash.
→ The evaluation tested both memorization of training data and generalization to rephrased questions.
→ They experimented with different training formats, such as direct QA pairs, masked sentences, and fact completion (sketched below).
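As a rough illustration, those three formats might look like the following chat-style JSONL records; the fictional-person fact and prompt wordings here are invented, not the paper's actual data:

```python
import json

# Invented fictional-profile fact, in the spirit of the paper's
# fictional-people domain; not taken from FineTuneBench itself.
question = "What is Jane Doe's occupation?"
answer = "marine biologist"

# 1) Direct QA pair
qa = {"messages": [
    {"role": "user", "content": question},
    {"role": "assistant", "content": answer},
]}

# 2) Masked sentence: the model fills in the blanked-out fact
masked = {"messages": [
    {"role": "user", "content": "Fill in the blank: Jane Doe works as a ____."},
    {"role": "assistant", "content": answer},
]}

# 3) Fact completion: the model completes a truncated statement
completion = {"messages": [
    {"role": "user", "content": "Complete the sentence: Jane Doe's occupation is"},
    {"role": "assistant", "content": answer},
]}

# Write one record per line, as OpenAI's fine-tuning endpoint expects
with open("news_qa_train.jsonl", "w") as f:
    for record in (qa, masked, completion):
        f.write(json.dumps(record) + "\n")
```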
-----
💡 Key Insights:
→ Models can memorize training data but struggle to generalize knowledge
→ Updating existing knowledge is harder than learning new information
→ OpenAI models significantly outperform Gemini models
→ Smaller models can learn better: GPT-4o-mini retained new knowledge better than GPT-4o
-----
📊 Results:
→ Average generalization accuracy: 37% for new knowledge, 19% for updating existing knowledge
→ GPT-4o-mini performed best, followed by GPT-3.5-turbo and GPT-4o
→ Gemini models showed negligible learning with <5% accuracy
→ Performance plateaus after 10-20 training epochs
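A hedged sketch of how the memorization vs. generalization split can be scored: query the fine-tuned model with the exact training questions and with rephrasings, then compute substring-match accuracy. The fine-tuned model ID and QA pairs below are placeholders, not the paper's data:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini-2024-07-18:org::abc123"  # placeholder fine-tuned model ID

def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of questions whose reply contains the expected answer."""
    correct = 0
    for question, answer in pairs:
        reply = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": question}],
            temperature=0,
        ).choices[0].message.content
        correct += answer.lower() in reply.lower()
    return correct / len(pairs)

# Exact training questions probe memorization; rephrasings probe generalization
trained = [("What is Jane Doe's occupation?", "marine biologist")]
rephrased = [("What does Jane Doe do for a living?", "marine biologist")]

print("memorization:", accuracy(trained))
print("generalization:", accuracy(rephrased))
```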