Merging instruction-tuned expert models at scale yields strong held-in performance and improved zero-shot generalization across diverse tasks.
📚 https://arxiv.org/pdf/2410.03617
Original Problem 🔍:
Model merging combines multiple expert models into a single model that inherits their capabilities. Prior studies focused on small models and a narrow set of merging scenarios, leaving open how well merging scales with model size and the number of experts merged.
-----
Solution in this Paper 🧪:
• Systematic evaluation of model merging at scale (1B to 64B parameters)
• Used PaLM-2 and PaLM-2-IT models as base models
• Created expert models via fine-tuning on 8 held-in task categories
• Tested 4 merging methods: Averaging, Task Arithmetic, DARE-TIES, TIES-Merging (a minimal merging sketch follows this list)
• Varied number of expert models merged (2 to 8)
• Evaluated on held-in tasks and 4 held-out task categories for zero-shot generalization
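For intuition, here is a minimal Python sketch of the two simplest merging schemes above (parameter averaging and Task Arithmetic), operating on plain PyTorch state_dicts. The function names and the scaling coefficient `alpha` are illustrative assumptions, not the paper's implementation; TIES-style sign resolution and DARE-style delta dropping are omitted for brevity.

```python
import torch

def merge_by_averaging(expert_states):
    """Parameter averaging: element-wise mean of the experts' weights."""
    merged = {}
    for name in expert_states[0]:
        merged[name] = torch.stack([exp[name] for exp in expert_states]).mean(dim=0)
    return merged

def merge_task_arithmetic(base_state, expert_states, alpha=1.0):
    """Task Arithmetic: average the task vectors (expert minus base) and add
    them back onto the base weights, scaled by `alpha` (illustrative value)."""
    merged = {}
    for name, base_w in base_state.items():
        task_vectors = [exp[name] - base_w for exp in expert_states]
        merged[name] = base_w + alpha * torch.stack(task_vectors).mean(dim=0)
    return merged
```

Both functions expect state_dicts from models sharing the same architecture; the merged dict can then be loaded back with `model.load_state_dict(merged)`.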
-----
Key Insights from this Paper 💡:
• Instruction-tuned base models facilitate easier merging
• Larger models merge more effectively
• Merged models show improved zero-shot generalization
• Merging methods perform similarly for large instruction-tuned models
-----
Results 📊:
• Merged PaLM-2-IT models consistently outperformed their PaLM-2 counterparts across all settings
• 64B merged models approached task-specific expert performance (normalized score 0.97; see the note after this list)
• Merged 24B+ PaLM-2-IT models surpassed multitask baselines on held-out tasks
• 64B PaLM-2-IT merged model improved held-out performance by 18% over base model
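On reading the 0.97 figure: a normalized score in this kind of setup usually means the merged model's per-task score divided by the corresponding expert's score, averaged over tasks, so 1.0 means matching the experts. The exact normalization used in the paper may differ; the snippet below is only a sketch of that convention.

```python
def normalized_score(merged_scores, expert_scores):
    """Mean of per-task ratios of merged-model score to the task expert's
    score; 1.0 means the merged model matches its experts on average."""
    assert len(merged_scores) == len(expert_scores)
    return sum(m / e for m, e in zip(merged_scores, expert_scores)) / len(merged_scores)
```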