Model merging optimizes multi-task performance by recycling checkpoints that would normally be discarded.
This paper proposes an approach to optimizing model merging at scale (~100B parameters): it recycles seemingly suboptimal checkpoints from different training runs to build merged models that reduce performance tradeoffs across tasks, with no additional training.
-----
https://arxiv.org/abs/2412.04144
🤔 Original Problem:
→ Traditional model merging focuses on combining specialized expert models, but modern LLM development creates many multi-task checkpoints that exhibit performance tradeoffs.
→ These checkpoints are usually discarded as "failed experiments" despite potentially containing valuable information.
-----
🔧 Solution in this Paper:
→ The paper introduces an evolutionary optimization approach to find optimal merge weights for combining multiple checkpoints.
→ They use 16 different 100B-parameter checkpoints from various training stages, including both supervised fine-tuning and preference optimization runs.
→ The merging strategy optimizes per-checkpoint weights with the CMA-ES algorithm to minimize performance tradeoffs across different tasks (a rough sketch follows this list).
→ The method recycles checkpoints that would typically be discarded, enabling training-free optimization.
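The core mechanics, a weighted average in parameter space whose per-checkpoint weights are tuned by CMA-ES against task scores, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes the open-source `cma` package, a hypothetical `evaluate_merged_model()` that returns per-task scores, and checkpoints small enough to merge in memory (the paper works with 16 checkpoints of ~100B parameters each).

```python
# Minimal sketch: weighted checkpoint merging + CMA-ES weight search.
# Assumptions (not from the paper): the `cma` package (pip install cma),
# a hypothetical evaluate_merged_model() returning per-task scores,
# and state dicts small enough to hold in memory.
import cma
import torch

def merge_checkpoints(state_dicts, weights):
    """Parameter-space weighted average of N checkpoints sharing one architecture."""
    w = torch.tensor(weights, dtype=torch.float32).clamp(min=0.0)
    w = w / w.sum().clamp(min=1e-8)            # normalize to a convex combination
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(wi * sd[key].float() for wi, sd in zip(w, state_dicts))
    return merged

def fitness(weights, state_dicts, eval_tasks):
    """CMA-ES minimizes, so return the negative average task score."""
    merged = merge_checkpoints(state_dicts, weights)
    scores = evaluate_merged_model(merged, eval_tasks)   # hypothetical evaluator
    return -sum(scores.values()) / len(scores)

def search_merge_weights(state_dicts, eval_tasks, iterations=50):
    n = len(state_dicts)                                 # e.g. 16 checkpoints
    es = cma.CMAEvolutionStrategy(n * [1.0 / n], 0.3)    # start from uniform weights
    for _ in range(iterations):
        candidates = es.ask()
        es.tell(candidates, [fitness(c, state_dicts, eval_tasks) for c in candidates])
        if es.stop():
            break
    return es.result.xbest                               # best-found merge weights
```

The design choice the post emphasizes is optimizing all checkpoint weights jointly against a multi-task objective, rather than hand-picking a few "good" checkpoints.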
-----
💡 Key Insights:
→ Even seemingly poor-performing checkpoints can contribute meaningfully to optimal merges
→ Best merges tend to include almost all checkpoints with non-zero weights
→ Individual checkpoint performance doesn't predict its importance in the final merge
-----
📊 Results:
→ Achieved Pareto-optimal performance across multiple tasks without hurting held-out task performance
→ Search-optimized merges outperformed baselines by up to 2.2 average points
→ Maintained 64% prediction accuracy on test sets