Model merging optimizes multi-task performance by recycling checkpoints that would normally be discarded.
This paper proposes an approach to optimizing model merging at scale (~100B parameters): it recycles seemingly suboptimal checkpoints from different training runs to build merged models that reduce performance tradeoffs across tasks, with no additional training.
-----
https://arxiv.org/abs/2412.04144
🤔 Original Problem:
→ Traditional model merging focuses on combining specialized expert models, but modern LLM development creates many multi-task checkpoints that exhibit performance tradeoffs.
→ These checkpoints are usually discarded as "failed experiments" despite potentially containing valuable information.
-----
🔧 Solution in this Paper:
→ The paper introduces an evolutionary optimization approach to find optimal merge weights for combining multiple checkpoints.
→ They use 16 different 100B-parameter checkpoints from various training stages, including both supervised fine-tuning and preference optimization runs.
→ The merging strategy optimizes per-checkpoint weights with the CMA-ES algorithm to minimize performance tradeoffs across different tasks (a rough sketch follows this list).
→ The method recycles checkpoints that would typically be discarded, enabling training-free optimization.
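The core mechanics, a weighted average in parameter space whose per-checkpoint weights are tuned by CMA-ES against task scores, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes the open-source `cma` package, a hypothetical `evaluate_merged_model()` that returns per-task scores, and checkpoints small enough to merge in memory (the paper works with 16 checkpoints of ~100B parameters each).

```python
# Minimal sketch: weighted checkpoint merging + CMA-ES weight search.
# Assumptions (not from the paper): the `cma` package (pip install cma),
# a hypothetical evaluate_merged_model() returning per-task scores,
# and state dicts small enough to hold in memory.
import cma
import torch

def merge_checkpoints(state_dicts, weights):
    """Parameter-space weighted average of N checkpoints sharing one architecture."""
    w = torch.tensor(weights, dtype=torch.float32).clamp(min=0.0)
    w = w / w.sum().clamp(min=1e-8)            # normalize to a convex combination
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(wi * sd[key].float() for wi, sd in zip(w, state_dicts))
    return merged

def fitness(weights, state_dicts, eval_tasks):
    """CMA-ES minimizes, so return the negative average task score."""
    merged = merge_checkpoints(state_dicts, weights)
    scores = evaluate_merged_model(merged, eval_tasks)   # hypothetical evaluator
    return -sum(scores.values()) / len(scores)

def search_merge_weights(state_dicts, eval_tasks, iterations=50):
    n = len(state_dicts)                                 # e.g. 16 checkpoints
    es = cma.CMAEvolutionStrategy(n * [1.0 / n], 0.3)    # start from uniform weights
    for _ in range(iterations):
        candidates = es.ask()
        es.tell(candidates, [fitness(c, state_dicts, eval_tasks) for c in candidates])
        if es.stop():
            break
    return es.result.xbest                               # best-found merge weights
```

The design choice the post emphasizes is optimizing all checkpoint weights jointly against a multi-task objective, rather than hand-picking a few "good" checkpoints.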
-----
💡 Key Insights:
→ Even seemingly poor-performing checkpoints can contribute meaningfully to optimal merges
→ Best merges tend to include almost all checkpoints with non-zero weights
→ Individual checkpoint performance doesn't predict its importance in the final merge
-----
📊 Results:
→ Achieved Pareto-optimal performance across multiple tasks without hurting held-out task performance
→ Search-optimized merges outperformed baselines by up to 2.2 average points
→ Maintained 64% prediction accuracy on test sets