"Weighted-Reward Preference Optimization for Implicit Model Fusion"

The podcast on this paper is generated with Google's Illuminate.

This paper introduces WRPO (Weighted-Reward Preference Optimization), a novel method for fusing the capabilities of multiple LLMs into a smaller target model without complex vocabulary-alignment and distribution-merging procedures. WRPO uses preference optimization with a progressive adaptation strategy to transfer knowledge smoothly while addressing distribution shifts between the source and target models.

-----

https://arxiv.org/abs/2412.03187

🤔 Original Problem:

→ Existing methods for combining multiple LLMs require complex vocabulary alignment and merging of distribution matrices, which introduce noise and errors

→ Directly applying preference optimization to responses from heterogeneous source LLMs suffers from distribution shifts between the source and target models

-----

🔧 Solution in this Paper:

→ WRPO introduces a progressive adaptation strategy that gradually shifts the preferred training signal from the target LLM's own responses to the source LLMs' responses

→ It uses a fusion coefficient α to dynamically balance the internal rewards of source and target LLM responses during training (see the loss sketch after this list)

→ The method constructs preference quadruples (x, y_ws, y_wt, y_l), where y_ws and y_wt are preferred responses from a source LLM and the target LLM respectively, and y_l is a dispreferred response

→ WRPO eliminates the need for vocabulary alignment by fusing knowledge implicitly through preference optimization

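The bullets above can be condensed into a single training objective. Below is a minimal PyTorch sketch of such a weighted-reward preference loss, written from the description in this summary: the function name `wrpo_loss`, the `beta` temperature, and the exact blending form are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def wrpo_loss(policy_logps_ws, policy_logps_wt, policy_logps_l,
              ref_logps_ws, ref_logps_wt, ref_logps_l,
              alpha: float, beta: float = 0.1) -> torch.Tensor:
    """Weighted-reward preference loss over quadruples (x, y_ws, y_wt, y_l).

    Each *_logps tensor holds the summed log-probability of a response under
    the policy (the target model being trained) or a frozen reference model.
    y_ws: preferred response from a source LLM
    y_wt: preferred response from the target LLM
    y_l : dispreferred response
    """
    # DPO-style implicit rewards: beta * log(pi_theta / pi_ref)
    reward_ws = beta * (policy_logps_ws - ref_logps_ws)  # source-preferred reward
    reward_wt = beta * (policy_logps_wt - ref_logps_wt)  # target-preferred reward
    reward_l = beta * (policy_logps_l - ref_logps_l)     # dispreferred reward

    # Fusion coefficient alpha blends the two "winning" rewards, shifting weight
    # from the target's own response toward the source response as alpha grows.
    fused_chosen_reward = alpha * reward_ws + (1.0 - alpha) * reward_wt

    # Bradley-Terry / DPO-style logistic objective on the fused reward margin.
    return -F.logsigmoid(fused_chosen_reward - reward_l).mean()
```

Because the loss only needs log-probabilities of sampled responses, no vocabulary alignment or distribution-matrix merging between the source and target models is required.
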
-----

💡 Key Insights:

→ Implicit fusion through preference optimization is more effective than explicit knowledge distillation

→ Progressive adaptation helps bridge distribution gaps between heterogeneous LLMs

→ Dynamic weighting of internal rewards enables smooth knowledge transfer

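A simple way to realize this progressive adaptation is to anneal the fusion coefficient over training. The linear ramp below is a hypothetical sketch (the function name and schedule shape are assumptions, not taken from the paper):

```python
def fusion_alpha(step: int, total_steps: int, alpha_max: float = 1.0) -> float:
    """Hypothetical linear schedule: the weight on source-LLM responses grows
    from 0 to alpha_max as training proceeds, so the target model adapts
    gradually instead of facing the full distribution shift at once."""
    return alpha_max * min(step / max(total_steps, 1), 1.0)
```
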
-----

📊 Results:

→ Achieved a 55.9% win rate against GPT-4 on the AlpacaEval-2 benchmark

→ Outperformed all source models despite the target having only 8B parameters

→ Surpassed existing fusion methods by 17.5 points on the length-controlled win rate

→ Maintained a 46.2% win rate on the challenging Arena-Hard benchmark
