This paper introduces WRPO (Weighted-Reward Preference Optimization), a method for fusing the capabilities of multiple heterogeneous LLMs into a single smaller target model without vocabulary alignment or distribution merging. WRPO combines preference optimization with a progressive adaptation strategy to transfer knowledge smoothly while handling distribution shifts between source and target models.
-----
https://arxiv.org/abs/2412.03187
🤔 Original Problem:
→ Existing methods for combining multiple LLMs require vocabulary alignment and merging of probability distribution matrices, which introduce noise and errors
→ Applying preference optimization directly to responses from heterogeneous source LLMs suffers from distribution shifts between the source and target models
-----
🔧 Solution in this Paper:
→ WRPO introduces a progressive adaptation strategy that gradually shifts from target LLM responses to source LLM responses
→ It uses a fusion coefficient α to dynamically balance the internal rewards of source and target LLM responses during training (see the loss sketch after this list)
→ The method constructs preference quadruples (x, y_ws, y_wt, y_l), pairing a preferred response from the source LLMs (y_ws) and one from the target LLM (y_wt) against a dispreferred target response (y_l)
→ WRPO eliminates the need for vocabulary alignment by fusing knowledge implicitly through preference optimization
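The objective can be viewed as a DPO-style loss in which the preferred-side internal reward is a weighted blend of the source-preferred and target-preferred responses. Below is a minimal PyTorch-style sketch; the function name, argument layout, and β default are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the WRPO objective (names and defaults are illustrative).
import torch.nn.functional as F

def wrpo_loss(logp_ws, logp_wt, logp_l,
              ref_logp_ws, ref_logp_wt, ref_logp_l,
              alpha, beta=0.1):
    """DPO-style loss over preference quadruples (x, y_ws, y_wt, y_l).

    Each logp_* / ref_logp_* is a batch tensor of summed sequence
    log-probabilities under the policy / frozen reference model.
    alpha blends the internal rewards of the source-preferred (y_ws)
    and target-preferred (y_wt) responses; y_l is dispreferred.
    """
    r_ws = beta * (logp_ws - ref_logp_ws)   # internal reward, source-LLM response
    r_wt = beta * (logp_wt - ref_logp_wt)   # internal reward, target-LLM response
    r_l  = beta * (logp_l  - ref_logp_l)    # internal reward, dispreferred response
    margin = alpha * r_ws + (1.0 - alpha) * r_wt - r_l
    return -F.logsigmoid(margin).mean()
```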
-----
💡 Key Insights:
→ Implicit fusion through preference optimization is more effective than explicit knowledge distillation
→ Progressive adaptation helps bridge distribution gaps between heterogeneous LLMs
→ Dynamic weighting of internal rewards via the fusion coefficient α enables smooth knowledge transfer (a schedule sketch follows below)
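One simple way to realize the progressive adaptation is to ramp α from 0 toward its maximum over training, so the preferred-side reward shifts from the target LLM's own responses to the source LLMs' responses. The linear shape and the alpha_max endpoint below are assumptions for illustration.

```python
def fusion_coefficient(step, total_steps, alpha_max=1.0):
    """Linearly ramp the fusion coefficient α from 0 to alpha_max.

    Early in training the target-preferred response y_wt dominates the
    preferred-side reward; later the source-preferred response y_ws takes
    over. The linear schedule and endpoint are illustrative assumptions.
    """
    return alpha_max * min(step / max(total_steps, 1), 1.0)
```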
-----
📊 Results:
→ Achieved a 55.9% win rate against GPT-4 on the AlpacaEval-2 benchmark
→ Outperformed all source models despite the target having only 8B parameters
→ Surpassed existing fusion baselines by 17.5 points on the length-controlled win rate
→ Maintained a 46.2% win rate on the challenging Arena-Hard benchmark