Confidence-Reward driven Preference Optimization (CRPO) boosts machine translation by selecting data where the model struggles most, leading to more effective learning.
-----
Paper - https://arxiv.org/abs/2501.13927
Original Problem 🤔:
→ Existing preference optimization methods for LLMs in machine translation select data based mainly on reward values.
→ They overlook the model's confidence, i.e., how likely the model already is to produce each candidate, which determines how much it can actually learn from a pair.
-----
Solution in this Paper 💡:
→ This paper introduces Confidence-Reward driven Preference Optimization (CRPO).
→ CRPO incorporates both reward scores and model confidence for preference data selection.
→ It prioritizes challenging sentences where the model is uncertain or underperforms, leading to more effective learning.
→ Two CR-Score formulations are presented: CR+ (based on the change in loss) and CR× (based on the loss value). Both combine the reward difference with the reference-policy likelihood difference, but CR+ adds them (which requires a weighting hyperparameter) while CR× multiplies them, which simplifies hyperparameter tuning (see the sketch below).
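
Below is a minimal sketch of how the two scores could be computed for a single preference pair, assuming CR+ adds the reward gap to a reference-policy confidence gap via a weighting hyperparameter, and CR× multiplies the two gaps. The function name `cr_scores`, the `alpha` parameter, and the exact gap definitions are illustrative assumptions, not the paper's exact formulation.

```python
def cr_scores(reward_chosen, reward_rejected,
              ref_logp_chosen, ref_logp_rejected, alpha=1.0):
    """Score one (chosen, rejected) translation pair for CRPO-style data selection.

    reward_*    : scalar reward-model scores for each candidate translation
    ref_logp_*  : log-likelihood of each candidate under the frozen reference policy
    alpha       : hypothetical weight balancing reward gap vs. confidence gap (CR+ only)
    """
    # How much better the chosen translation is according to the reward model.
    reward_gap = reward_chosen - reward_rejected

    # Large when the reference policy is more confident in the *worse* translation,
    # i.e., the pair is challenging and informative for the model.
    confidence_gap = ref_logp_rejected - ref_logp_chosen

    cr_plus = reward_gap + alpha * confidence_gap   # additive form: needs alpha
    cr_times = reward_gap * confidence_gap          # multiplicative form: no balance hyperparameter
    return cr_plus, cr_times
```

With the multiplicative form, a pair only scores high when both the reward gap and the confidence mismatch are large, which is one way to read why CR× sidesteps the weighting hyperparameter.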
-----
Key Insights from this Paper 🧐:
→ Jointly considering reward and model confidence is crucial for efficient data selection in preference optimization.
→ Selecting challenging examples maximizes the learning potential from each data point (see the selection example after this list).
→ CRPO's applicability extends to both decoder-only LLMs and encoder-decoder models.
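
A hypothetical usage example of that selection step, reusing `cr_scores` from the sketch above: sampled candidate pairs are ranked by their CR× score and only the highest-scoring (most informative) pairs are kept for preference optimization. All numbers and the pair layout are made up for illustration.

```python
candidate_pairs = [
    # (reward_chosen, reward_rejected, ref_logp_chosen, ref_logp_rejected)
    (0.92, 0.80, -35.1, -20.4),   # model is confident in the worse translation -> high CR score
    (0.90, 0.60, -18.2, -41.7),   # model already prefers the better one -> low CR score
    (0.88, 0.85, -25.0, -24.0),   # small gaps on both sides -> little to learn
]

# Rank by the multiplicative CR score and keep the top-k hardest pairs.
scored = [(cr_scores(*pair)[1], pair) for pair in candidate_pairs]
top_k = [pair for _, pair in sorted(scored, key=lambda t: t[0], reverse=True)][:2]
print(top_k)  # these pairs would be passed on to DPO-style preference training
```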
-----
Results ✨:
→ CRPO outperforms existing methods like RSO, RS-DPO, and MBR Score in translation accuracy and data efficiency, as shown by higher COMET and BLEURT scores averaged across 10 translation directions.
→ For example, on ALMA-7B, CRPO+ achieves an average COMET22 score of 0.9311, compared to 0.9277 for RSO and 0.9189 for RS-DPO-1.
→ CRPO's effectiveness is demonstrated on both decoder-only LLMs and encoder-decoder models like NLLB.