"CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation"

The podcast below was generated with Google's Illuminate.

Confidence-Reward driven Preference Optimization (CRPO) boosts machine translation by selecting data where the model struggles most, leading to more effective learning.

-----

Paper - https://arxiv.org/abs/2501.13927

Original Problem 🤔:

→ Existing preference optimization methods for LLMs in machine translation mainly focus on reward values for data selection.

→ They often overlook the model's confidence, which is crucial for effective learning.

-----

Solution in this Paper 💡:

→ This paper introduces Confidence-Reward driven Preference Optimization (CRPO).

→ CRPO incorporates both reward scores and model confidence for preference data selection.

→ It prioritizes challenging sentences where the model is uncertain or underperforms, leading to more effective learning.

→ Two CR-Score formulations are presented: CR+ (based on the change in loss) and CR× (based on the loss value). Both combine the reward difference with the reference-policy likelihood difference, but CR+ adds the two terms while CR× multiplies them, which simplifies hyperparameter tuning for CR× (a sketch follows below).
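
A minimal sketch of how the two scores could be computed for one candidate preference pair, assuming access to reward-model scores r(x, y) and reference-policy log-likelihoods for the chosen and rejected translations. The exact signs, scaling, and the weighting hyperparameter `alpha` are illustrative assumptions, not the paper's precise formulation.

```python
def cr_plus(reward_chosen: float, reward_rejected: float,
            ref_logp_chosen: float, ref_logp_rejected: float,
            alpha: float = 1.0) -> float:
    """CR+ sketch: reward difference plus a weighted reference-likelihood
    difference. A pair scores high when the reward gap is large yet the
    reference policy still assigns more likelihood to the rejected
    translation, i.e. the model is confidently wrong.
    `alpha` is an assumed weighting hyperparameter."""
    reward_diff = reward_chosen - reward_rejected
    confidence_diff = ref_logp_rejected - ref_logp_chosen
    return reward_diff + alpha * confidence_diff


def cr_times(reward_chosen: float, reward_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float) -> float:
    """CR× sketch: the same two quantities combined multiplicatively,
    which removes the need for a weighting hyperparameter."""
    reward_diff = reward_chosen - reward_rejected
    confidence_diff = ref_logp_rejected - ref_logp_chosen
    return reward_diff * confidence_diff
```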

-----

Key Insights from this Paper 🧐:

→ Jointly considering reward and model confidence is crucial for efficient data selection in preference optimization.

→ Selecting challenging examples maximizes the learning potential of each data point (see the selection sketch after this list).

→ CRPO's applicability extends to both decoder-only LLMs and encoder-decoder models.
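
Building on the score sketch above, here is a hypothetical selection step: score every candidate preference pair with a confidence-reward score and keep only the top-k most challenging pairs for preference optimization (e.g., DPO). The pair representation and the value of `k` are illustrative assumptions, not the paper's exact pipeline.

```python
from typing import Callable, List, Tuple

# One candidate pair: (reward_chosen, reward_rejected,
#                      ref_logp_chosen, ref_logp_rejected)
Pair = Tuple[float, float, float, float]


def select_top_k(pairs: List[Pair],
                 score_fn: Callable[[float, float, float, float], float],
                 k: int) -> List[Pair]:
    """Rank candidate preference pairs by a confidence-reward score and
    keep the k most challenging ones for preference-optimization training."""
    return sorted(pairs, key=lambda p: score_fn(*p), reverse=True)[:k]


# Illustrative usage with the CR× sketch defined earlier:
# hard_pairs = select_top_k(candidate_pairs, cr_times, k=1000)
```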

-----

Results ✨:

→ CRPO outperforms existing methods like RSO, RS-DPO, and MBR Score in translation accuracy and data efficiency, as shown by higher COMET and BLEURT scores averaged across 10 translation directions.

→ For example, on ALMA-7B, CRPO+ achieves an average COMET22 score of 0.9311, compared to RSO's 0.9277 and RS-DPO-1's 0.9189.

→ CRPO's effectiveness is demonstrated on both decoder-only LLMs and encoder-decoder models like NLLB.
