Synthetic preferences help LLMs understand when humans naturally disagree on what's "better"
This paper tackles a key limitation of binary preference tuning in LLMs: a single chosen/rejected label cannot capture the diverse preferences of real-world users. It proposes a margin-based regularization technique that uses synthetic preference judgments to better align reward models with aggregate user preferences.
-----
https://arxiv.org/abs/2412.03822
🤔 Original Problem:
Current LLM preference tuning relies on binary choices from single annotators, which do not reflect the diverse preferences of millions of real-world users, especially on subjective judgments such as safety, toxicity, and output quality.
-----
🔧 Solution in this Paper:
→ The paper introduces a taxonomy identifying two key dimensions of subjectivity: Plurality of Responses and Indistinguishability of Responses.
→ It proposes augmenting binary preference datasets with synthetic preference judgments to estimate potential user disagreement.
→ A margin-based regularization term is incorporated into the reward model training objective, scaled according to estimated disagreement levels.
→ The margin term converts binary preferences into a cardinal measure of group preference strength (a hedged sketch of this loss follows below).
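
A minimal sketch of how such a disagreement-scaled margin could enter a reward-model objective. The function names, the linear margin scaling, and the toy numbers are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def preference_strength(synthetic_votes):
    """Cardinal strength of the group preference, estimated from synthetic judgments.

    synthetic_votes: list of 0/1 flags, 1 if a synthetic annotator prefers the
    'chosen' response over the 'rejected' one. Returns a value in [0, 1] range
    mapped to [-1, 1]: +1 = unanimous for chosen, 0 = evenly split (maximal disagreement).
    """
    votes = torch.as_tensor(synthetic_votes, dtype=torch.float)
    return 2.0 * votes.mean() - 1.0

def margin_reward_loss(r_chosen, r_rejected, strength, max_margin=1.0):
    """Bradley-Terry reward-model loss with a disagreement-scaled margin.

    Clear-cut pairs (strength near 1) must be separated by a large reward gap;
    contested, subjective pairs (strength near 0) are barely penalized for a
    small gap, so the reward model is not forced to take a hard stance on them.
    """
    margin = max_margin * strength
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Toy usage: 4 of 5 synthetic annotators prefer the chosen response.
strength = preference_strength([1, 1, 1, 1, 0])   # tensor(0.6000)
r_chosen = torch.tensor([1.2])                    # reward of chosen response
r_rejected = torch.tensor([0.4])                  # reward of rejected response
loss = margin_reward_loss(r_chosen, r_rejected, strength)
```

The design intent captured here: the required reward gap shrinks as synthetic annotators disagree more, so the reward model is not pushed to make confident distinctions on genuinely subjective pairs.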
-----
💡 Key Insights:
→ Reward models correlate weakly with user preferences in subjective cases where different users disagree
→ Single-annotator binary judgments are unreliable signals for diverse user bases
→ Synthetic annotations can effectively approximate preference heterogeneity
→ Margin-based regularization improves alignment with aggregate preferences
-----
📊 Results:
→ Improved correlation between reward model predictions and human preferences in subjective examples
→ Better performance on both in-domain and out-of-domain datasets
→ Better-calibrated preference signals when moving from single-annotator training data to multi-user settings