"T-REG: Preference Optimization with Token-Level Reward Regularization"

The podcast on this paper is generated with Google's Illuminate.

T-REG: Making AI feedback precise down to each word, not just the whole response.

T-REG introduces token-level reward regularization to enhance LLM preference optimization, combining sequence-level and token-level rewards through self-generated feedback, improving model alignment by up to 4.4% on benchmarks.

https://arxiv.org/abs/2412.02685

🎯 Original Problem:

→ Current LLM preference optimization methods rely on a single sequence-level reward, making it hard to identify which specific parts of a response contribute to its quality.

→ Existing token-level methods depend on external annotators or credit assignment models, raising reliability concerns.

-----

🔧 Solution in this Paper:

→ T-REG introduces token-level reward regularization that combines sequence-level and token-level rewards.

→ The method uses contrastive prompting to enable LLMs to generate their own token-level rewards.

→ These self-generated rewards act as a regularizer during preference optimization, guiding better token-level credit assignment (a rough loss sketch follows below).
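
A minimal PyTorch-style sketch of how such a combined objective might look, assuming per-token log-probabilities and self-generated token rewards are already computed. The function name, the default hyperparameters (beta, alpha), and the exact form of the regularizer are illustrative assumptions, not the paper's implementation.

```python
import torch.nn.functional as F


def treg_loss(policy_logps_chosen, ref_logps_chosen,
              policy_logps_rejected, ref_logps_rejected,
              token_rewards_chosen, token_rewards_rejected,
              mask_chosen, mask_rejected,
              beta=0.1, alpha=0.1):
    """Sketch: sequence-level DPO loss plus a token-level regularizer.

    All *_logps_* tensors hold per-token log-probabilities of shape
    (batch, seq_len); token_rewards_* are self-generated per-token rewards
    (assumed here to lie in [-1, 1]); mask_* is 1 on response tokens and 0
    elsewhere. beta and alpha are hypothetical defaults.
    """
    # Implicit per-token rewards under the DPO parameterization:
    #   r_t = beta * log(pi_theta(y_t | x, y_<t) / pi_ref(y_t | x, y_<t))
    r_chosen = beta * (policy_logps_chosen - ref_logps_chosen) * mask_chosen
    r_rejected = beta * (policy_logps_rejected - ref_logps_rejected) * mask_rejected

    # Sequence-level DPO loss: compare the summed implicit rewards.
    margin = r_chosen.sum(-1) - r_rejected.sum(-1)
    dpo_loss = -F.logsigmoid(margin).mean()

    # Token-level regularization: encourage the model's implicit token
    # rewards to agree with the self-generated ones (one possible
    # instantiation; the paper's exact form may differ).
    agree_chosen = (token_rewards_chosen * r_chosen).sum(-1) / mask_chosen.sum(-1).clamp(min=1)
    agree_rejected = (token_rewards_rejected * r_rejected).sum(-1) / mask_rejected.sum(-1).clamp(min=1)
    reg_loss = -(agree_chosen + agree_rejected).mean()

    return dpo_loss + alpha * reg_loss
```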

-----

💡 Key Insights:

→ LLMs can effectively self-generate token-level rewards through contrastive prompting (an illustrative prompt is sketched after this list).

→ Combining sequence-level and token-level optimization leads to better alignment.

→ Self-generated rewards perform better than model-derived rewards.
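
As a rough illustration of what contrastive prompting could look like, the helper below builds a prompt that shows the model both responses from a preference pair and asks it to score the tokens of the preferred one. The wording and the JSON output format are assumptions for illustration, not the paper's actual template.

```python
def build_contrastive_prompt(instruction: str, chosen: str, rejected: str) -> str:
    """Illustrative contrastive prompt for eliciting token-level rewards."""
    return (
        "You are given an instruction and two responses. Response A was "
        "preferred by human annotators over Response B.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Response A (preferred):\n{chosen}\n\n"
        f"Response B (dispreferred):\n{rejected}\n\n"
        "By contrasting the two responses, assign each token of Response A "
        "a reward between -1 and 1 indicating how much it contributes to "
        "Response A being preferred. Return the result as a JSON list of "
        "[token, reward] pairs."
    )
```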

-----

📊 Results:

→ Improved length-controlled win rate by 24.8% over the SFT baseline.

→ Outperformed DPO by 3.8% on the Alpaca Eval 2 benchmark.

→ Achieved 4.4% better performance on the Arena-Hard benchmark.

→ Demonstrated consistent improvements across different preference optimization methods.
