T-REG: Making AI feedback precise down to each word, not just the whole response.
T-REG introduces token-level reward regularization for LLM preference optimization: it combines sequence-level and token-level rewards through self-generated feedback, improving model alignment by up to 4.4% on benchmarks.
https://arxiv.org/abs/2412.02685
🎯 Original Problem:
→ Current LLM preference optimization relies on sequence-level rewards, making it hard to identify which specific parts of a response contribute to its quality.
→ Existing token-level methods depend on external annotators or credit assignment models, raising reliability concerns.
-----
🔧 Solution in this Paper:
→ T-REG introduces token-level reward regularization that combines sequence-level and token-level rewards.
→ The method uses contrastive prompting to enable the LLM to generate its own token-level rewards (see the sketch after this list).
→ These self-generated rewards act as regularization during preference optimization, guiding better token-level credit assignment.
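One way to picture the contrastive-prompting step is sketched below: score the same response tokens under a "high-quality" and a "low-quality" system prompt and use the per-token log-probability gap as the token-level reward. The prompt wording, the `contrastive_token_rewards` helper, and the Hugging Face-style chat-template calls are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: contrastive prompting for self-generated token rewards.
# The system prompts, helper name, and scoring details are assumptions.
import torch


@torch.no_grad()
def contrastive_token_rewards(
    model, tokenizer, user_prompt, response,
    pos_instr="You are a helpful assistant. Write a high-quality response.",
    neg_instr="You are an unhelpful assistant. Write a low-quality response.",
):
    def per_token_logps(system_instr):
        # Build the chat prompt under the given system instruction.
        prompt_ids = tokenizer.apply_chat_template(
            [{"role": "system", "content": system_instr},
             {"role": "user", "content": user_prompt}],
            add_generation_prompt=True, return_tensors="pt",
        )
        resp_ids = tokenizer(response, add_special_tokens=False,
                             return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, resp_ids], dim=-1)

        # Log-probability the model assigns to each actual next token.
        logits = model(input_ids).logits[:, :-1]
        logps = torch.log_softmax(logits, dim=-1)
        targets = input_ids[:, 1:]
        token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

        # Keep only the positions that predict response tokens.
        resp_start = prompt_ids.shape[-1]
        return token_logps[:, resp_start - 1:]

    # Token-level reward: how much more likely the token is under the
    # "high-quality" instruction than under the "low-quality" one.
    return per_token_logps(pos_instr) - per_token_logps(neg_instr)
```

Higher values flag tokens the model itself associates with a good response; these are the rewards that the regularized objective consumes.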
-----
💡 Key Insights:
→ LLMs can effectively self-generate token-level rewards through contrastive prompting
→ Combining sequence-level and token-level optimization leads to better alignment (a minimal sketch of such a combined objective follows this list)
→ Self-generated token-level rewards outperform rewards derived from external models
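To make the combination concrete, here is a minimal sketch of a DPO-style loss with a token-level regularization term driven by self-generated token rewards; the regularizer form and the `alpha`/`beta` weights are assumptions for illustration, not the paper's exact objective.

```python
# Minimal sketch (assumed form, not the paper's code): a sequence-level DPO loss
# plus a token-level regularizer weighted by self-generated token rewards.
import torch.nn.functional as F


def token_regularized_dpo_loss(
    policy_chosen_logps, policy_rejected_logps,  # (B, T) per-token log-probs under the policy
    ref_chosen_logps, ref_rejected_logps,        # (B, T) per-token log-probs under the frozen reference
    chosen_token_rewards,                        # (B, T) self-generated token rewards for the chosen response
    chosen_mask, rejected_mask,                  # (B, T) 1 on response tokens, 0 on prompt/padding
    beta=0.1,                                    # DPO temperature (assumed value)
    alpha=0.1,                                   # regularizer weight (assumed value)
):
    # Per-token implicit rewards: log-ratio of policy to reference, masked to response tokens.
    chosen_ratio = (policy_chosen_logps - ref_chosen_logps) * chosen_mask
    rejected_ratio = (policy_rejected_logps - ref_rejected_logps) * rejected_mask

    # Sequence-level preference loss (standard DPO) on the summed per-token rewards.
    margin = beta * (chosen_ratio.sum(-1) - rejected_ratio.sum(-1))
    dpo_loss = -F.logsigmoid(margin).mean()

    # Token-level regularizer (one possible form): a reward-weighted term that gives
    # more credit to chosen-response tokens the model itself rates highly.
    token_reg = -(chosen_token_rewards * chosen_ratio).sum(-1) / chosen_mask.sum(-1).clamp(min=1)

    return dpo_loss + alpha * token_reg.mean()
```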
-----
📊 Results:
→ Improved length-controlled win rate by 24.8% over the SFT baseline
→ Outperformed DPO by 3.8% on the AlpacaEval 2 benchmark
→ Achieved 4.4% better performance on the Arena-Hard benchmark
→ Demonstrated consistent improvements across different preference optimization methods