LongReward enables LLMs to handle long documents by teaching them through AI feedback.
Teaching LLMs to read long stuff by letting AI be the teacher.
📚 https://arxiv.org/abs/2410.21252
🤖 Original Problem:
Current LLMs struggle with long-context tasks due to the poor quality of synthesized training data. How to obtain reliable rewards for long-context responses remains unexplored, which limits the use of reinforcement learning to improve model performance.
-----
🔧 Solution in this Paper:
• LongReward: Uses an off-the-shelf LLM to score model responses on 4 dimensions (see the sketch after this list):
- Helpfulness: Checks query relevance and requirement fulfillment
- Logicality: Ensures internal consistency
- Faithfulness: Verifies information matches context
- Completeness: Confirms coverage of all key points
• Implementation:
- Direct LLM scoring for helpfulness/logicality
- Breaks responses into factual statements for faithfulness check
- Extracts question-relevant info from context chunks for completeness
- Combines with Direct Preference Optimization (DPO) for training
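To make the scoring loop concrete, here is a minimal sketch of a LongReward-style reward function. It is not the authors' code: it assumes a hypothetical `llm_judge(prompt) -> str` callable wrapping an off-the-shelf judge LLM, a 0-10 rating scale, and simple averaging of the four dimension scores.

```python
# Minimal sketch of LongReward-style scoring (not the authors' implementation).
# Assumes a hypothetical `llm_judge(prompt) -> str` callable that wraps an
# off-the-shelf LLM and returns its text completion.

from statistics import mean

DIMENSIONS = ["helpfulness", "logicality", "faithfulness", "completeness"]

def score_dimension(llm_judge, dimension: str, context: str, query: str, response: str) -> float:
    """Ask the judge LLM to rate one dimension on a 0-10 scale."""
    prompt = (
        f"Rate the {dimension} of the response on a scale of 0 to 10.\n"
        f"Context (truncated for illustration):\n{context[:4000]}\n\n"
        f"Question: {query}\n\nResponse: {response}\n\n"
        "Answer with a single integer."
    )
    raw = llm_judge(prompt)
    try:
        return float(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # fall back if the judge output is unparsable

def long_reward(llm_judge, context: str, query: str, response: str) -> float:
    """Average the four dimension scores into a single reward."""
    scores = [score_dimension(llm_judge, d, context, query, response) for d in DIMENSIONS]
    return mean(scores)
```

In the paper, faithfulness and completeness use extra steps (statement decomposition and chunk-level extraction); the sketch collapses those into a single judge call for brevity.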
-----
💡 Key Insights:
• First method to provide reliable rewards for long-context responses
• Novel long-context RL framework combining LongReward with DPO (a preference-pair sketch follows below)
• Can be combined with short-context DPO without performance degradation
• Improves both long- and short-context instruction-following abilities
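The RL side is straightforward once a reward exists: sample several candidate responses per long-context prompt, score them with LongReward, and use the best and worst as a DPO preference pair. The sketch below assumes the `long_reward` helper from the previous snippet and a hypothetical `policy_model.sample(prompt) -> str` generator.

```python
# Sketch of building DPO preference pairs with LongReward (illustrative only).
# Reuses the hypothetical `long_reward` and `llm_judge` pieces from the
# previous snippet, plus a hypothetical `policy_model.sample(prompt) -> str`.

def build_preference_pair(policy_model, llm_judge, context, query, n_samples=4):
    """Sample candidates, rank them with LongReward, keep best/worst as a DPO pair."""
    prompt = f"{context}\n\nQuestion: {query}"
    candidates = [policy_model.sample(prompt) for _ in range(n_samples)]
    ranked = sorted(candidates, key=lambda r: long_reward(llm_judge, context, query, r))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}
```

The resulting pairs can be fed to any standard DPO trainer; the paper reports that mixing them with short-context DPO data does not hurt performance.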
-----
📊 Results:
• DPO models with LongReward outperformed SFT models by:
- 4.9% on Llama-3.1-8B
- 5.5% on GLM-4-9B
• 46% more wins against SFT baseline in human evaluation
• Higher FactScore, indicating improved faithfulness
• Better performance on MT-Bench and AlpacaEval2