
"LongReward: Improving Long-context Large Language Models with AI Feedback"

The podcast on this paper is generated with Google's Illuminate.

LongReward enables LLMs to handle long documents by teaching them through AI feedback.

Teaching LLMs to read long stuff by letting AI be the teacher.

📚 https://arxiv.org/abs/2410.21252

🤖 Original Problem:

Current LLMs struggle with long-context tasks due to the poor quality of synthesized training data. How to obtain reliable reward signals for long-context responses remains unexplored, which limits the use of reinforcement learning to improve model performance.

-----

🔧 Solution in this Paper:

• LongReward: Uses off-the-shelf LLMs to score model responses on 4 dimensions:

- Helpfulness: Checks query relevance and requirement fulfillment

- Logicality: Ensures internal consistency

- Faithfulness: Verifies that the response's information matches the context

- Completeness: Confirms the response covers all key points from the context

• Implementation:

- Direct LLM scoring for helpfulness/logicality

- Breaks responses into factual statements for faithfulness check

- Extracts question-relevant info from context chunks for completeness

- Combines with Direct Preference Optimization (DPO) for training (a sketch of the scoring pipeline follows this list)
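
A minimal sketch of how this scoring pipeline could be implemented, assuming two off-the-shelf LLM judge calls (`judge_score` returning a 0-10 rating and `judge_text` returning free text). All function names and prompts here are illustrative assumptions, not the paper's exact prompts or implementation:

```python
from typing import Callable, List

JudgeScore = Callable[[str], float]  # LLM call that returns a 0-10 rating
JudgeText = Callable[[str], str]     # LLM call that returns free text

def score_direct(judge_score: JudgeScore, query: str, response: str, dimension: str) -> float:
    """Helpfulness / logicality: the judge LLM rates the response directly."""
    return judge_score(
        f"Rate the {dimension} of this response to the query on a 0-10 scale.\n"
        f"Query: {query}\nResponse: {response}"
    )

def score_faithfulness(judge_score: JudgeScore, judge_text: JudgeText,
                       response: str, context: str) -> float:
    """Faithfulness: break the response into factual statements, check each against the context."""
    statements = [s for s in judge_text(
        f"List the factual statements in:\n{response}").splitlines() if s.strip()]
    supported = sum(
        judge_score(f"Context: {context}\nStatement: {s}\nSupport level (0-10):") >= 5
        for s in statements
    )
    return 10.0 * supported / max(len(statements), 1)

def score_completeness(judge_score: JudgeScore, judge_text: JudgeText,
                       query: str, context_chunks: List[str], response: str) -> float:
    """Completeness: extract question-relevant info per context chunk, then check coverage."""
    key_points = [
        judge_text(f"Question: {query}\nExtract the relevant information from:\n{chunk}")
        for chunk in context_chunks
    ]
    covered = sum(
        judge_score(f"Response: {response}\nKey point: {p}\nCoverage (0-10):") >= 5
        for p in key_points
    )
    return 10.0 * covered / max(len(key_points), 1)

def long_reward(judge_score: JudgeScore, judge_text: JudgeText,
                query: str, context: str, context_chunks: List[str], response: str) -> float:
    """Overall reward: average of the four dimension scores (assumed aggregation)."""
    return sum([
        score_direct(judge_score, query, response, "helpfulness"),
        score_direct(judge_score, query, response, "logicality"),
        score_faithfulness(judge_score, judge_text, response, context),
        score_completeness(judge_score, judge_text, query, context_chunks, response),
    ]) / 4
```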

-----

💡 Key Insights:

• First method to provide reliable rewards for long-context responses

• Novel long-context RL framework combining LongReward with DPO (see the pairing sketch after this list)

• Can be combined with short-context DPO without performance degradation

• Improves both long-context and short-context instruction-following abilities
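
A minimal sketch of how LongReward scores could be turned into DPO preference pairs, assuming several candidate responses are sampled per long-context query. The `prompt`/`chosen`/`rejected` fields follow common DPO trainer conventions and are an illustration, not the paper's exact data pipeline:

```python
def build_dpo_pairs(examples, score_fn):
    """For each example, pick the highest- and lowest-scored candidates as chosen/rejected."""
    pairs = []
    for ex in examples:
        # Rank sampled candidate responses by their LongReward score (higher is better).
        ranked = sorted(
            ex["candidates"],
            key=lambda r: score_fn(ex["query"], ex["context"], r),
            reverse=True,
        )
        if len(ranked) >= 2:
            pairs.append({
                "prompt": ex["context"] + "\n\n" + ex["query"],
                "chosen": ranked[0],    # best-scored response
                "rejected": ranked[-1], # worst-scored response
            })
    return pairs
```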

-----

📊 Results:

• DPO models trained with LongReward outperformed their SFT counterparts on long-context tasks by:

- 4.9% on Llama-3.1-8B

- 5.5% on GLM-4-9B

• 46% more wins than the SFT baseline in human evaluation

• Higher FactScore showing improved faithfulness

• Better performance on MT-Bench and AlpacaEval2
