ReARTER improves multi-step reasoning in Retrieval-Augmented Generation by using Trustworthy Process Rewarding to guide both post-training and test-time scaling.
This enhances reasoning path quality and refinement accuracy.
-----
https://arxiv.org/abs/2501.07861
Original Problem 🤔:
→ Existing Retrieval-Augmented Generation (RAG) systems struggle with complex multi-step reasoning: their process rewards come without explanations, their reward training data is biased, and their reasoning potential is left under-optimized during post-training.
-----
Solution in this Paper 💡:
→ During post-training, ReARTER uses Monte Carlo Tree Search guided by Trustworthy Process Rewarding to search for high-quality reasoning paths and optimize the model.
→ During test-time scaling, ReARTER scores each reasoning step with a Process Reward Model (PRM) and refines low-scoring steps using explanations from a Process Explanation Model (PEM); see the sketch below.
→ To keep process rewarding trustworthy, ReARTER aligns the PEM with the PRM, mitigates bias in the PRM training data, and corrects early-step bias in PRM scores.
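Below is a minimal Python sketch of the test-time loop described above. The object interfaces (generator, prm, pem and their methods) are hypothetical placeholders for illustration, not the paper's actual API, and the threshold-based trigger for refinement is an assumption.

```python
# Hypothetical sketch of ReARTER-style test-time scaling:
# a PRM scores each candidate reasoning step, and a PEM explains
# low-scoring steps so the generator can refine them.

def reason_with_trustworthy_rewards(question, generator, prm, pem,
                                    max_steps=8, threshold=0.5):
    """Build a reasoning path step by step, refining untrusted steps."""
    steps = []
    for _ in range(max_steps):
        # Generator proposes the next reasoning step (retrieval included).
        step = generator.next_step(question, steps)

        # PRM assigns a scalar trustworthiness score to the candidate step.
        score = prm.score(question, steps, step)

        if score < threshold:
            # PEM produces a natural-language critique of the weak step;
            # the generator uses it to propose a refined step.
            critique = pem.explain(question, steps, step)
            step = generator.refine_step(question, steps, step, critique)

        steps.append(step)
        if generator.is_final(step):
            break
    return steps
```

The same PRM scores can also serve as value estimates for the Monte Carlo Tree Search rollouts used during post-training, so the search prefers reasoning paths that the reward model judges trustworthy.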
-----
Key Insights from this Paper 👍:
→ Combining post-training and test-time scaling significantly improves RAG reasoning performance.
→ Trustworthy Process Rewarding improves reasoning path quality during post-training and search accuracy during testing.
→ Addressing the trustworthiness challenges of PRMs is crucial for effective multi-step reasoning.
-----
Results:
→ ReARTER significantly outperforms baselines across multiple multi-step reasoning benchmarks with both GPT-4o-mini and Llama-3.1-8B as generators.
→ Ablation studies confirm the importance of each component, particularly the unbiased PRM training data and the temporal-difference (TD)-based look-ahead search.
→ Aligning the PEM with the PRM substantially improves refinement and reasoning quality: the rate at which refinement improves process reward scores rises from roughly 50% to over 70%, alongside overall accuracy gains on multiple datasets.