RAG-RewardBench introduces a benchmark for evaluating reward models in Retrieval-Augmented Generation (RAG) settings, aimed at improving alignment with human preferences.
https://arxiv.org/abs/2412.13746
🔍 Methods in this Paper:
→ RAG-RewardBench designs four crucial RAG-specific scenarios: multi-hop reasoning, fine-grained citation, appropriate abstain, and conflict robustness.
→ It incorporates 18 RAG subsets, six retrievers, and 24 RALMs to increase data source diversity.
→ The benchmark adopts an LLM-as-a-judge approach for efficient preference annotation, showing strong correlation with human annotations (a minimal sketch of this step follows the list).
→ RAG-RewardBench includes 1,485 high-quality preference pairs to facilitate RALM alignment.
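
A hedged sketch of the LLM-as-a-judge annotation step: given a question, its retrieved context, and two candidate responses, a judge model is prompted to pick the preferred one. The prompt wording, judge model name, and helper function here are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative sketch of LLM-as-a-judge preference annotation (not the paper's exact prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the judge model below is a placeholder choice

def judge_preference(question: str, context: str, response_a: str, response_b: str) -> str:
    """Ask a judge LLM which candidate RAG response is better; returns 'A' or 'B'."""
    prompt = (
        "You are judging two answers to a question given retrieved context.\n"
        f"Question: {question}\n\nRetrieved context:\n{context}\n\n"
        f"Answer A: {response_a}\n\nAnswer B: {response_b}\n\n"
        "Consider factual grounding, citation quality, and whether the answer should abstain. "
        "Reply with exactly one letter: A or B."
    )
    out = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model, not necessarily the one used in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out.choices[0].message.content.strip()[:1].upper()
```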
-----
💡 Key Insights from this Paper:
→ The top-ranked RM achieves only 78.3% accuracy, highlighting the benchmark's difficulty (see the accuracy sketch after this list)
→ Generative/discriminative RMs with 27B-70B parameters perform best
→ Current RALMs show minimal improvement in preference alignment (+0.6%)
→ Strong correlation between benchmark performance and downstream RAG tasks
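
For context on the 78.3% figure: pairwise accuracy on a preference benchmark is typically the fraction of pairs where the reward model scores the chosen response above the rejected one. A minimal sketch under that assumption, using a sequence-classification-style reward model (the checkpoint name and the pair schema are placeholders, not the paper's release):

```python
# Sketch: pairwise accuracy of a reward model on preference pairs (checkpoint is a placeholder).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "your-org/your-reward-model"  # placeholder; any scalar-output RM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).eval()

def score(prompt: str, response: str) -> float:
    """Scalar reward for a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

def pairwise_accuracy(pairs) -> float:
    """pairs: iterable of dicts with 'prompt', 'chosen', 'rejected' keys (assumed schema)."""
    correct = sum(
        score(p["prompt"], p["chosen"]) > score(p["prompt"], p["rejected"]) for p in pairs
    )
    return correct / len(pairs)
```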