RAG systems can fail on tasks where a standalone LLM succeeds - here's a systematic exploration of why.
This paper investigates how different design factors affect the performance of Retrieval-Augmented Generation (RAG) systems, focusing on retrieved document types, retrieval recall, document selection, and prompt techniques.
The study provides actionable guidelines for building reliable RAG systems across code and QA tasks.
-----
https://arxiv.org/abs/2411.19463v1
🤔 Original Problem:
RAG systems, despite their promise, face reliability and stability challenges. Their complex nature makes it difficult for developers to optimize performance and diagnose issues effectively.
-----
🛠️ Solution in this Paper:
→ The study analyzes RAG systems across three code datasets and three QA datasets using two LLMs.
→ They examine four key design factors: retrieval document type (oracle, distracting, irrelevant), retrieval recall, document selection, and prompt techniques (a toy categorization-and-recall sketch follows this list).
→ The research evaluates system correctness and confidence using pass rates and perplexity metrics.
→ They introduce novel methods to identify and measure the impact of irrelevant documents on code generation.
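To make the document-type framing and retrieval recall concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the authors' code: the string-overlap heuristic for "distracting" and all function and variable names are hypothetical.

```python
# Hypothetical sketch of the document-type framing: oracle = contains the
# needed information, distracting = topically related but not useful,
# irrelevant = unrelated. The categorization heuristic and names below are
# illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    doc_id: str
    text: str

def categorize(doc: RetrievedDoc, oracle_ids: set[str], query_terms: set[str]) -> str:
    """Bucket a retrieved document as oracle / distracting / irrelevant."""
    if doc.doc_id in oracle_ids:
        return "oracle"
    # Crude topical-overlap heuristic: shares query terms but is not an oracle doc.
    if query_terms & set(doc.text.lower().split()):
        return "distracting"
    return "irrelevant"

def retrieval_recall(retrieved: list[RetrievedDoc], oracle_ids: set[str]) -> float:
    """Fraction of ground-truth (oracle) documents that the retriever returned."""
    if not oracle_ids:
        return 0.0
    hit = {d.doc_id for d in retrieved} & oracle_ids
    return len(hit) / len(oracle_ids)

# Example: 1 of 2 oracle docs retrieved -> recall = 0.5
docs = [RetrievedDoc("d1", "binary search on a sorted list"),
        RetrievedDoc("d7", "pasta recipes for beginners")]
print(retrieval_recall(docs, oracle_ids={"d1", "d3"}))
```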
-----
💡 Key Insights:
→ Distracting documents significantly degrade RAG performance
→ Irrelevant documents surprisingly improve code generation by up to 15.6%
→ Higher retrieval recall doesn't guarantee better performance
→ Perplexity is a reliable confidence signal for QA tasks but unreliable for code tasks (see the sketch after this list)
→ Most prompt techniques underperform compared to simple zero-shot prompts
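For reference, perplexity here is the exponentiated mean negative log-likelihood of the generated tokens. The sketch below shows one generic way to compute it with a causal LM; it is not the paper's evaluation pipeline, and the model choice is an arbitrary placeholder.

```python
# Minimal sketch of perplexity as a confidence proxy for a generated answer.
# Generic formulation: exp(mean token negative log-likelihood).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

# Lower perplexity ~ higher model confidence. The paper finds this tracks
# correctness reasonably well for QA answers but not for generated code.
print(perplexity("Paris is the capital of France."))
```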
-----
📊 Results:
→ RAG systems with perfect retrieval recall still fail on 12% of code tasks where standalone LLMs succeed
→ Improved code generation pass rate by 15.6% using irrelevant documents