Inference Scaling for Long-Context Retrieval Augmented Generation
Solves the RAG performance plateau by optimizing computation allocation with inference scaling
Nice @GoogleDeepMind Paper
• Achieves up to 58.9% performance gains over standard RAG approaches.
• Shows near-linear scaling between inference computation and RAG performance.
Original Problem:
RAG performance plateaus with increasing context length due to ineffective utilization of knowledge and LLMs' limited ability to process ultra-long sequences.
Solution in this Paper:
• DRAG: Incorporates extensive retrieved documents and in-context examples in a single long-context prompt
• IterDRAG: Decomposes complex queries into sub-queries with interleaved retrieval and generation
• Computation allocation model: Predicts optimal inference parameters for RAG (a budget-search sketch follows this list)
• Inference scaling laws: Quantify the relationship between RAG performance and inference computation
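To make the computation-allocation idea concrete, here is a minimal sketch (not the paper's code) of choosing a DRAG/IterDRAG configuration under a fixed compute budget, where the budget is the effective context length, i.e. the total input tokens summed over all generation calls. The token estimates, the grid values, and the helpers `effective_context_length`, `best_config`, and the caller-supplied `predict_accuracy` are illustrative assumptions, not names from the paper.

```python
# Minimal sketch (assumptions, not the paper's code): enumerate RAG configurations
# under a fixed compute budget, measured as the effective context length, i.e. the
# total number of input tokens summed over every generation call.
from itertools import product

def effective_context_length(num_docs, num_shots, num_iterations,
                             tokens_per_doc=1024, tokens_per_shot=512,
                             base_prompt_tokens=256):
    # Input tokens per call, times the number of interleaved generation calls.
    per_call = (base_prompt_tokens
                + num_docs * tokens_per_doc
                + num_shots * tokens_per_shot)
    return per_call * num_iterations

def best_config(budget_tokens, predict_accuracy):
    # Grid-search documents / demonstrations / iterations and keep the
    # configuration with the highest predicted accuracy within the budget.
    best, best_score = None, float("-inf")
    for docs, shots, iters in product([1, 5, 10, 20, 50], [0, 1, 2, 4, 8], [1, 2, 4]):
        if effective_context_length(docs, shots, iters) > budget_tokens:
            continue
        score = predict_accuracy(docs, shots, iters)
        if score > best_score:
            best, best_score = (docs, shots, iters), score
    return best, best_score
```

In the paper, the predicted quality comes from the fitted computation allocation model; here it is simply a function supplied by the caller.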
Key Insights from this Paper:
• RAG performance scales almost linearly with increasing inference computation when it is optimally allocated
• DRAG excels with shorter context lengths, while IterDRAG scales better for longer contexts
• Performance gains diminish beyond 1M tokens, suggesting limitations in long-context modeling
• Optimal configurations can be predicted using the computation allocation model
Results:
• Up to 58.9% performance gains on benchmark datasets compared to standard RAG
• DRAG and IterDRAG consistently outperform baselines across a variety of tasks
• The computation allocation model achieves 96.6% of optimal performance when generalizing to unseen domains
• Predictions remain accurate when extrapolating to longer contexts, for target lengths below 1M tokens
This paper explores how to effectively scale up inference computation for RAG tasks using long-context large language models (LLMs). The authors introduce two main strategies:
Demonstration-based RAG (DRAG): This approach incorporates both extensive retrieved documents and in-context examples to utilize the capabilities of long-context LLMs.
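A minimal sketch of the DRAG idea, assuming hypothetical `retrieve(query, k)` and `llm_generate(prompt)` helpers and a simple prompt layout; the paper's actual prompt format may differ:

```python
# Minimal sketch (assumptions, not the paper's code): DRAG packs k-shot
# demonstrations and the retrieved documents for the test question into one
# long-context prompt and answers in a single generation call.

def drag_answer(query, demonstrations, retrieve, llm_generate, num_docs=20):
    parts = []
    for demo in demonstrations:  # each demo: {"docs": [...], "question": ..., "answer": ...}
        parts.append("\n".join(demo["docs"]))
        parts.append(f"Question: {demo['question']}\nAnswer: {demo['answer']}")
    docs = retrieve(query, k=num_docs)          # documents retrieved for the test question
    parts.append("\n".join(docs))
    parts.append(f"Question: {query}\nAnswer:")
    return llm_generate("\n\n".join(parts))     # one pass over the long prompt
```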
Iterative demonstration-based RAG (IterDRAG): This method decomposes complex queries into simpler sub-queries and uses interleaved retrieval and generation steps to construct reasoning chains.
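And a corresponding sketch of the IterDRAG loop under the same assumptions; the "Follow up" / "Intermediate answer" / "Final answer" markers are illustrative stand-ins for whatever self-ask-style prompt format is actually used:

```python
# Minimal sketch (assumptions, not the paper's code): IterDRAG interleaves
# sub-query generation, retrieval for each sub-query, and intermediate answers,
# stopping once the model emits a final answer.

def iterdrag_answer(query, retrieve, llm_generate, num_docs=5, max_iterations=5):
    context = "\n".join(retrieve(query, k=num_docs)) + f"\nQuestion: {query}"
    for _ in range(max_iterations):
        step = llm_generate(context + "\nNext step:")          # sub-query or final answer
        if step.startswith("Final answer:"):
            return step.removeprefix("Final answer:").strip()
        sub_query = step.removeprefix("Follow up:").strip()
        new_docs = "\n".join(retrieve(sub_query, k=num_docs))  # interleaved retrieval
        intermediate = llm_generate(
            context + f"\n{new_docs}\nIntermediate answer to: {sub_query}\n"
        )
        context += (
            f"\n{new_docs}\nFollow up: {sub_query}\nIntermediate answer: {intermediate}"
        )
    return llm_generate(context + "\nFinal answer:")           # budget exhausted: answer directly
```

The key point is that each iteration spends additional retrieval and generation compute on a narrower sub-query, which is what lets IterDRAG keep improving at longer effective context lengths.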
Figure: Evaluation accuracy of Gemini 1.5 Flash with different methods (zero-shot QA, many-shot QA, RAG with an optimal number of documents, DRAG, and IterDRAG) on benchmark QA datasets.



