Can these new long-context models improve RAG performance?
This paper finds that only a handful of the most recent state-of-the-art LLMs can maintain consistent accuracy at context lengths above 64k tokens.
https://arxiv.org/abs/2411.03538
Original Problem 🤔:
Retrieval Augmented Generation (RAG) improves LLM accuracy by supplementing prompts with retrieved external information. As LLMs support ever-longer context windows, understanding how RAG performance scales with context length becomes crucial.
-----
Solution in this Paper 🔍:
→ The study evaluates the RAG performance of 20 LLMs at context lengths ranging from 2,000 to 128,000 tokens (and up to 2 million tokens where supported).
→ It uses three domain-specific datasets: Databricks DocsQA, FinanceBench, and Natural Questions.
→ The researchers employ a standard RAG setup, embedding document chunks with OpenAI's text-embedding-3-large model and retrieving them from a FAISS vector store (see the sketch after this list).
→ They analyze failure patterns for selected models in long-context scenarios, using GPT-4o as a judge.
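A minimal sketch of the retrieval setup described above, not the authors' code: the embedding model (text-embedding-3-large), FAISS, and GPT-4o come from the post, while the chunking, the tokens-per-chunk estimate, the prompt wording, and helper names like embed and retrieve are illustrative assumptions.

```python
# Sketch: embed chunks with text-embedding-3-large, index them in FAISS,
# then fill the target context budget with as many top-ranked chunks as fit.
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Build the vector store over pre-chunked documents (placeholder corpus).
chunks = ["...document chunk 1...", "...document chunk 2..."]
chunk_vecs = embed(chunks)
faiss.normalize_L2(chunk_vecs)                 # cosine similarity via normalized inner product
index = faiss.IndexFlatIP(chunk_vecs.shape[1])
index.add(chunk_vecs)

def retrieve(question: str, context_budget_tokens: int, tokens_per_chunk: int = 500) -> list[str]:
    """Return the top-ranked chunks that roughly fit the context-length budget."""
    k = max(1, context_budget_tokens // tokens_per_chunk)
    q = embed([question])
    faiss.normalize_L2(q)
    _, ids = index.search(q, min(k, len(chunks)))
    return [chunks[i] for i in ids[0]]

# Vary context_budget_tokens (e.g. 2_000 ... 128_000) to mirror the paper's sweep.
question = "What is covered in the finance report?"
docs = retrieve(question, context_budget_tokens=16_000)
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Answer using only the context below.\n\n"
                          + "\n\n".join(docs)
                          + "\n\nQuestion: " + question}],
)
print(answer.choices[0].message.content)
```

The failure-pattern analysis would then feed each generated answer and the reference answer to GPT-4o acting as a judge; the exact judging prompt is not shown here.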
-----
Key Insights from this Paper 💡:
→ Only recent state-of-the-art LLMs show consistent RAG performance improvement with longer contexts up to 128k tokens.
→ Most models, especially open-source ones, decline in performance beyond 16k-32k tokens.
→ LLMs fail at long-context RAG in distinct ways, including refusals due to perceived copyright concerns and overly sensitive safety filters.
→ Running RAG with very long contexts is significantly more expensive per query than standard short-context retrieval from a vector database.
-----
Results 📊:
→ OpenAI's o1 models demonstrate superior performance, consistently improving up to 100k tokens.
→ Gemini 1.5 models maintain stable performance at 2 million tokens, though with lower overall accuracy.
→ Open-source models like Llama 3.1 405B show performance decline after 32k tokens.
→ Cost per query (128k tokens): GPT-4o ($0.32), o1-preview ($1.92), Claude 3.5 Sonnet ($0.384), Gemini 1.5 Pro ($0.167).
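A rough back-calculation of how those per-query figures scale with context length: cost is dominated by input tokens, so cost ≈ tokens / 1M × input rate. The per-1M-token rates below are implied by the dollar figures above (approximate list prices at the time) and are illustrative only.

```python
# Back-of-the-envelope cost-per-query estimate from context length and input pricing.
usd_per_million_input = {
    "gpt-4o": 2.50,          # $0.32 / 0.128M tokens
    "o1-preview": 15.00,     # $1.92 / 0.128M tokens
    "claude-3.5-sonnet": 3.00,
    "gemini-1.5-pro": 1.30,
}

def query_cost(model: str, context_tokens: int) -> float:
    return context_tokens / 1_000_000 * usd_per_million_input[model]

for model in usd_per_million_input:
    print(model, round(query_cost(model, 128_000), 3))  # e.g. gpt-4o -> 0.32
```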