Can these new long-context models improve RAG performance?
This paper finds that only a handful of the most recent state-of-the-art LLMs can maintain consistent accuracy at context lengths above 64k tokens.
https://arxiv.org/abs/2411.03538
Original Problem 🤔:
Retrieval Augmented Generation (RAG) improves LLM accuracy by supplementing prompts with retrieved external information. As LLMs support ever-longer context windows, understanding how RAG performance scales with context length becomes crucial.
-----
Solution in this Paper 🔍:
→ The study evaluates the RAG performance of 20 LLMs at context lengths ranging from 2,000 to 128,000 tokens (and up to 2 million tokens where supported).
→ It uses three domain-specific datasets: Databricks DocsQA, FinanceBench, and Natural Questions.
→ The researchers employ a standard RAG setup, embedding document chunks with OpenAI's text-embedding-3-large model and retrieving them from a FAISS vector store (see the sketch after this list).
→ They analyze failure patterns for selected models in long-context scenarios, using GPT-4o as a judge.
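A minimal sketch of the retrieval setup described above, not the authors' code: the embedding model (text-embedding-3-large), FAISS, and GPT-4o come from the post, while the chunking, the tokens-per-chunk estimate, the prompt wording, and helper names like embed and retrieve are illustrative assumptions.

```python
# Sketch: embed chunks with text-embedding-3-large, index them in FAISS,
# then fill the target context budget with as many top-ranked chunks as fit.
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Build the vector store over pre-chunked documents (placeholder corpus).
chunks = ["...document chunk 1...", "...document chunk 2..."]
chunk_vecs = embed(chunks)
faiss.normalize_L2(chunk_vecs)                 # cosine similarity via normalized inner product
index = faiss.IndexFlatIP(chunk_vecs.shape[1])
index.add(chunk_vecs)

def retrieve(question: str, context_budget_tokens: int, tokens_per_chunk: int = 500) -> list[str]:
    """Return the top-ranked chunks that roughly fit the context-length budget."""
    k = max(1, context_budget_tokens // tokens_per_chunk)
    q = embed([question])
    faiss.normalize_L2(q)
    _, ids = index.search(q, min(k, len(chunks)))
    return [chunks[i] for i in ids[0]]

# Vary context_budget_tokens (e.g. 2_000 ... 128_000) to mirror the paper's sweep.
question = "What is covered in the finance report?"
docs = retrieve(question, context_budget_tokens=16_000)
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Answer using only the context below.\n\n"
                          + "\n\n".join(docs)
                          + "\n\nQuestion: " + question}],
)
print(answer.choices[0].message.content)
```

The failure-pattern analysis would then feed each generated answer and the reference answer to GPT-4o acting as a judge; the exact judging prompt is not shown here.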
-----
Key Insights from this Paper 💡:
→ Only recent state-of-the-art LLMs show consistent RAG performance improvement with longer contexts up to 128k tokens.
→ Most models, especially open-source ones, decline in performance beyond 16k-32k tokens.
→ LLMs fail at long-context RAG in distinct ways, including refusals due to perceived copyright concerns and overly sensitive safety filters.
→ Running RAG with very long contexts is significantly more expensive per query than standard short-context retrieval from a vector database.
-----
Results 📊:
→ OpenAI's o1 models demonstrate superior performance, consistently improving up to 100k tokens.
→ Gemini 1.5 models maintain stable performance at 2 million tokens, though with lower overall accuracy.
→ Open-source models like Llama 3.1 405B show performance decline after 32k tokens.
→ Cost per query (128k tokens): GPT-4o ($0.32), o1-preview ($1.92), Claude 3.5 Sonnet ($0.384), Gemini 1.5 Pro ($0.167).
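A rough back-calculation of how those per-query figures scale with context length: cost is dominated by input tokens, so cost ≈ tokens / 1M × input rate. The per-1M-token rates below are implied by the dollar figures above (approximate list prices at the time) and are illustrative only.

```python
# Back-of-the-envelope cost-per-query estimate from context length and input pricing.
usd_per_million_input = {
    "gpt-4o": 2.50,          # $0.32 / 0.128M tokens
    "o1-preview": 15.00,     # $1.92 / 0.128M tokens
    "claude-3.5-sonnet": 3.00,
    "gemini-1.5-pro": 1.30,
}

def query_cost(model: str, context_tokens: int) -> float:
    return context_tokens / 1_000_000 * usd_per_million_input[model]

for model in usd_per_million_input:
    print(model, round(query_cost(model, 128_000), 3))  # e.g. gpt-4o -> 0.32
```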