Memory-efficient RAG or fast RAG? This paper shows you can't have both.
It systematically analyzes performance trade-offs in RAG serving, revealing critical bottlenecks in latency, memory usage, and throughput at scale.
https://arxiv.org/abs/2412.11854
🛠️ Methods in this Paper:
→ The paper builds a detailed taxonomy of RAG systems, spanning retrieval algorithms, integration methods, and runtime parameters.
→ It builds an extensible framework using open-source components to evaluate trade-offs between latency, throughput, storage, and accuracy.
→ The study focuses on dense retrieval with HNSW and IVF indices, analyzing their performance across varied batch sizes and datastore scales.
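For a concrete feel of the two index families, here's a minimal sketch using FAISS on synthetic vectors. This is not the paper's harness; the dimension, index parameters, and dataset sizes are illustrative assumptions:

```python
import faiss
import numpy as np

d, n_db, n_q, k = 128, 100_000, 1_000, 10           # assumed sizes
rng = np.random.default_rng(0)
xb = rng.standard_normal((n_db, d)).astype("float32")
xq = rng.standard_normal((n_q, d)).astype("float32")

# Exact (flat) search gives the ground truth for recall@k.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# HNSW graph over 8-bit scalar-quantized vectors: memory-heavier, higher recall.
hnsw_sq = faiss.IndexHNSWSQ(d, faiss.ScalarQuantizer.QT_8bit, 32)
hnsw_sq.train(xb)                                    # trains the scalar quantizer
hnsw_sq.add(xb)

# IVF with product quantization: much smaller codes, lower recall.
coarse = faiss.IndexFlatL2(d)                        # coarse quantizer for IVF
ivf_pq = faiss.IndexIVFPQ(coarse, d, 1024, 16, 8)    # 1024 lists, 16x 8-bit codes
ivf_pq.train(xb)
ivf_pq.add(xb)
ivf_pq.nprobe = 16                                   # lists probed: recall/latency knob

def recall_at_k(index):
    _, ids = index.search(xq, k)
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(ids, gt)])

print("HNSW-SQ recall@10:", recall_at_k(hnsw_sq))
print("IVF-PQ  recall@10:", recall_at_k(ivf_pq))
```

Runtime parameters like `nprobe` and batch size are exactly the kind of knobs the study sweeps to trace the recall/latency/memory frontier.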
-----
💡 Key Insights:
→ RAG nearly doubles Time-To-First-Token (TTFT) latency versus standard LLM inference, from 495ms to 965ms
→ Memory-efficient retrieval uses 2.3x less DRAM but reaches only 0.65 recall, versus 0.95 for memory-intensive methods
→ Scaling the datastore from 1M to 100M chunks degrades throughput by 20x
→ Aggressive retrieval striding pushes latency to 30 seconds, making it impractical for production
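The striding number is easiest to see mechanically: re-querying the datastore every few tokens multiplies the retrieval cost. A toy sketch, assuming "striding" means re-retrieving every `stride` generated tokens mid-decode; `retrieve` and `generate_tokens` are stubs with made-up per-call costs, not the paper's system:

```python
import time

def retrieve(query: str) -> str:
    time.sleep(0.05)                     # assumed 50ms per dense-retrieval call
    return "context for: " + query[:32]

def generate_tokens(context: str, n: int) -> list[str]:
    time.sleep(0.005 * n)                # assumed 5ms per decoded token
    return ["tok"] * n

def generate_with_striding(prompt: str, max_tokens: int = 256, stride: int = 32) -> int:
    """Decode `stride` tokens at a time, re-retrieving between chunks."""
    out: list[str] = []
    context, calls = retrieve(prompt), 1         # initial retrieval (part of TTFT)
    while len(out) < max_tokens:
        out += generate_tokens(context, min(stride, max_tokens - len(out)))
        if len(out) < max_tokens:                # pay a round-trip mid-generation
            context, calls = retrieve(prompt + " ".join(out)), calls + 1
    return calls

for s in (256, 32, 4):
    t0 = time.perf_counter()
    calls = generate_with_striding("query", stride=s)
    print(f"stride={s:>3}: {calls:>2} retrieval calls, {time.perf_counter() - t0:.2f}s")
```

Retrieval calls grow as ceil(max_tokens / stride), so an aggressive stride turns one retrieval round-trip into dozens, which is how end-to-end latency balloons into the tens of seconds.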
-----
📊 Results:
→ Retrieval accounts for 41% of end-to-end latency and 45-47% of TTFT
→ HNSW-SQ achieves 0.87 recall but requires 166GB storage
→ IVF-PQ uses only 23GB storage but drops to 0.61 recall
→ Billion-scale datastores demand terabyte-scale memory
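The terabyte claim follows from simple encoding-size math. A back-of-envelope sketch, assuming 768-dim embeddings and counting vector codes only (HNSW graph links and IVF list overheads push real footprints higher, which is why measured numbers like the 166GB above exceed raw code size):

```python
DIM = 768                      # assumed embedding dimension
N = 1_000_000_000              # billion-scale datastore

sizes = {
    "float32 (flat)": N * DIM * 4,   # 4 bytes per dimension
    "SQ8 (HNSW-SQ)":  N * DIM * 1,   # 1 byte per dimension after 8-bit SQ
    "PQ, 64B codes":  N * 64,        # 64 one-byte PQ codes per vector
}
for name, nbytes in sizes.items():
    print(f"{name:>15}: {nbytes / 1e12:5.2f} TB")
```

Even with 8-bit scalar quantization you're at ~0.77TB of codes alone at a billion vectors; only aggressive PQ compression escapes terabyte territory, and that's exactly the regime where recall drops to ~0.6.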