"Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference"

A podcast on this paper was generated with Google's Illuminate.

Memory-efficient RAG or fast RAG? This paper shows you can't have both.

This paper systematically analyzes performance trade-offs in RAG systems, revealing critical bottlenecks in latency, memory usage, and throughput at scale.

https://arxiv.org/abs/2412.11854

🛠️ Methods in this Paper:

→ The paper creates a detailed taxonomy of RAG systems examining retrieval algorithms, integration methods, and runtime parameters.

→ It builds an extensible framework using open-source components to evaluate trade-offs between latency, throughput, storage, and accuracy.

→ The study focuses on dense retrieval methods using HNSW and IVF indices, analyzing their performance across varied batch sizes and datastore scales (a minimal index-building sketch follows below).
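
For intuition, here is a minimal FAISS sketch of the two index families the study compares, assuming 768-dimensional embeddings; the corpus size, graph degree, nlist, nprobe, and efSearch values are placeholders for illustration, not the paper's configurations.

```python
# Illustrative sketch (not the paper's exact setup): the two dense index
# families compared in the study, built with FAISS. All parameters below
# are assumed placeholder values.
import numpy as np
import faiss

d = 768                                              # embedding dimension (assumed)
xb = np.random.rand(100_000, d).astype("float32")    # stand-in corpus embeddings
xq = np.random.rand(32, d).astype("float32")         # a batch of query embeddings

# Memory-intensive, high-recall side: HNSW graph over scalar-quantized vectors.
hnsw_sq = faiss.IndexHNSWSQ(d, faiss.ScalarQuantizer.QT_8bit, 32)  # 32 graph neighbors
hnsw_sq.train(xb)
hnsw_sq.add(xb)
hnsw_sq.hnsw.efSearch = 64          # larger -> better recall, higher latency

# Memory-efficient side: inverted lists plus product quantization (IVF-PQ).
nlist, m, nbits = 1024, 64, 8       # 64 sub-quantizers of 8 bits each
coarse = faiss.IndexFlatL2(d)
ivf_pq = faiss.IndexIVFPQ(coarse, d, nlist, m, nbits)
ivf_pq.train(xb)
ivf_pq.add(xb)
ivf_pq.nprobe = 16                  # lists probed per query; recall/latency knob

k = 5
for index in (hnsw_sq, ivf_pq):
    distances, ids = index.search(xq, k)   # batched top-k retrieval
```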

-----

💡 Key Insights:

→ RAG roughly doubles Time-To-First-Token (TTFT) latency, from 495 ms to 965 ms, compared to standard LLM inference

→ Memory-efficient retrieval needs 2.3x less DRAM but achieves only 0.65 recall vs 0.95 for memory-intensive methods

→ Datastore scaling from 1M to 100M chunks degrades throughput by 20x

→ Aggressive retrieval striding drives latency up to 30 seconds, making it impractical for production (see the striding sketch after this list)
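
To make the striding insight concrete, here is a minimal sketch of a decode loop with retrieval striding; retrieve and generate_tokens are hypothetical stubs, not the paper's framework, and only the control flow (one retrieval per stride of generated tokens) is the point.

```python
# Minimal sketch of retrieval striding (assumed control flow, not the paper's code).
# retrieve() and generate_tokens() are hypothetical stubs standing in for a
# vector-index query and an LLM decode step.

def retrieve(query: str, k: int = 5) -> str:
    # Placeholder for an HNSW/IVF index lookup; this is the call whose cost
    # multiplies as the stride shrinks.
    return "<retrieved context>\n"

def generate_tokens(prompt: str, n: int) -> str:
    # Placeholder for n decoding steps of an LLM.
    return " tok" * n

def generate_with_striding(prompt: str, max_new_tokens: int, stride: int) -> str:
    """Re-retrieve context every `stride` generated tokens."""
    output = prompt
    generated = 0
    while generated < max_new_tokens:
        context = retrieve(output)                      # one retrieval per stride
        n = min(stride, max_new_tokens - generated)
        output += generate_tokens(context + output, n)  # decode the next n tokens
        generated += n
    return output

# An aggressive (small) stride multiplies retrieval calls and prompt re-processing:
# stride=16 over a 512-token generation triggers 32 retrievals instead of 1.
```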

-----

📊 Results:

→ Retrieval accounts for 41% of end-to-end latency and 45-47% of TTFT latency

→ HNSW-SQ achieves 0.87 recall but requires 166GB storage

→ IVF-PQ uses only 23GB storage but drops to 0.61 recall

→ Billion-scale datastores demand terabyte-scale memory (a back-of-envelope estimate follows below)
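
For intuition on the terabyte figure, here is a back-of-envelope calculation; the 768-dimensional embeddings and 64-byte PQ codes are assumptions for illustration, not the paper's measured configurations, and HNSW graph / inverted-list overhead is ignored.

```python
# Illustrative storage arithmetic (assumed parameters, not the paper's measurements):
# bytes needed just for the vector payload under three common encodings.

def vector_bytes(num_chunks: int, dim: int = 768, bytes_per_dim: float = 4.0) -> float:
    """Raw vector storage, ignoring HNSW graph and IVF list overhead."""
    return num_chunks * dim * bytes_per_dim

for n in (1_000_000, 100_000_000, 1_000_000_000):
    flat = vector_bytes(n)                      # float32: 4 bytes per dimension
    sq8 = vector_bytes(n, bytes_per_dim=1.0)    # 8-bit scalar quantization
    pq = n * 64                                 # product quantization, 64 bytes/vector
    print(f"{n:>13,} chunks: float32 {flat/1e9:7.1f} GB | "
          f"SQ8 {sq8/1e9:6.1f} GB | PQ64 {pq/1e9:5.1f} GB")
```

Under these assumptions, one billion chunks already need roughly 3 TB for uncompressed float32 vectors alone, consistent with the terabyte-scale claim; product quantization pulls the footprint down to tens of gigabytes, at the recall cost noted above.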
