"DeepRAG: Thinking to Retrieval Step by Step for LLMs"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.01142
The paper addresses factual hallucinations in Large Language Models (LLMs) and the challenges Retrieval-Augmented Generation (RAG) faces due to ineffective task decomposition and unnecessary retrieval, which introduce noisy information and degrade response quality.
This paper introduces DeepRAG to solve this problem. DeepRAG models retrieval-augmented reasoning as a Markov Decision Process. It iteratively breaks down queries and dynamically decides at each step whether to retrieve external knowledge or rely on the LLM's internal knowledge.
-----
📌 DeepRAG's Markov Decision Process formulation enables step-by-step adaptive retrieval. This dynamic approach contrasts with static methods by retrieving only when necessary at each reasoning stage, which improves efficiency.
📌 Binary tree search and Chain of Calibration calibrate the Large Language Model's awareness of its own knowledge boundaries. This self-awareness mechanism minimizes unnecessary retrievals and hallucinations by making better use of the model's parametric knowledge.
📌 DeepRAG improves answer accuracy by 21.99% over baselines while reducing retrieval cost. The method offers immediate gains in both performance and efficiency for existing Large Language Models on question answering tasks.
----------
Methods Explored in this Paper 🔧:
→ DeepRAG is presented as a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP).
→ It introduces a "retrieval narrative" that structures the retrieval flow, generating each subquery based on previously retrieved information.
→ "Atomic decisions" dynamically determine, for each subquery, whether to retrieve external knowledge or rely on the LLM's internal (parametric) knowledge (a minimal sketch of this decision loop follows the list).
→ A binary tree search method is employed to explore the different retrieval strategies for each subquery and their impact on the final answer (sketched after this list).
→ Imitation learning is used to train the model on the optimal reasoning process with minimal retrieval cost. This helps the model learn effective retrieval patterns.
→ Chain of Calibration further refines the model's understanding of its knowledge boundaries, enabling more accurate atomic decisions about when retrieval is necessary (a preference-pair sketch follows the list).
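The minimal sketch below illustrates the step-by-step reasoning loop described above: generate a subquery from the retrieval narrative, make an atomic decision to retrieve or answer from parametric knowledge, then answer the subquery. The function names (`llm_generate`, `retrieve_documents`) and prompt formats are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of DeepRAG-style step-by-step retrieval-augmented reasoning.
# `llm_generate(prompt) -> str` and `retrieve_documents(query) -> list[str]` are
# assumed interfaces supplied by the caller.

def deep_rag_answer(question, llm_generate, retrieve_documents, max_steps=8):
    """Iteratively decompose `question` into subqueries; at each step make an
    'atomic decision' to either retrieve external documents or answer the
    subquery from the model's parametric knowledge."""
    history = []  # list of (subquery, intermediate_answer) pairs

    for _ in range(max_steps):
        # 1. Retrieval narrative: generate the next subquery from the question
        #    and everything answered so far.
        subquery = llm_generate(
            f"Question: {question}\nHistory: {history}\n"
            "Next subquery (or 'TERMINATE' if the question can now be answered):"
        )
        if subquery.strip() == "TERMINATE":
            break

        # 2. Atomic decision: retrieve external knowledge or rely on memory.
        decision = llm_generate(
            f"Subquery: {subquery}\nCan you answer this directly from memory? (yes/no)"
        )
        if decision.strip().lower().startswith("no"):
            context = "\n".join(retrieve_documents(subquery))  # external knowledge
        else:
            context = ""                                       # parametric knowledge only

        # 3. Answer the subquery using the chosen knowledge source.
        answer = llm_generate(f"Context: {context}\nSubquery: {subquery}\nAnswer:")
        history.append((subquery, answer))

    # 4. Compose the final answer from the accumulated subquery answers.
    return llm_generate(f"Question: {question}\nHistory: {history}\nFinal answer:")
```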
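The binary tree search can be pictured as branching on the retrieve/no-retrieve decision at each subquery and keeping the correct trajectory with the fewest retrievals, which then serves as imitation-learning data. The sketch below simplifies by assuming a fixed list of subqueries (in the paper, later subqueries depend on earlier answers); `answer_subquery` and `final_answer` are hypothetical helpers.

```python
from itertools import product

def search_min_retrieval_path(question, subqueries, answer_subquery, final_answer,
                              gold_answer):
    """Enumerate all retrieve/no-retrieve choices over the subqueries (a binary tree
    of depth len(subqueries)) and return the cheapest correct trajectory, or None."""
    best = None
    for choices in product([False, True], repeat=len(subqueries)):  # False = parametric
        history = []
        for subquery, use_retrieval in zip(subqueries, choices):
            intermediate = answer_subquery(subquery, history, use_retrieval)
            history.append((subquery, intermediate, use_retrieval))
        prediction = final_answer(question, history)
        if prediction == gold_answer:        # trajectory reaches the correct answer
            cost = sum(choices)              # number of retrievals used
            if best is None or cost < best[0]:
                best = (cost, history)       # keep the minimal-retrieval trajectory
    return None if best is None else best[1]
```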
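For Chain of Calibration, one way to picture the data construction is as preference pairs over atomic decisions: at each subquery on the minimal-retrieval correct path, the decision actually taken is preferred over its opposite, and these pairs drive a preference-based calibration objective. The field names below are assumptions, not the paper's exact format.

```python
def build_calibration_pairs(optimal_path):
    """optimal_path: list of (subquery, answer, used_retrieval) tuples from the
    binary tree search. Returns preference pairs for calibrating atomic decisions."""
    pairs = []
    for subquery, _, used_retrieval in optimal_path:
        chosen = "retrieve" if used_retrieval else "answer directly"
        rejected = "answer directly" if used_retrieval else "retrieve"
        pairs.append({
            "prompt": f"Subquery: {subquery}\nAtomic decision:",
            "chosen": chosen,      # decision taken on the minimal-retrieval correct path
            "rejected": rejected,  # the opposite decision is dispreferred
        })
    return pairs
```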
-----
Key Insights 💡:
→ DeepRAG dynamically utilizes internal knowledge, leading to higher accuracy with fewer retrievals. This shows effective resource utilization.
→ Confidence-based methods such as FLARE and DRAGIN show limited robustness, with inconsistent performance across different datasets.
→ Iterative retrieval methods like Auto-RAG often require numerous retrieval operations, increasing computational cost.
→ DeepRAG makes more relevant retrieval decisions, indicating that it effectively identifies when retrieval is truly needed by exploring the LLM's knowledge boundary.
→ In some cases, relying solely on internal knowledge leads to poor performance. Conversely, relying only on retrieval is costly and may introduce irrelevant information. DeepRAG achieves better performance by adaptively choosing between internal and external knowledge.
→ DeepRAG effectively decomposes questions into subqueries. This minimizes redundant retrieval attempts and simplifies complex questions.
-----
Results 📊:
→ DeepRAG achieves a 21.99% improvement in answer accuracy compared to baseline methods, demonstrating enhanced effectiveness in retrieval-augmented reasoning.
→ DeepRAG-Imi achieves an average score of 44.60 in ablation studies on Imitation Learning, outperforming 'most' retrieval cost path selection (41.12) and 'random' path selection (40.56).
→ DeepRAG achieves an average score of 47.67 in ablation studies on Chain of Calibration, exceeding 'all-node' preference construction (45.30) and 'sentence-wise' preference construction (21.14).
→ On the 2WikiMultihopQA dataset, DeepRAG achieves an Exact Match (EM) score of 48.10 with an average of 1.09 retrievals, while DRAGIN (threshold 0.5) achieves an EM of 23.47 with 1.45 retrievals.
→ On the WebQuestions dataset, DeepRAG achieves an EM of 32.70 with an average of 0.28 retrievals, while DRAGIN (threshold 0.5) achieves an EM of 21.20 with 0.0 retrievals.