Sparse Retrieval (BM25 and Variants)
Dense Vector Retrieval
Hybrid Retrieval Approaches
Query Refinement and Multi-Hop Retrieval
LLM-Based Ranking and Retrieval Feedback
In the context of advanced search algorithms in LLMs, consider a scenario where a client has already built a RAG-based system that is not giving accurate results. Upon investigation, you find that the retrieval component is inaccurate. What steps would you take to improve it?
Retrieval-Augmented Generation (RAG) integrates external knowledge retrieval into large language model (LLM) outputs to reduce hallucinations and improve factuality (LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers). The reliability of RAG heavily depends on the accuracy of its retrieval component – if irrelevant documents are retrieved, even a strong LLM may generate incorrect answers (Corrective Retrieval Augmented Generation | OpenReview). Recent research (2024–2025) has focused on enhancing retrieval algorithms for RAG, ranging from classical sparse methods like BM25 to dense and hybrid retrieval, as well as novel techniques for query rewriting, multi-hop search, re-ranking, and self-correction. Below, we review these methods and key findings.
Sparse Retrieval (BM25 and Variants)
BM25 is a classical term-frequency based retrieval model that remains a strong baseline for RAG. It ranks documents by keyword overlap (with term frequency–inverse document frequency weighting) and often excels at finding passages containing the exact query terms. BM25 has been foundational in open-domain question answering and information retrieval (From Retrieval to Generation: Comparing Different Approaches). However, its purely lexical matching can miss semantically relevant documents that use different wording. As tasks grew more complex, the limitations of keyword-based retrieval spurred development of more semantic search techniques. Still, studies show BM25’s utility: for instance, in retrieval-augmented language modeling tasks, a BM25-based retriever achieved lower perplexity than some generative or hybrid approaches, highlighting the value of precise term matching. In practice, BM25 is often used in the first stage of retrieval for its speed and high recall on keyword matches, and it provides a strong baseline to compare against newer methods.
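For readers who want to experiment, below is a minimal sketch of first-stage BM25 retrieval using the open-source rank_bm25 package; the toy corpus and whitespace tokenization are illustrative placeholders rather than part of any system discussed above.

```python
# Minimal first-stage BM25 retrieval sketch using the rank_bm25 package
# (pip install rank-bm25). Corpus, query, and tokenization are toy placeholders.
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 ranks documents by weighted keyword overlap with the query.",
    "Dense retrievers encode text into embeddings for semantic search.",
    "Hybrid retrieval combines sparse and dense signals.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query_tokens = "keyword overlap ranking".lower().split()
scores = bm25.get_scores(query_tokens)              # one BM25 score per document
top_docs = bm25.get_top_n(query_tokens, corpus, n=2)
print(scores, top_docs)
```

In a production pipeline, this first-stage keyword search would typically feed a larger candidate pool into a semantic or LLM-based re-ranker.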
Dense Vector Retrieval
Dense retrieval uses neural networks (often transformers) to encode queries and documents into high-dimensional vectors, enabling retrieval via nearest-neighbor search in embedding space. Models like Dense Passage Retrieval (DPR) (which uses dual BERT encoders for questions and passages) and Contriever (an unsupervised contrastive model) encode semantic meaning rather than exact words (From Retrieval to Generation: Comparing Different Approaches). Dense methods can retrieve relevant texts that don’t share obvious keywords with the query, addressing the semantic gap left by sparse methods. For example, dense retrievers have shown strong accuracy in open-domain QA: DPR achieved about 50.2% top-1 accuracy on Natural Questions, substantially better than BM25 on that benchmark. Despite these gains, dense retrieval introduces new challenges. Documents must be pre-encoded and stored as vectors, raising scalability concerns for very large corpora. Also, the dual-encoder architecture encodes query and document independently, which can limit fine-grained matching of details. Researchers have improved dense retrieval through better training (e.g. hard negative mining, contrastive learning) and model architectures. Overall, dense retrieval provides the semantic depth that sparse methods lack, and is a core component of modern RAG pipelines.
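As an illustration of the dual-encoder idea, the sketch below embeds a query and a few documents with the sentence-transformers library and ranks them by cosine similarity; the model checkpoint is an arbitrary choice, not one prescribed by the works cited above.

```python
# Dense (bi-encoder) retrieval sketch using sentence-transformers
# (pip install sentence-transformers). The checkpoint is an arbitrary example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "The Eiffel Tower is located in Paris.",
    "BM25 is a lexical ranking function based on term statistics.",
]
doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

query = "Where is the Eiffel Tower?"
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_emb)[0]    # cosine similarity to each document
best = int(scores.argmax())
print(docs[best], float(scores[best]))
```

At corpus scale, the document embeddings would be computed offline and served from an approximate nearest-neighbor index (e.g., FAISS) rather than compared exhaustively.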
Hybrid Retrieval Approaches
Because sparse and dense methods have complementary strengths, hybrid retrieval techniques combine them to boost coverage and accuracy. Different retrievers excel at different information: sparse retrievers (like BM25) are precise with explicit keywords (e.g. named entities), while dense retrievers handle paraphrased or fuzzy queries by capturing broader semantic context (LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers). By querying both types (and sometimes additional sources like web search), hybrid systems can find more relevant results than either alone. Indeed, research shows significant improvements when using hybrid search. For example, on the BEIR benchmark of diverse information retrieval tasks, a hybrid approach improved nDCG@10 from 43.4 with BM25 alone to 52.6, demonstrating much stronger retrieval effectiveness (From Retrieval to Generation: Comparing Different Approaches). Prior works have used BM25 alongside dense retrievers to expand the search scope and ensure important keywords aren’t missed. The challenge in hybrid retrieval is merging results from multiple systems and avoiding overwhelming the model with duplicates or less relevant info. Some solutions include learning to ensemble retrievers – e.g. the Ensemble of Retrievers (EoR) approach optimizes how to combine and rank outputs from several retrievers. Overall, hybrid retrieval leverages the precise matching of sparse indexes and the semantic recall of dense models to maximize relevant knowledge intake for RAG.
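One simple and widely used way to merge the ranked lists from a sparse and a dense retriever is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that fusion step, not the EoR method mentioned above; the document IDs are placeholders.

```python
# Generic reciprocal rank fusion (RRF): documents ranked highly by several
# retrievers accumulate a larger fused score. k=60 is a common default constant.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: iterable of doc-id lists, each ordered best-first."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # placeholder ids from a keyword index
dense_hits = ["d1", "d4", "d3"]   # placeholder ids from a vector index
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # d1 and d3 rise to the top
```

Rank-based fusion like this sidesteps the need to calibrate BM25 scores against cosine similarities, which live on different scales.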
Query Refinement and Multi-Hop Retrieval
Improving retrieval accuracy often starts with improving the queries themselves. User queries can be ambiguous, overly broad, or in the case of multi-hop questions, too complex to answer with one retrieval. Query rewriting and decomposition techniques have emerged to address this. For example, RQ-RAG (2024) explicitly trains an LLM to refine the input query by rewriting it, breaking it into sub-questions, or adding clarifications (RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation). By equipping the model with these capabilities, RQ-RAG improved retrieval for both single-hop and multi-hop QA – yielding about a 1.9% higher accuracy than previous state-of-the-art on several QA datasets. Similarly, other works use the LLM to expand queries with additional context. The BlendFilter framework blends the original query with relevant information generated by the LLM itself (and possibly other sources), effectively performing query expansion in a smart way (BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering - ACL Anthology). This “query generation blending” incorporates synonyms, related facts, or rephrasings to ensure comprehensive coverage in the retrieval step. BlendFilter couples this with a knowledge filtering step (discussed later) and was shown to significantly surpass prior baselines on open-domain QA benchmarks.
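To make the idea concrete, the sketch below prompts an LLM to rewrite a question and emit sub-questions; it is a generic illustration of query refinement, not the RQ-RAG training recipe or the BlendFilter pipeline, and call_llm is a placeholder for whatever chat-completion client is in use.

```python
# Hypothetical LLM-driven query refinement: rewrite an ambiguous question and
# split it into retrievable sub-questions. `call_llm` is a placeholder callable
# (prompt in, text out), not a real library API.
def refine_query(call_llm, question: str) -> list[str]:
    prompt = (
        "Rewrite the question below to be clear and self-contained, then list "
        "the minimal sub-questions needed to answer it, one per line.\n\n"
        f"Question: {question}\nSub-questions:"
    )
    response = call_llm(prompt)
    return [line.strip("-• ").strip() for line in response.splitlines() if line.strip()]

# Each refined sub-question would then be issued to the retriever(s) separately,
# and the retrieved passages pooled before generation.
```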
For multi-hop questions (where answering requires multiple pieces of evidence), advanced RAG systems plan a sequence of searches. One approach is using a high-level module to decompose a complex question into simpler queries whose answers can be combined. LevelRAG (2025) is an example that introduced a high-level “planner” which breaks down a multi-hop query and dispatches these sub-queries to specialized low-level searchers (one sparse, one dense, one web) (LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers). Each retriever can then focus on what it does best (e.g. keyword search for a specific entity, semantic search for a description) and the results are merged. This hierarchical strategy improved both the completeness (finding all needed facts) and accuracy of the retrieval process in experiments, outperforming prior RAG baselines on several multi-hop QA tasks. Overall, query refinement – whether through LLM-guided rephrasing or breaking queries into parts – has proven effective at guiding retrieval systems to more relevant information.
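A rough sketch of such a decompose-and-dispatch loop is shown below; it is loosely inspired by the hierarchical idea rather than a reproduction of LevelRAG, and every callable is a placeholder to be supplied by the reader.

```python
# Loose decompose-and-dispatch sketch: a planner LLM splits the question into
# sub-queries, each sub-query goes to several specialized searchers
# (sparse, dense, web, ...), and the fused evidence is pooled. All objects and
# functions here are placeholders, not LevelRAG's published components.
def multi_hop_retrieve(question, plan_sub_queries, searchers, fuse):
    evidence = []
    for sub_q in plan_sub_queries(question):             # e.g. the refine_query sketch above
        ranked_lists = [s.search(sub_q) for s in searchers]
        evidence.extend(fuse(ranked_lists))               # e.g. reciprocal_rank_fusion
    return evidence
```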
LLM-Based Ranking and Retrieval Feedback
Beyond retrieving the right candidates, another line of research is improving how we rank and filter retrieved information before generation. Traditional pipelines often use a second-stage re-ranker (like a cross-encoder or a smaller LLM) to sort the retrieved passages by relevance. In 2024, researchers introduced methods to incorporate ranking directly into the LLM’s workflow. Notably, RankRAG instruction-tunes a single LLM to both rank contexts and generate answers (RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | OpenReview). During training, a small fraction of ranking data is blended into the LLM’s fine-tuning, enabling it to evaluate which retrieved snippets are most useful. This approach outperformed dedicated ranking models (even ones trained on much larger ranking datasets) and led to better answer accuracy. In fact, RankRAG’s models (built on LLaMA-3) outscored strong baselines like the ChatQA-1.5 RAG model across nine knowledge-intensive benchmarks, and even approached GPT-4 level performance on certain domain-specific QA tasks. This demonstrates that an LLM can internally learn relevance estimation, simplifying the pipeline and improving end-to-end results.
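By way of contrast, a conventional prompt-based re-ranker (the kind of separate second-stage component that RankRAG folds into the generator) might look like the sketch below; the scoring prompt and call_llm placeholder are assumptions, not RankRAG's instruction-tuning setup.

```python
# Generic LLM-as-re-ranker sketch: score each retrieved passage for relevance,
# then keep the top-k. `call_llm` is a placeholder callable (prompt in, text out).
def score_passage(call_llm, question: str, passage: str) -> float:
    prompt = (
        "On a scale of 0 to 10, how useful is the passage for answering the "
        f"question? Reply with a number only.\nQuestion: {question}\nPassage: {passage}\nScore:"
    )
    try:
        return float(call_llm(prompt).strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # treat unparsable replies as irrelevant

def rerank(call_llm, question, passages, top_k=5):
    ordered = sorted(passages, key=lambda p: score_passage(call_llm, question, p), reverse=True)
    return ordered[:top_k]
```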
Another advancement is using the LLM to filter and self-correct the retrieved knowledge. One challenge in RAG is that the initial retrieval may return some irrelevant or low-quality documents (“knowledge noise”) that could mislead the generator. BlendFilter’s second component addresses this by having the LLM itself filter out extraneous retrieved data, leveraging its understanding to discard what isn’t helpful (BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering - ACL Anthology). This ensures the final context given to the generator is cleaner and more relevant. Along similar lines, the CRAG (Corrective RAG) framework introduces a feedback loop to handle retrieval failures (Corrective Retrieval Augmented Generation | OpenReview). CRAG uses a lightweight retrieval evaluator to judge the overall quality of retrieved documents for a query. If the confidence in the retrieval is low (indicating potential missing or incorrect evidence), CRAG can trigger alternative actions – for example, falling back to a large-scale web search to find additional information beyond the local knowledge base. It also applies a decompose-then-recompose step on retrieved texts, meaning it parses retrieved passages to focus on key facts and filters out irrelevant details. These self-corrective mechanisms significantly improved robustness: experiments showed that plugging CRAG into various RAG models yielded notable performance gains on both short-form and long-form generation tasks. Essentially, the model learns to detect when “retrieval went wrong” and fix it before final answer generation.
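A heavily simplified version of such a corrective loop is sketched below; the evaluator, web-search, and filtering functions are placeholders standing in for CRAG's actual components, and the confidence threshold is arbitrary.

```python
# Simplified corrective-retrieval sketch (not CRAG's published implementation):
# if a retrieval-quality evaluator reports low confidence, broaden the evidence
# with web search, then filter the pooled passages before generation.
LOW_CONFIDENCE = 0.3  # illustrative threshold

def corrective_retrieve(query, retrieve, evaluate_retrieval, web_search, filter_passages):
    passages = retrieve(query)
    confidence = evaluate_retrieval(query, passages)    # placeholder evaluator
    if confidence < LOW_CONFIDENCE:
        passages = passages + web_search(query)         # fall back to broader search
    return filter_passages(query, passages)             # keep only key, relevant facts
```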
Finally, ensemble and feedback techniques can be combined. Some systems iterate between generation and retrieval: an initial answer or reasoning step from the LLM is used to formulate a better follow-up query, and the newly retrieved information is then used to refine the answer. This iterative retrieve-generate cycle continues until the model is confident. Such iterative retrieval-generation synergy has been shown to enhance answer accuracy in complex scenarios (by gradually correcting errors and filling knowledge gaps) (LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers). Although iterative approaches may increase latency, they underscore an important trend: advanced RAG systems increasingly treat retrieval as a dynamic, learnable component of the LLM’s reasoning process, rather than a static one-shot operation.
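Schematically, one round of that cycle might look like the sketch below, with every callable left as a placeholder for the reader's own retriever, generator, and confidence check.

```python
# Iterative retrieve-then-generate sketch: the draft answer is folded into a
# follow-up query until a confidence check passes or a round limit is reached.
# All callables are placeholders for the reader's own components.
def iterative_rag(question, retrieve, generate, is_confident, max_rounds=3):
    context = retrieve(question)
    answer = generate(question, context)
    for _ in range(max_rounds - 1):
        if is_confident(answer):
            break
        follow_up = f"{question} {answer}"        # naive follow-up query formulation
        context = context + retrieve(follow_up)   # accumulate additional evidence
        answer = generate(question, context)
    return answer
```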
In summary, improving retrieval accuracy is crucial for effective RAG. Classic BM25 provides a strong starting point, dense retrievers add semantic power, and hybrid methods leverage the best of both. Building on these, recent research (2024–2025) has introduced smarter query processing, multi-hop planning, integrated ranking, and self-correction strategies to push retrieval performance to new heights. These advancements collectively help RAG systems fetch more relevant knowledge, which in turn enables LLMs to generate more accurate and grounded responses (From Retrieval to Generation: Comparing Different Approaches). As RAG continues to mature, we expect ongoing innovation in retrieval algorithms – from better neural retrievers to even more clever LLM-driven search tactics – to further enhance the fidelity and reliability of knowledge-augmented generation.
Sources:
Abdallah et al. (2025). From Retrieval to Generation: Comparing Different Approaches.
Zhang et al. (2025). LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers.
Chan et al. (2024). RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation.
Wang et al. (2024). BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering. ACL Anthology.
Yu et al. (2024). RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs. OpenReview.
Yan et al. (2024). Corrective Retrieval Augmented Generation (CRAG). OpenReview.