Table of Contents
Vector Search Metrics in LLM Document Retrieval: Limitations and Recent Advances
Introduction
Limitations of Vector Similarity in Chunked Document Retrieval
Fragmented Context and Lost Connections
Semantic Similarity vs. Multi-Hop Reasoning
Coreference and Ambiguity in Queries
Factual Consistency and Support Verification
Emerging Approaches to Mitigate Retrieval Failures (2024–2025)
Graph-Based and Multi-Hop Retrieval Strategies
Enhanced Embeddings and Hybrid Retrieval
Coreference-Aware Retrieval and Context Integration
Ensuring Relevance and Factual Consistency
Benchmarks and Datasets for Evaluation (2024–2025)
Introduction
Large Language Models (LLMs) often use vector search to retrieve relevant text chunks from a document corpus. In this paradigm (commonly known as Retrieval-Augmented Generation, RAG), documents are split into chunks, each chunk is embedded into a high-dimensional vector, and similarity metrics like cosine similarity or Euclidean distance are used to find which chunks are most relevant to a given query (QuOTE: Question-Oriented Text Embeddings). This approach enables LLMs to handle queries beyond their parametric knowledge by providing external context. However, relying on simple embedding similarity metrics leads to notable failures and limitations in LLM-driven retrieval tasks, especially as we push into long documents and complex queries. Below, we review these limitations—ranging from handling of chunked documents and long contexts to multi-hop reasoning, coreference resolution, and factual consistency. We then discuss emerging solutions (2024–2025) designed to overcome these issues, and summarize the benchmarks that researchers use to evaluate progress in this area.
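To make the paradigm concrete, here is a minimal sketch of chunk retrieval by cosine similarity. It assumes the sentence-transformers library and an example model name; any embedding model and vector index could be substituted.

```python
# Minimal sketch of dense chunk retrieval with cosine similarity.
# Assumes the sentence-transformers package; the model name is an example.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The Example Novel won the 2022 award for best novel.",
    "Jane Doe has written several acclaimed books since 2010.",
    "The award ceremony took place in London last year.",
]

# Normalized embeddings make cosine similarity a plain dot product.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode(["Who wrote the 2022 best-novel winner?"],
                         normalize_embeddings=True)[0]

scores = chunk_vecs @ query_vec
for idx in np.argsort(-scores)[:2]:
    print(f"{scores[idx]:.3f}  {chunks[idx]}")
```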
Limitations of Vector Similarity in Chunked Document Retrieval
Fragmented Context and Lost Connections
When documents are broken into chunks for embedding, contextual relationships across chunks are lost in vector space. A cosine similarity search treats each chunk independently, so it may miss relevant information that is split across multiple chunks or only evident when reading the document as a whole. Researchers have noted that non-structured RAG (retrieval without additional structure) “fails to capture the logical relations between user queries and passages” (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation). In practical terms, if a query’s answer is spread across two or more chunks, a naive vector search might retrieve one piece but not the other, because no single chunk contains the full answer. This is exacerbated by the fact that LLMs (and their embeddings) can struggle with very long contexts – important details in the middle of a long text may be de-emphasized or missed (the “lost-in-the-middle” effect (HERE)). The result is that document connections get “lost in vector space,” to borrow a phrase: the retrieval may return individually similar chunks while missing the broader context linking them.
Another manifestation is the difficulty with succinct or specific queries. If a user query is very short or refers to a tiny detail, the relevant document chunk might not share enough overlapping terms or semantics with the query to stand out by cosine similarity. As one study put it, the “naive RAG approach can fail to capture the intent behind user queries, especially when queries are succinct (e.g., entity lookups) or require extracting specific details from a chunk.” (QuOTE: Question-Oriented Text Embeddings) In such cases, a relevant chunk could be overlooked because its embedding isn’t an obvious match to the query embedding. Essentially, simple vector metrics have no understanding of the importance of a detail in a chunk relative to a query – they operate on coarse semantic similarity, which may not correspond to relevance for fine-grained questions.
Moreover, the choice of chunk size and overlap creates a tension: large chunks preserve more context (helping similarity metrics catch context-dependent meaning) but dilute the embedding with unrelated content; small chunks focus on specific content but lose surrounding cues needed to interpret a query. There is no easy fix via cosine distance alone. Without additional handling, chunking can lead to scenarios where relevant information is split between chunks such that neither chunk’s vector is similar enough to the query – a limitation of the metric’s inability to compose information from multiple pieces. In summary, embeddings of isolated chunks cannot fully represent the narrative or logical flow of a long document, so cosine similarity often falls short in capturing long-range context and connections.
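As an illustration of the trade-off, the toy chunker below splits on fixed-size word windows with overlap; production pipelines typically split on sentences or section boundaries instead, but the size/overlap tension is the same.

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows (toy illustration)."""
    words = text.split()
    step = max(1, chunk_size - overlap)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Larger chunk_size keeps more context per embedding but dilutes it;
# smaller chunk_size sharpens each embedding but severs cross-references.
doc = "Project X faced several problems. " * 50
print(len(chunk_text(doc, chunk_size=100, overlap=20)))
```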
Semantic Similarity vs. Multi-Hop Reasoning
Standard vector retrieval excels at finding single passages closely related to a query, but it fails in scenarios that require multi-hop reasoning or combining information from different sources. Cosine similarity is a blunt instrument: it will retrieve chunks that individually look similar to the query, but complex questions often need multiple pieces of evidence. If a question involves reasoning across two or more facts (each possibly in different chunks), a one-shot embedding search tends to retrieve one relevant chunk and miss the others, or retrieve some pieces that are only tangentially related. A recent analysis by Liu et al. (2024) reveals this “imperfect retrieval” phenomenon quantitatively: even with advanced dense retrievers, the highest recall on multi-hop QA tasks saturates around 45%, meaning over half of the needed supporting passages were not retrieved (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation). They found that a large proportion of retrieved passages were indirectly relevant – related to the query but not containing the answer – while many truly needed passages were missing. In essence, the retriever (using cosine similarity) pulls in some pieces of the puzzle but not all, because it has no notion of needing to cover different aspects of a multi-faceted question.
This limitation is evident on benchmarks like HotpotQA and 2WikiMultiHopQA, where answering a question requires two hops. A one-round retrieval (single vector query) often fails on such tasks. As Zhuang et al. (2024) observe, one-round RAG works for questions that “clearly state all the needed information in the input query”, i.e. single-hop questions, but “could fail in complex questions where more information is required beyond the first-round retrieved information, e.g., multi-hop questions.” (HERE) The first retrieval may get one relevant chunk (related to one part of the question) but not the next hop. For example, a question might ask: “Find the author of the book that won the 2022 award for best novel.” The query implicitly requires two pieces: (1) identify which book won the 2022 best novel award, and (2) find that book’s author. A cosine-similarity search on the whole query might retrieve a chunk about the award winner (the book) or a chunk about the author of some book, but without special handling, whether both relevant chunks land in the top-k is largely a matter of chance. Often it will retrieve multiple chunks about the award (since those all seem semantically similar to the query) and miss the author info, or vice versa. Indeed, Tang & Yang (2024) reported a “significant gap in retrieving relevant evidence for multi-hop queries” even with strong embedding models (HERE). Using a variety of embeddings plus a reranker, they achieved only ~74.7% Hits@10 and ~66.3% Hits@4 on their multi-hop dataset, highlighting the difficulty of getting all needed pieces with naive similarity alone. A Hits@4 in the 60s is especially problematic because an LLM typically can only take a limited number of chunks as context (perhaps 4–10); if key evidence isn’t in those, the model’s answer will be incomplete or incorrect.
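These figures correspond to standard retrieval metrics, sketched below: Hits@k asks whether at least one gold passage appears in the top k, while evidence recall@k asks what fraction of all gold passages was retrieved, the stricter requirement for multi-hop questions.

```python
def hits_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """1.0 if any gold passage id appears in the top-k retrieved ids."""
    return float(any(doc_id in gold for doc_id in retrieved[:k]))

def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold passage ids found in the top-k retrieved ids."""
    return len(gold & set(retrieved[:k])) / len(gold)

retrieved = ["p7", "p2", "p9", "p4"]    # ranked ids from the retriever
gold = {"p2", "p11"}                    # both hops needed for the question
print(hits_at_k(retrieved, gold, 4))    # 1.0 -- one hop was found
print(recall_at_k(retrieved, gold, 4))  # 0.5 -- the second hop is missing
```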
Low recall for multi-hop retrieval directly causes incomplete or incorrect LLM responses. If the model doesn’t see the necessary fact in its context, it might either try to infer it (risking a hallucination) or simply not mention it. As Liu et al. (2024) note, missing passages lead to “inaccurate or incomplete LLM responses”, especially in multi-document QA (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation). Empirically, when comparing LLM performance with gold evidence vs. retrieved evidence, the gap is stark. For instance, on a multi-hop QA benchmark, GPT-4 could achieve 89% accuracy when given all the ground-truth supporting passages, but only ~56% accuracy with the top-k retrieved chunks. The vector retriever’s failure to fetch all relevant information was identified as the primary cause of GPT-4’s drop in performance. In other words, cosine similarity might find chunks that are related to the query, but not necessarily the exact clues needed to answer it. Multi-hop reasoning requires capturing relationships (sometimes loose or implicit) between pieces of text – something beyond the capabilities of static embeddings of individual chunks. Without a mechanism for multi-step retrieval or reasoning-aware matching, purely similarity-based search will struggle, returning at best one hop and leaving the LLM to guess the rest.
Coreference and Ambiguity in Queries
Vector search metrics also struggle with queries or contexts that involve coreference or require understanding in-document references. Cosine similarity works on the surface semantics of the query and chunks. If the query is ambiguous or contains pronouns referring to earlier context, an embedding-based search may fail to retrieve the right chunk because the query alone is incomplete. A classic scenario is a follow-up question like “What did he do next?” (where “he” refers to a person mentioned in previous text). If we treat this as an isolated query for the vector database, the embedding of “What did he do next?” is almost meaningless – it might retrieve some random chunk containing a male pronoun or simply the most generic “he” references. The retrieval fails because the query doesn’t contain the disambiguating context (who “he” is). Even within a single long document, a query might ask about information that is described across different sections: e.g., “Was the company eventually acquitted?” where “the company” was named paragraphs earlier. Unless the chunk embedding somehow carries that coreference link (which it usually doesn’t), the cosine similarity won’t match the pronoun “the company” in the query to the chunk that contains the actual company name and its outcome in the text.
Coreference resolution is a known challenge for long-context understanding. Liu et al. (2025) point out that “multiple entities and coreference relations in long contexts” make it difficult for even large models to effectively utilize the information (HERE). In the context of retrieval, this means an LLM might fail to link together pieces of text split into different chunks: one chunk mentions a person or object by name, another later chunk uses “he/she/it” – each chunk alone is only partially informative. Cosine similarity won’t inherently connect those, since the embeddings are typically derived from local context only. The result can be that retrieved chunks lack disambiguated meaning, leading to confusion or irrelevant results. From the retriever’s perspective, a query with a pronoun is ambiguous (many chunks might have “he”); from the LLM’s perspective, if the correct chunk wasn’t retrieved, it has no chance to resolve the reference and answer correctly.
Even when the query itself is clear, coreferences within the documents can pose issues. For example, a query might ask “According to the report, what problems did the project face?” Suppose the document has a section that says “Project X faced several problems…” and then later refers to “these problems” or “they” across subsequent paragraphs. If chunking cuts these sections apart, one chunk might list the problems and another chunk might contain the phrase “these problems” with additional commentary. A naive similarity search on the query could retrieve the latter chunk (because it has words like “project” and “problems” in context, maybe matching the query embedding), but that chunk by itself might not enumerate the problems – it references them indirectly. Without the first chunk that actually defines the problems, the LLM could be left with incomplete information, potentially yielding an answer that is not fully grounded or skips details. This is essentially a context fragmentation issue amplified by coreference: the vector metric has no way to “know” that one chunk refers to information in another.
Recent research confirms that dealing with coreference can significantly improve retrieval and QA in long contexts. For instance, one approach applied coreference resolution across long documents and found that it “enhances the quality of context, reducing information loss and contextual ambiguity”, thereby allowing the LLM to answer more effectively (HERE). In other words, by explicitly resolving who/what each pronoun or reference refers to, the system mitigates a major weakness of the embedding metric – it no longer has to guess the semantics of an ambiguous chunk or query. Without such intervention, however, cosine similarity on raw text is blind to these nuances. It treats “the company” and “Acme Corp” as different tokens with different vector representations, even if they refer to the same entity; and it has no built-in way to connect a pronoun to the noun it stands for. This limitation means that semantic search can fail on queries requiring coreference resolution or understanding of context-dependent terms. The retrieval might return topically related chunks that nonetheless don’t resolve who or what the query is about, leaving the heavy lifting to the LLM (which may or may not manage, depending on whether the needed chunk was retrieved at all).
Factual Consistency and Support Verification
By design, cosine similarity does not account for truth or consistency – it only measures semantic closeness. This can lead to issues with factual consistency in LLM retrieval settings. The retriever might pull in chunks that are on-topic but not factually aligned with the correct answer, or it might include redundant information that confuses the model. For example, if asking about a specific statistic or outcome, a vector search could retrieve multiple related descriptions, some of which contain outdated or contradictory figures (especially if the corpus isn’t internally consistent). The LLM then has the non-trivial task of reconciling conflicting info or deciding which chunk to trust, something it might do poorly without explicit guidance.
A known failure mode is when the retriever brings back text that is generally relevant to the query but does not actually answer the question or contains a different answer. Since the retriever optimizes for similarity, it doesn’t distinguish whether a chunk contains the correct answer. It might retrieve a chunk where the queried entity is mentioned, but not in the needed context (e.g., a query asks for an event in 2021, and the retriever returns a chunk about a similar event in 2019 because the query keywords matched). If the LLM uses that chunk, the answer could be factually off. In multi-hop cases, as discussed, missing one of the required facts can cause the LLM to either guess or give an incomplete answer. Chen et al. (2024) and others have emphasized that these retrieval omissions or errors are a major source of factual errors in the final LLM output (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation). Essentially, a similarity-based retriever has no concept of “supporting evidence” beyond topical relevance, so it can introduce irrelevant or half-relevant context that the LLM might mistakenly treat as evidence.
Moreover, the retriever might pick up on spurious correlations in embeddings. For instance, if asking “Did X win the Y award in 2023?”, an embedding model may heavily weight the presence of “X” and “Y award” in text. It could retrieve a chunk stating that X won the Y award in 2021 (because that chunk is similar in content), which is factually not the answer for 2023. The LLM might see “X” and “Y award” and incorrectly assume it has the answer (if it fails to notice the date), thus producing a factually inconsistent response. This is a failure of the similarity metric to ensure that retrieved content actually satisfies the query conditions – it only ensured the content is broadly about the same entities/topics.
The issue of conflicting information is also pertinent. If multiple top chunks are about the query but contain different details, the LLM’s answer may become a mishmash or it may choose one arbitrarily. Cosine similarity won’t, for example, assign a lower score to a chunk that contradicts another; it treats each chunk independently. Without advanced reranking or filtering, it’s possible for an LLM to be given both correct and incorrect pieces of information as context. Ensuring factual consistency would require the system to recognize contradiction or irrelevance, which basic vector retrieval does not do. Indeed, new benchmarks for Factual Consistency Evaluation (FCE) in RAG systems (e.g., Face4RAG in 2024) have been developed to categorize such errors (Factuality in LLMs: Key Metrics and Improvement Strategies - Turing). They show that factual consistency errors can stem from the model using unsupported statements or getting confused by multiple references. Many of these errors trace back to the retriever returning content that is related to the query but does not directly answer it.
One telling result: when high-quality retrieval is present, LLM answers improve dramatically. We already noted the gap between GPT-4’s 56% accuracy with retrieved chunks vs. 89% with gold evidence (HERE). The authors explicitly state “this is expected, because the retrieval component falls short in retrieving relevant evidence” – highlighting that missing or irrelevant chunks dragged the accuracy down. This gap illustrates how crucial correct retrieval is for factual correctness: a top-tier model knew the answer when given the right info, but the vector search’s limitations meant it often didn’t see the right info. Another study found that even when an LLM is augmented with retrieval, it sometimes still hallucinates or gives wrong answers if the retrieved docs don’t fully support the query. In those cases, the model either misinterpreted partial context or filled in from its own training, leading to inconsistency with the sources. The root cause is that cosine similarity doesn’t guarantee a perfect match to the needed evidence, and without further checks, the pipeline can end up using content that only partially aligns with the facts in question.
In summary, vanilla vector metrics are agnostic to truth and logical consistency. They can retrieve thematically relevant text that might not directly answer the question or might introduce inaccuracies. This limitation means additional measures are needed to ensure the retrieved context truly supports the query – a simple distance threshold or top-k approach by itself isn’t enough to ensure factual correctness in LLM outputs.
Emerging Approaches to Mitigate Retrieval Failures (2024–2025)
Graph-Based and Multi-Hop Retrieval Strategies
One direction augments the retriever with a notion of logical or multi-hop reasoning. Instead of treating each chunk in isolation, chunks are linked via inferred relationships, and the retrieval process itself can perform multiple hops. HopRAG (Liu et al., 2024) is a prime example: it constructs a passage graph where nodes are text chunks and edges represent logical connections between chunks (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation). These connections are established by using an LLM to generate “pseudo-queries” that link one chunk to another (for instance, a question generated from one chunk whose answer might be found in another chunk). At query time, HopRAG doesn’t just do a single similarity search. It starts with a standard retrieval to get some initial passages, then enters a retrieve–reason–prune loop. In this loop, it explores neighbors of the initially retrieved passages on the graph (essentially performing a controlled multi-hop search) and uses LLM-based reasoning to decide which of those neighbors are truly relevant to the query. Irrelevant passages get pruned, and relevant ones are added to the set. This process allows the system to find logically relevant passages that a pure semantic similarity search would miss, effectively following a chain of reasoning across the graph. HopRAG showed impressive gains: in their evaluation, it achieved “76.78% higher answer accuracy” and over “65% improved retrieval F1” compared to conventional one-shot retrieval on multi-hop QA. This underscores that introducing a reasoning mechanism can dramatically reduce the multi-hop retrieval failure rate.
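A skeletal version of such a retrieve–reason–prune loop is sketched below. This is not HopRAG's implementation; `dense_retrieve`, `neighbors`, and `llm_judges_relevant` are placeholder callables standing in for a vector index, a pre-built passage graph, and an LLM relevance check.

```python
def hop_retrieve(query, dense_retrieve, neighbors, llm_judges_relevant,
                 max_hops=2, budget=8):
    """Skeletal retrieve-reason-prune loop over a passage graph (sketch only)."""
    frontier = dense_retrieve(query, k=4)        # hop 0: plain similarity search
    kept = list(frontier)
    for _ in range(max_hops):
        # Expand: collect graph neighbors of everything retrieved so far.
        candidates = {p for node in frontier for p in neighbors(node)} - set(kept)
        # Reason/prune: keep only neighbors the LLM judges relevant to the query.
        frontier = [p for p in candidates if llm_judges_relevant(query, p)]
        kept.extend(frontier)
        if not frontier or len(kept) >= budget:
            break
    return kept[:budget]
```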
Another approach is iterative retrieval with query reformulation. Instead of embedding the original question once, these methods break the query into sub-queries or use the answer from the first retrieval to form a new query. For example, a system might first retrieve passages about “the book that won the 2022 award” (from our earlier example), find that it’s “The Example Novel”, then formulate a new query “Who is the author of The Example Novel?” to retrieve the second piece. Traditionally, this has been done with the aid of the LLM (the LLM generates the sub-query after reading initial results). However, that can be costly if it requires multiple LLM calls. In 2024, Zhuang et al. introduced EfficientRAG, which performs multi-round retrieval without needing a full LLM call at each step (HERE). EfficientRAG has a lightweight query generator that produces new search queries based on what’s been retrieved so far (and a filter to remove irrelevant info on the fly). By doing this iteratively, it gathers the necessary evidence pieces step by step. This approach was shown to outperform standard one-round RAG on several open-domain multi-hop QA datasets, confirming that a guided multi-hop retrieval strategy can locate information that cosine similarity alone would likely miss. The key idea is to simulate a human researcher: find something, use that to find the next thing, and so on, rather than hoping the one-shot similarity will retrieve everything in one go.
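The loop below sketches this iterative pattern in the spirit of EfficientRAG (it is not the paper's code). `retrieve`, `is_helpful`, and `next_query` are assumed helpers: a vector search, a lightweight relevance filter, and a small query generator that returns None once enough evidence has been gathered.

```python
def iterative_retrieve(question, retrieve, is_helpful, next_query, max_rounds=3):
    """Multi-round retrieval with query reformulation (illustrative sketch)."""
    evidence, query = [], question
    for _ in range(max_rounds):
        # Keep only chunks the filter considers helpful for the original question.
        evidence += [c for c in retrieve(query, k=4) if is_helpful(question, c)]
        # Reformulate, e.g. "Who is the author of The Example Novel?"
        query = next_query(question, evidence)
        if query is None:                 # generator signals that we have enough
            break
    return evidence
```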
Related to this are techniques like query decomposition (explicitly splitting a complex query into simpler parts) and LLM-based agents for retrieval. An example of the former: given a question, an algorithm (or an LLM prompt) first generates two sub-questions corresponding to the two hops needed, retrieves answers for each, and then combines them. This was explored in earlier research (e.g., for HotpotQA), but in 2024 it’s become a widely utilized approach in pipelines and tools like LlamaIndex (HERE). LlamaIndex, for instance, can take a compound query and break it down, retrieving documents for each part before synthesizing an answer. By doing so, it bypasses the limitation of a single similarity search. Similarly, LLM-based retrieval agents (sometimes dubbed “ReAct” or “chain-of-thought” agents) have been proposed, which allow the LLM to iteratively issue search queries, read results, and decide on next steps (as an autonomous agent would). AutoGPT (Gravitas, 2023) was an early example mentioned in the literature – it can plan multi-hop web searches – and that idea is being adapted to document retrieval as well. These agents use the LLM’s reasoning ability to traverse a knowledge base in a more flexible way than static embeddings, aiming to cover all relevant ground.
In essence, all these strategies introduce a form of “logic-aware” retrieval that goes beyond pure semantic similarity (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation). They address the core problem that we identified: cosine similarity lacks a notion of multi-step logical relevance. By incorporating either a graph structure, iterative logic, or an agent that can handle multiple queries, the retrieval process can recover information that was previously falling through the cracks. Early results (HopRAG, EfficientRAG, etc.) demonstrate far fewer failures on complex queries, validating that these multi-hop approaches can dramatically reduce the instances where an LLM says “I don’t have enough info” or, worse, hallucinates an answer due to missing context.
Enhanced Embeddings and Hybrid Retrieval
Another avenue of improvement is making the similarity function itself smarter, either by obtaining better embeddings or by combining multiple metrics. If the cosine metric were operating on embeddings that truly captured all nuances of relevance, its performance would naturally improve. In 2024, there have been attempts to train or adapt embedding models specifically for LLM retrieval tasks. For example, LLM2Vec (BehnamGhader et al., 2024) is a technique that turns a pre-trained decoder-only LLM into a bi-directional encoder to produce powerful text embeddings (HERE). By using methods like masked token prediction and contrastive learning, LLM2Vec unlocks the rich knowledge of LLMs for encoding tasks. The resulting embeddings achieved state-of-the-art results on standard retrieval benchmarks, meaning they carry more semantic and factual detail that could help in document retrieval. While LLM2Vec wasn’t aimed at multi-hop per se, such improved embeddings can make cosine similarity more effective: chunks that are logically related might end up closer in this new embedding space than they would under a generic embedding. Essentially, if the vector representations are better aligned to the kinds of relevance LLMs care about, the simple metrics (cosine/Euclidean) will inherently work better.
A more direct approach to improving embeddings for retrieval is to augment the content of the chunks before embedding. One clever idea along these lines is generating synthetic queries or FAQs for each chunk during indexing. The previously mentioned QuOTE method (Neeser et al., 2024) does exactly this: for each chunk of text, generate multiple hypothetical questions that the chunk can answer, and embed those question–chunk pairs into the vector store (QuOTE: Question-Oriented Text Embeddings). At query time, the real user query is more likely to match one of these stored question embeddings if the chunk is truly relevant. This addresses ambiguity and phrasing mismatches – a chunk might not literally contain the same wording as the query, but one of its generated questions might. By enriching chunk representations with semantic “questions it answers,” QuOTE reported substantial gains in retrieval accuracy on diverse benchmarks, including multi-hop QA. In essence, it’s an embedding-level way to inject some context understanding: the cosine similarity is no longer just between query and raw chunk text, but between query and a question-aware embedding of the chunk, which is a closer alignment to the user’s intent. This reduces failure cases where the query and relevant passage were semantically related but phrased differently – a common cause of missed retrievals.
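A sketch of this question-augmented indexing idea is shown below (in the spirit of QuOTE, not the authors' code). `embed` and `generate_questions` are assumed helpers: an embedding model returning normalized vectors and an LLM prompt that produces a few hypothetical questions per chunk.

```python
import numpy as np

def build_question_index(chunks, generate_questions, embed):
    """Index each chunk once per hypothetical question it can answer."""
    entries = []                                  # (question_vector, chunk_text)
    for chunk in chunks:
        for question in generate_questions(chunk):
            entries.append((embed(question), chunk))
    return entries

def search(query, entries, embed, k=4):
    """Match the query against question embeddings, return unique chunks."""
    query_vec = embed(query)
    ranked = sorted(entries, key=lambda e: -float(np.dot(e[0], query_vec)))
    seen, results = set(), []
    for _, chunk in ranked:
        if chunk not in seen:
            seen.add(chunk)
            results.append(chunk)
        if len(results) == k:
            break
    return results
```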
Another line of defense is hybrid retrieval, which combines dense vector similarity with traditional sparse (lexical) similarity. Sparse retrievers like BM25 consider exact term overlap, which can complement embedding methods (which consider semantic similarity). In 2024 it’s become common to use hybrid retrieval in LLM systems (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation). For example, a system might retrieve top-k results from a vector index and also top-k from a keyword search, then merge or rerank them. The motivation is that what embedding distance misses, keyword matching might catch, and vice versa. If a query has a very distinctive keyword (say a proper noun or rare terminology), lexical search ensures that any document containing that term is considered, even if the embedding model didn’t emphasize it. Conversely, if the query is conceptual, the vector search brings in things that don’t share exact words. Sawarkar et al. (2024) showed that combining sparse + dense gives better coverage. Returning to our failure analysis: coreference (“he” vs. the name) and certain factual details may be handled better when exact tokens are also considered. A hybrid approach can thus mitigate some chunking issues – e.g., it might retrieve the chunk mentioning “John Smith” by keyword match while also retrieving the chunk with “he” via semantic match. Together, the LLM then sees both and can resolve the reference.
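One common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF), sketched below; the cited systems may combine scores differently, so treat the fusion rule as an illustrative assumption rather than their method.

```python
def reciprocal_rank_fusion(dense_ranked, sparse_ranked, k=60):
    """Merge two ranked lists of chunk ids by summing 1/(k + rank) scores."""
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c3", "c7", "c1"]    # from the vector index
sparse = ["c7", "c9", "c3"]   # from BM25 / keyword search
print(reciprocal_rank_fusion(dense, sparse))  # chunks found by both rise to the top
```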
Beyond retrieval, another increasingly popular step is re-ranking: using a more powerful model to rescore the candidate chunks. This effectively replaces the plain cosine metric with a learned metric that better correlates with actual relevance. A typical pipeline might use cosine similarity to get, say, 50 candidates (fast), then use a cross-attention model (like a mini-BERT that takes the query and chunk text together) to rerank the 50 and pick the best 5 to feed the LLM. In 2024, various reranker models (e.g., BGE-Reranker by Xiao et al., 2023) have been employed in LLM retrieval experiments (HERE). Tang & Yang (2024) used a reranker and found it improved the relevance of top hits (Hits@10 from ~63% to ~75% in their multi-hop dataset). However, even with reranking, the performance on multi-hop was far from perfect, indicating that while it helps, it doesn’t solve the fundamental multi-hop recall issue. Nonetheless, rerankers greatly aid in precision – i.e., ensuring that the chunks given to the LLM are truly the most relevant ones. A cross-encoder can understand the query and chunk in context and, for instance, down-rank a chunk that only has a superficial keyword overlap but no actual answer. This addresses some factual consistency issues: the reranker might prefer chunks that answer the question (if it’s trained on QA data). Indeed, research by Ma et al. (2023) and others in 2024 has shown that learned retrieval models can capture nuances like answer presence, which cosine similarity alone ignores.
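A retrieve-then-rerank step might look like the sketch below, assuming the sentence-transformers CrossEncoder class and an example public reranker checkpoint; any cross-encoder (e.g., a BGE reranker) could be dropped in.

```python
from sentence_transformers import CrossEncoder

# Example checkpoint; swap in whichever cross-encoder/reranker you use.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Rescore query-chunk pairs jointly and keep the top_n chunks."""
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: -pair[0])
    return [chunk for _, chunk in ranked[:top_n]]
```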
In summary, improving the embedding space or the retrieval scoring is an important strategy. Whether by training better encoders (LLM2Vec), adding semantic cues (generated questions), using multiple retrieval signals (dense + sparse), or applying smarter scoring (rerankers), these methods all try to compensate for the overly simplistic nature of cosine similarity. By doing so, they reduce failures such as missing relevant passages due to wording differences or retrieving irrelevant ones due to topical but not exact match. While these don’t fully solve multi-hop reasoning by themselves, they create a stronger foundation: the initial candidates are more likely to contain the needed info, giving any subsequent reasoning or the LLM itself a better shot at succeeding.
Coreference-Aware Retrieval and Context Integration
To tackle the coreference and context fragmentation problem, researchers have started building coreference awareness into the retrieval pipeline. One straightforward yet effective idea is to pre-process documents (or queries) with a coreference resolution system before embedding them. By resolving pronouns and replacing them with the actual entity names, the chunks become more self-contained and easier to match with a query. In a long document, this might mean that every time “the CEO” or “he” is mentioned, it’s replaced with the person’s name; “the company” is replaced with “Acme Corp”, etc. This way, if a question asks “What is the CEO’s salary?” the chunk that contains the salary figure might explicitly mention “John Doe’s salary” after resolution, making it much more likely to be retrieved by an embedding search for “CEO’s salary” (since “CEO” gets resolved to “John Doe” or vice versa in both query and doc).
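The toy preprocessing step below illustrates the idea with a hand-written alias table; a real pipeline would obtain these mappings from a coreference-resolution or entity-linking model rather than a hard-coded dictionary.

```python
import re

# Hypothetical alias map that a coreference / entity-linking model would produce.
ALIASES = {
    r"\bthe company\b": "Acme Corp",
    r"\bthe CEO\b": "John Doe",
    r"\bhe\b": "John Doe",
}

def resolve(text: str) -> str:
    """Replace pronouns and aliases with canonical entity names before embedding."""
    for pattern, canonical in ALIASES.items():
        text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text

print(resolve("The CEO said he expects the company to settle the case."))
# -> "John Doe said John Doe expects Acme Corp to settle the case."
```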
A 2025 study by Liu et al. proposed a method called LQCA (Long Question Coreference Adaptation) that does exactly this for long documents (HERE). It involves four steps: segmenting a long context, resolving coreferences within each segment, linking them to ensure consistency across the whole document, and then using the cleaned context to answer questions. By “merging coreference information to improve text quality”, they effectively create chunks that preserve referential clarity. The result was a measurable performance boost – for example, on certain QA tasks, GPT-4’s accuracy went up a few percentage points when LQCA preprocessing was applied. This indicates fewer retrieval/understanding failures due to ambiguous references. With coreferences resolved, the vector similarity has an easier job: a query and a chunk that talk about the same thing will more likely use the same explicit terms, hence their embeddings will be closer. One can view this as adding a layer of coreference reasoning outside the model, rather than expecting the model or embedding to handle it implicitly.
Another technique is to expand queries with context from previous turns or related entities. In multi-turn dialogues, systems now often concatenate the conversation or at least the needed parts of previous queries when constructing the vector search query. That way, something like “What did he do next?” becomes “What did John Doe do next in [context X]?” internally before embedding. This again gives the similarity metric more to work with. Some retrieval frameworks incorporate anaphora resolution or entity tracking modules to maintain a set of known entities from prior text and automatically add them to queries. This prevents many obvious failures where the vector search would otherwise be flying blind with an incomplete query.
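A minimal version of this query expansion is sketched below; entity tracking here is just a list maintained by the caller, whereas a real system would use NER or an LLM over the conversation history.

```python
def expand_query(follow_up: str, tracked_entities: list[str]) -> str:
    """Make a follow-up question self-contained before embedding it."""
    context = "; ".join(tracked_entities)
    return f"{follow_up} (context: {context})" if context else follow_up

history_entities = ["John Doe", "Acme Corp acquisition"]
print(expand_query("What did he do next?", history_entities))
# -> "What did he do next? (context: John Doe; Acme Corp acquisition)"
```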
Beyond queries, ensuring that related chunks are grouped or retrieved together can help maintain context. One idea is chunk clustering: if you know two chunks from a document are tightly linked by references (say one introduces an entity, another elaborates on it), you could store an index such that retrieving one chunk also brings its “buddy” chunk along. There is ongoing exploration in 2024 of using embeddings not just for individual chunks but for sets of chunks or hierarchical retrieval. For instance, Sarthi et al. (2024) looked at tree-structured retrieval within documents (almost like retrieving a section that contains sub-sections) (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation). While their approach was more about hierarchy than coreference specifically, it shares the goal of preserving intra-document relationships that flat chunk retrieval ignores. A drawback noted is that purely tree-based methods can introduce redundancy and don’t handle cross-document links, but they are steps toward retrieval that is aware of more than one chunk at a time.
In summary, coreference resolution and context linking techniques serve to glue back together the pieces that chunking breaks apart. By doing so, they reduce cases where a query doesn’t match a relevant chunk due to wording discrepancies. These techniques, when applied, directly address some failure modes of cosine similarity: after resolution, the embeddings of query and answer chunk become more aligned. The LLM also benefits at generation time, because it sees a more coherent context (e.g., fewer “this” or “he” with unclear antecedents). While this doesn’t introduce new knowledge, it amplifies retrieval recall and precision by ensuring that important referents are explicit. As a result, it mitigates those frustrating instances where the system had the information in the corpus but just didn’t realize two pieces were connected.
Ensuring Relevance and Factual Consistency
To address factual consistency and reduce the inclusion of irrelevant information, researchers have been adding verification and filtering layers on top of retrieval. One simple but effective idea is to use the LLM (or another model) to double-check whether a retrieved chunk is actually likely to contain the answer to the query, before using it. If not, it can be dropped or replaced with another candidate. This acts as a sanity check beyond raw similarity score. For instance, after retrieving top-k chunks by cosine similarity, one could prompt an LLM with something like: “Does this text contain information that answers the question X? (yes/no)” for each chunk. If the model (or a classifier) says “no” for a chunk, that chunk is removed from context. This kind of retrieval filtering helps in scenarios where the vector search returns something broadly related but not actually useful for the question. It directly tackles the earlier issue of the LLM being given distracting or non-answer-bearing text.
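A sketch of this filtering step is given below. The `llm` argument is a placeholder callable that maps a prompt string to a text response; plug in whichever chat or completion API you use.

```python
def filter_chunks(question: str, chunks: list[str], llm) -> list[str]:
    """Drop retrieved chunks that an LLM judges unlikely to answer the question."""
    kept = []
    for chunk in chunks:
        prompt = (
            "Does the following text contain information that answers the question?\n"
            f"Question: {question}\n"
            f"Text: {chunk}\n"
            "Answer yes or no."
        )
        if llm(prompt).strip().lower().startswith("yes"):
            kept.append(chunk)
    return kept
```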
EfficientRAG, mentioned earlier, includes a component that labels whether retrieved chunks are “helpful or not” for the query (HERE). It then only keeps the helpful ones and uses them to formulate the next query, iteratively. By filtering out chunks deemed irrelevant at each step, it prevents the retrieval process from veering off-topic or accumulating noise. This is crucial because an LLM’s context window is precious – filling it with irrelevant passages not only wastes space but can degrade the model’s focus on the relevant bits (LLMs might attend to the irrelevant info and get confused). Zhuang et al. showed that this filtering improved answer accuracy, implying the LLM was getting a cleaner, more on-point context to work with.
On the factual consistency front, an interesting development is the idea of “retrieve then verify”. After the LLM produces an answer, systems can perform a secondary vector search (or use an external API) on that answer to see if it’s supported by any document. If not, the system might flag the answer as potentially unsupported. Some 2024 works suggest having the LLM justify each answer sentence with a source (and then checking those sources). When inconsistency is detected, one approach is to make the LLM abstain rather than hallucinate. For example, a strategy described by Lin et al. (2024) is to have multiple LLMs collaborate or cross-check each other, effectively asking the model to reconsider if an answer cannot be grounded in retrieved docs (The challenges in using LLM-as-a-Judge - Sourabh Agrawal | Vector Space Talks - Qdrant). While this veers into answer generation evaluation, it is tightly coupled with retrieval quality: it introduces a feedback loop where if the retrieved info wasn’t enough or was contradictory, the system notices and either tries a different approach or refuses to answer (better to abstain than be confidently wrong).
Another set of techniques involves post-processing the retrieved set for consistency. For example, if two retrieved chunks appear to give different answers (say one says “Profit was $1M” and another says “Profit was $2M” for the same year), a system could flag that and either query more to resolve the discrepancy or present both to the user. In 2024, some systems aimed at high factual reliability have included multiple passes: one to retrieve evidence, one to let the LLM draft an answer, another to retrieve again based on that draft (to verify each claim), and then a final pass to correct any unsupported claims. This is sometimes called “generate-then-ground” (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval ... - arXiv). Such workflows are complex, but they are direct responses to the observation that pure similarity metrics may not fetch exactly the needed supporting facts at first try.
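A hedged sketch of such a retrieve, draft, verify loop is shown below. `retrieve`, `draft_answer`, `split_claims`, and `supports` are assumed helpers; the last two would typically be LLM or NLI calls, and the caller decides whether to regenerate, retrieve more, or abstain when unsupported claims remain.

```python
def generate_then_ground(question, retrieve, draft_answer, split_claims, supports):
    """Draft an answer, then re-retrieve per claim and flag unsupported ones."""
    context = retrieve(question, k=6)
    draft = draft_answer(question, context)
    unsupported = []
    for claim in split_claims(draft):
        evidence = retrieve(claim, k=3)
        if not any(supports(chunk, claim) for chunk in evidence):
            unsupported.append(claim)
    return draft, unsupported
```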
It’s also worth noting that evaluation benchmarks are pushing for these checks. The Face4RAG benchmark (Xu et al., 2024) we mentioned defines categories of factual errors, such as “unsupported inference” or “contradiction” (Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese). By evaluating RAG systems on these fine-grained errors, researchers have identified patterns (e.g., models might be good at factual precision but weak at identifying missing info). These insights feed back into retrieval improvements: for instance, if a model often makes up a detail because the detail wasn’t retrieved, one solution is to incorporate a targeted retrieval for that detail (like an intermediate question asking specifically for it). We see in Tang & Yang’s (2024) MultiHop-RAG work that they restricted certain answer formats to avoid free-form hallucination and ensured the system knew when to say “I don’t know” if something wasn’t found (QuOTE: Question-Oriented Text Embeddings). All these are measures to compensate for retrieval not being 100% perfect – acknowledging that rather than letting the model spout an error, we either fix the retrieval or guard the output.
In practical systems (like those built on GPT-4 with retrieval plugins, etc.), a combination of the above is used. For example, Microsoft’s Bing Chat (an LLM with retrieval) is known to use a booster search query if the first doesn’t find a high-confidence answer, and it will refuse to answer certain queries if it’s not confident in the sources. These are essentially the product-grade versions of the mitigation strategies researchers are exploring: multi-try retrieval, source validation, and graceful handling of uncertainty.
In summary, to overcome the limitations of cosine similarity in semantic search, the community is moving towards retrieval that is integrated with reasoning and validation. By doing multi-hop search, enriching embeddings, resolving references, and verifying supports, we address the root causes of retrieval failures. The simple cosine similarity and Euclidean distance aren’t thrown away – they are often still the backbone – but they are augmented with layers of intelligence that compensate for what they miss. The result is a marked reduction in cases where the LLM “fails” due to retrieval: fewer missed relevant chunks, fewer irrelevant distractors, and more robust, factually consistent use of retrieved information.
Benchmarks and Datasets for Evaluation (2024–2025)
Researchers rely on a variety of benchmarks to evaluate how well these metrics and methods are working for LLM document retrieval. Here we highlight some of the most commonly used datasets (in 2024–2025) that test the limits of vector search metrics in chunked retrieval scenarios:
HotpotQA (Wikipedia multi-hop QA): Originally introduced in 2018, HotpotQA remains a go-to benchmark for multi-hop retrieval and reasoning. It contains questions that explicitly require finding two supporting Wikipedia paragraphs and reasoning across them. HotpotQA is frequently used to evaluate whether a retriever+LLM system can handle multi-hop, since it tests retrieval recall (you need both paragraphs) and the ability to do comparative reasoning. Many 2024 studies (e.g., on EfficientRAG and HopRAG) report results on HotpotQA to demonstrate improvements over a baseline retriever (HERE). The difficulty for vector search here is high – a system must fetch multiple related chunks from a huge corpus, so it’s an excellent stress test for retrieval methods.
2WikiMultiHopQA and MuSiQue: These are more recent multi-hop QA datasets, similar in spirit to HotpotQA. 2WikiMultiHopQA (Ho et al., 2020) involves questions that require information from two Wikipedia articles (often with one hop being a bridge entity). MuSiQue (Trivedi et al., 2022) contains questions broken down into multiple subquestions (up to 4 hops) to answer the main query. Both have been used in 2024 research to assess multi-hop retrieval performance. For instance, Liu et al. (2024) used 2WikiMultiHop and MuSiQue to measure the recall of their dense retriever, finding that recall plateaued at ~0.45 on these tasks without advanced methods (HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation). These datasets are challenging because they require high recall: missing one supporting fact means failing the question. They help quantify how often cosine-similarity retrieval alone drops needed info, and how much new methods improve that.
MultiHop-RAG dataset (Tang & Yang, 2024): This is a benchmark introduced in 2024 specifically for evaluating retrieval-augmented generation on multi-hop queries. It consists of 2,555 queries, each requiring retrieval from multiple documents (QuOTE: Question-Oriented Text Embeddings). What makes MultiHop-RAG valuable is that it was curated to reflect realistic multi-hop information needs (e.g., a query might require piecing together facts from different sources, not just Wikipedia). Each question in MultiHop-RAG comes with a list of supporting fact sentences across different sources, ensuring that any system tackling it must perform cross-document retrieval and synthesis. In Tang & Yang’s paper, they use this dataset to benchmark various embedding models and retrieval strategies (HERE). It has quickly become a reference point in 2024 for testing improvements in multi-hop retrieval. For example, QuOTE’s evaluation included MultiHop-RAG to show its question-generation indexing improved multi-hop retrieval success. MultiHop-RAG’s focus on cross-document reasoning makes it especially relevant for testing chunked retrieval: it’s a direct measure of how well a system can overcome the limitations of embeddings when multiple pieces of text must be combined.
Natural Questions (NQ) and TriviaQA: These are classic open-domain QA datasets. NQ (Google, 2019) features real user questions and expects answers from Wikipedia; TriviaQA contains trivia questions and answers with supporting evidence. While they are mostly single-hop, they are still widely used in 2024 as benchmarks for retrieval quality in open-domain settings. For instance, they are included in evaluations of improved retrievers like QuOTE to ensure that any specialization for multi-hop doesn’t hurt the ability to fetch single-hop facts. NQ in particular is used to report top-k retrieval accuracy (e.g., top-5 or top-20 recall) for new embedding models or hybrid search setups. These datasets help validate that a new method isn’t overfitting to multi-hop scenarios but also handles straightforward queries efficiently. They also often serve as a baseline: if a method can’t match the strong retrieval accuracy on NQ achieved by, say, BM25 or DPR (dense passage retriever), then it’s likely not viable. So, they continue to be “sanity check” benchmarks.
L-Eval (Long Context Evaluation Benchmark): Introduced at ACL 2024 (L-Eval: Instituting Standardized Evaluation for Long Context...), L-Eval is a comprehensive benchmark of 20 tasks designed to test long-context understanding by LLMs. It includes QA tasks, but also summarization, reading comprehension, and others, all with long inputs (ranging from thousands to tens of thousands of tokens). For retrieval, L-Eval is relevant because some of its tasks involve multi-document retrieval and QA as well as cross-document summarization. One notable aspect of L-Eval is that it tries to standardize evaluation of Long-Context Language Models by covering multiple domains and task types. In the context of vector search metrics, L-Eval provides a way to see if improvements in retrieval actually help LLMs utilize long contexts better. For example, a system might be evaluated on an L-Eval task where it has to answer questions based on a set of 5 related documents (totaling, say, 30k tokens). Without good retrieval, an LLM with a 32k context window might still flounder because it can’t pick out the right info. L-Eval’s results have shown that many LLMs struggle as context length increases, indicating room for retrieval augmentation (HERE). Researchers use it to demonstrate that their method (e.g. a new long-context handling or retrieval scheme) actually yields better performance on realistic tasks, not just synthetic setups.
LongBench and Loong: LongBench (Bai et al., 2023) is another benchmark focusing on long-context tasks, notable for being bilingual and multi-task. It has tasks like long document QA, where models must find specific information buried in very lengthy text. LongBench was used in 2024 to evaluate methods like InfiniteScroll (for extending context length) and to validate that retrieval methods scale to long inputs. Loong (Wang et al., 2024) is a new multi-document QA benchmark nicknamed “Leave No Document Behind” (Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA). It was created to avoid the common practice of padding long-context benchmarks with irrelevant text; instead, in Loong every document is relevant to the query and if you ignore any one, you’ll likely answer incorrectly. Loong includes different task types like comparison and chain-of-reasoning across documents, with various context lengths. Early experiments on Loong showed that retrieval-augmented methods performed poorly, revealing that current LLM+retrieval systems aren’t fully solving the long-context QA problem. Loong is quickly gaining attention as a benchmark that is closer to real-world use: a model might have 10 documents, all useful, and it needs to synthesize an answer. For vector search metrics, this is a trial by fire: your retriever can’t just find some relevant docs – it needs to surface all of them (since all matter). It really tests the recall and prioritization of retrieval. If an approach (like HopRAG or iterative retrieval) claims to improve multi-hop reasoning, Loong is an excellent validation set to prove it, because a simple cosine similarity baseline would likely miss pieces and score low.
Factual Consistency and “Hallucination” Benchmarks: Besides tasks that directly test retrieval, there are benchmarks that evaluate the faithfulness of LLM responses to provided context. For example, the TruthfulQA benchmark (Lin et al., updated 2023) tests whether models produce verifiably true statements, though it’s not specific to retrieval. In a retrieval context, we have things like QAGen or RARR which generate questions answerable by given text and see if the model’s answer matches the text. However, a more targeted one is RefEval/RefChecker (2023) which checks if an answer is supported by a reference document. Building on such ideas, Face4RAG (2024) constructed a synthetic dataset to evaluate nine types of factual consistency errors in RAG outputs (Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese). While not a retrieval benchmark per se, it’s used to validate metrics and methods aimed at ensuring factual consistency. For instance, a new retrieval scoring function that penalizes inconsistency could be tested on Face4RAG to see if it reduces errors like “extrinsic hallucinations” (model introduces info not in sources). Additionally, there are emerging challenge sets where an LLM must decide when it doesn’t have enough info. “Don’t Hallucinate, Abstain” (2024) was a theme in some works which introduced datasets for models to either answer from provided docs or say “I don’t know” if the docs don’t contain the answer. These evaluation sets implicitly test retrieval: if the retriever didn’t fetch the needed doc, the model should ideally abstain. Measuring how often it abstains correctly vs. hallucinates can indicate retrieval effectiveness.
RAG-specific Evaluation Suites: As RAG systems matured, evaluation toolkits like RAGAS and ARES (2023) were introduced, which include multiple metrics (precision of retrieval, diversity of sources, answer faithfulness, etc.) to holistically assess a RAG pipeline. By 2024, these are sometimes used alongside traditional QA accuracy to analyze performance. For example, one can use such a suite on a set of QA pairs to measure how many relevant documents were retrieved (a retrieval metric) and how factually supported the generated answers are (a generation metric). If an improvement in vector retrieval metrics (like using a new embedding) doesn’t translate to improved answer support, these tools will reveal that. The MultiHop-RAG paper notes that prior RAG evals often focused on the generation quality without specifically addressing retrieval accuracy (HERE). Their work and others are shifting focus back to retrieval quality via datasets like we discussed and using these evaluation frameworks to ensure that better chunk retrieval = better LLM performance.
In conclusion, the period 2024–2025 has seen an expansion of benchmarks that stress-test retrieval in the context of LLMs. From multi-hop QA datasets (HotpotQA, 2Wiki, MuSiQue, MultiHop-RAG) that highlight the recall issues of cosine similarity, to long-context benchmarks (L-Eval, LongBench, Loong) that reveal how hard it is to scale retrieval to huge inputs, to factual consistency tests (Face4RAG, etc.) that emphasize the importance of retrieving the right information – these benchmarks collectively drive research toward more robust solutions. When a new vector search method or hybrid approach comes out, these are the datasets it’s often measured against. The consistent finding is that naive metrics plateau on these benchmarks, whereas approaches that incorporate reasoning, better embeddings, or other enhancements make measurable dents in the problem. By using a diverse set of evaluation tasks, researchers can ensure that progress in vector-based retrieval for LLMs isn’t just overfitting to one scenario, but truly addressing the fundamental limitations we’ve discussed in this review.
References:
(QuOTE: Question-Oriented Text Embeddings) Neeser et al., 2024. QuOTE: Question-Oriented Text Embeddings. (Describes how naive RAG with cosine similarity fails for succinct queries and specific details; proposes augmenting chunk embeddings with generated questions to improve retrieval.)
(HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation) Liu et al., 2024. HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation. (Shows that traditional dense retrievers focusing on semantic similarity often retrieve indirectly relevant passages or miss needed passages, leading to incomplete LLM answers; multi-hop QA recall was capped ~45%. Introduces a graph-based retrieval that greatly improves multi-hop retrieval F1 and answer accuracy.)
(HERE) Tang & Yang, 2024. MultiHop-RAG dataset and evaluation. (Reports significant gaps in retrieving all relevant evidence for multi-hop queries with standard embedding models. Even with a reranker, Hits@4 was only ~0.66 on their multi-hop QA benchmark, underscoring challenges of direct similarity matching in multi-hop scenarios.)
(HERE) Zhuang et al., 2024. EfficientRAG: Efficient Retriever for Multi-Hop QA. (Highlights that one-round (single-shot) retrieval often fails on complex multi-hop questions because additional context beyond the first result is needed. Proposes an iterative retrieval without repeated LLM calls, improving multi-hop retrieval by generating new queries and filtering irrelevance in a loop.)
(HERE) Liu et al., 2025. Bridging Context Gaps with Coreference Resolution (LQCA). (Demonstrates that applying coreference resolution to long contexts improves the accuracy of key information extraction by reducing ambiguity and information loss. This suggests resolving “he/it/they” in documents helps LLMs retrieve and use information more effectively in QA.)
(HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation) Liu et al., 2024. HopRAG (Introduction). (Emphasizes that the goal of retrieval should extend beyond lexical or semantic similarity to logical relevance. Due to lack of logic awareness, conventional retrievers often return passages that are topically similar but not actually useful to answer the query, or they miss needed passages entirely – especially harmful in multi-hop and multi-document QA.)
(HERE) Tang & Yang, 2024. Evaluation of LLM performance with vs without correct retrieval. (Finds GPT-4 achieves only 56% accuracy on their multi-hop QA when using retrieved chunks, versus 89% with ground-truth evidence. This large gap is attributed to the retrieval component “falling short in retrieving relevant evidence”, highlighting how retrieval failures directly lead to incorrect answers.)
(HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation) Liu et al., 2024. HopRAG (Method). (Proposes constructing a passage graph with LLM-inferred connections and using a retrieve-reason-prune approach. The system starts from semantically similar passages then logically traverses to find truly relevant ones, guided by pseudo-queries. This approach improved answer accuracy by ~76% over standard retrieval, showing the benefit of multi-hop reasoning integrated into retrieval.)
(HERE) Tang & Yang, 2024. MultiHop-RAG (Embedding vs Reranker results). (Evaluates various embedding models and a BGE reranker on multi-hop queries. Even with a strong retriever+reranker, many questions remain partially answered due to missing evidence. The study underlines the difficulty of multi-hop retrieval and the need for specialized techniques beyond cosine similarity.)
(HERE) Liu et al., 2025. LQCA (Method). (Details a four-step pipeline to resolve coreferences in long documents and integrate mentions, thereby producing a context that an LLM can more easily handle. By replacing pronouns with representative mentions, the text clarity is enhanced, helping the model “better understand” long context queries.)
(HERE) Tang & Yang, 2024. Analysis of query types. (Notes that with retrieval augmentation, models mitigate hallucinations if they can determine a query can’t be answered with given text – effectively identifying when retrieval failed. However, smaller models often misinterpret or fail logical reasoning when retrieval is incomplete, leading to errors in comparison/temporal questions.)
(HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation) Liu et al., 2024. HopRAG (Figure 1 results). (Shows graphically that on multi-hop QA datasets (MuSiQue, 2Wiki, HotpotQA), even the best dense retrievers reach a maximum recall of ~0.45, evidencing “severe imperfect retrieval.” Also categorizes retrieved passages: many are only indirectly relevant (need a hop to reach answer) and a significant fraction are irrelevant – indicating a lot of noise in top-k retrieved by similarity.)
(QuOTE: Question-Oriented Text Embeddings) Tang & Yang, 2024. MultiHop-RAG dataset description. (Describes the MultiHop-RAG benchmark: 2,555 queries requiring retrieval from multiple documents, with each question linked to a list of supporting fact sentences across different sources – ensuring genuine multi-hop reasoning is needed. This dataset was created to explicitly test multi-document retrieval and cross-sentence reasoning in RAG systems.)
(Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA) Wang et al., 2024. Loong benchmark (Long multi-doc QA). (Introduces Loong, a long-context QA benchmark where every document is relevant to the answer. Skipping any document leads to failure, making it a stringent test for retrieval completeness. Initial experiments showed that RAG systems have poor performance on Loong, demonstrating that simply having a long context window or basic retrieval is not enough – models often fail unless they effectively leverage all provided documents.)