Table of Contents
Vector Search and ANN Indexing Strategies
Dense Semantic Retrieval (DPR, ColBERT, etc.)
Hybrid Dense+Lexical Retrieval Methods
Efficiency Optimizations: Indexing, Latency, Accuracy, Scalability
Document Chunking and Preprocessing Strategies
Efficient Integration of Retrieval-Augmented Generation (RAG)
Large Language Models (LLMs) benefit greatly from advanced search and retrieval techniques to ground their responses in relevant context. Recent research (2024–2025) has focused on improving vector search, semantic retrieval, and hybrid methods, with an emphasis on efficiency, accuracy, and scalability. Additionally, new document chunking strategies and retrieval-augmented generation (RAG) integrations have been proposed to enhance LLM performance. Below, we survey key findings and innovations in each area, with citations from recent arXiv papers.
Vector Search and ANN Indexing Strategies
Vector search underpins retrieval for LLMs by mapping text to high-dimensional embeddings and using Approximate Nearest Neighbor (ANN) algorithms for fast similarity search. Libraries like FAISS (with IVF and product quantization), HNSW (graph-based ANN), and ScaNN are widely used baselines. However, even with efficient ANN indexes (e.g. HNSW, IVF), dense retrieval often incurs higher latency than lexical search (LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors). Research in 2024–2025 has introduced improved indexing techniques to narrow this gap:
Spilling with Orthogonality (SOAR) – Google Research (2024) introduced an ANN indexing method that trains multiple vector quantization (VQ) indexes with an orthogonality-amplified residual loss. Each index is optimized to cover cases where others fail, reducing missed nearest neighbors (SOAR: Improved Indexing for Approximate Nearest Neighbor Search). This yields state-of-the-art retrieval accuracy on billion-scale benchmarks while keeping indexing fast and memory low. In short, SOAR’s redundant-but-uncorrelated partitions drastically improve recall without large efficiency penalties.
Multi-Residual Quantization (MRQ) – Yang et al., 2024 proposed a quantization-based ANN method focusing on indexing efficiency and query speed. MRQ achieves higher vector compression (only one-third the code length of prior methods) and uses an efficient distance-correction scheme (Fast High-dimensional Approximate Nearest Neighbor Search with Efficient Index Time and Space). It outperforms state-of-the-art graph and VQ baselines with up to 3× faster query throughput at the same accuracy, showing that smarter quantization can reduce memory and latency simultaneously.
CrackIVF (Adaptive Indexing) – Zampetakis et al., 2025 presented CrackIVF, a partition-based index that builds itself adaptively as queries arrive (Cracking Vector Search Indexes). Initially, it performs near brute-force search to answer queries immediately (no waiting for index build) and gradually refines the index with each query. After enough queries, CrackIVF’s index quality matches a fully built IVF index. Impressively, it can handle over 1 million queries before other methods finish indexing, with 10–1000× faster initialization than traditional indexing. This on-the-fly indexing is ideal for “cold start” scenarios or rarely accessed data, as it avoids heavy upfront indexing costs.
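For orientation, the sketch below shows the kind of baseline these methods improve upon: a FAISS IVF-PQ index built over dense embeddings, with a graph-based HNSW alternative noted in a comment. It is a minimal illustration that uses random vectors in place of real text embeddings; the parameter values (nlist, PQ code size, nprobe) are illustrative defaults, not recommendations from the papers above.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768            # embedding dimension (e.g. a BERT-style encoder)
n_docs = 100_000
rng = np.random.default_rng(0)

# Stand-in for document embeddings; in practice these come from a text encoder.
xb = rng.standard_normal((n_docs, d)).astype("float32")
xq = rng.standard_normal((16, d)).astype("float32")   # query embeddings

# IVF-PQ: coarse partitioning into nlist cells + product quantization (64 bytes per vector).
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)   # nlist=1024, 64 subquantizers, 8 bits each
index.train(xb)                                       # learn coarse centroids and PQ codebooks
index.add(xb)
index.nprobe = 16                                     # cells visited per query (recall/latency knob)

D, I = index.search(xq, 10)                           # distances and doc ids of the top-10 neighbors
print(I[0])

# Graph-based alternative: HNSW trades memory for low latency at high recall.
# hnsw = faiss.IndexHNSWFlat(d, 32)   # 32 links per node
# hnsw.add(xb); hnsw.search(xq, 10)
```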
Dense Semantic Retrieval (DPR, ColBERT, etc.)
Dense retrieval uses neural embeddings to find semantically relevant texts, and methods like DPR and ColBERT are core to many LLM knowledge retrieval pipelines. Recent works examine how to further improve these models’ retrieval accuracy and integration with LLMs:
Mechanistic Analysis of DPR – Reichman & Heck, 2024 investigated what DPR learns during fine-tuning for retrieval. They found that DPR training “decentralizes” knowledge in the model, creating multiple latent pathways to the same information (Retrieval-Augmented Generation: Is Dense Passage Retrieval Retrieving?). However, a notable limitation is that DPR cannot retrieve facts beyond the model’s original pre-training knowledge. In other words, if a fact wasn’t encoded in the pretrained language model, fine-tuning DPR won’t magically make it retrievable. These insights suggest improving dense retrievers by exposing them to new knowledge or representations beyond the base model’s scope. The authors propose ideas like injecting new facts as dense vectors or modeling uncertainty for missing knowledge.
LLM-Augmented Retrieval – To push dense retrieval quality further, Wu & Cao (ICLR 2025) introduced a model-agnostic LLM-augmented embedding framework. The idea is to use a large language model to enhance document representations before indexing. This approach significantly boosted the performance of popular retrievers – including bi-encoders like Contriever and late-interaction models like ColBERTv2 – achieving new state-of-the-art results on benchmarks (LoTTE, BEIR) (LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding). Essentially, the LLM acts as a powerful document understander (e.g. generating richer embeddings or contextual summaries), which, when fed into standard retrieval models, yields superior accuracy.
Weakly-Supervised Dense Retrieval (W-RAG) – Nian et al., 2024 tackled the problem of scarce training data for dense retrievers in open-domain QA. Their system W-RAG uses an LLM itself to generate training labels for the retriever (W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering). Specifically, they retrieve candidate passages with BM25, then use an LLM to re-rank them by estimating the likelihood that each passage contains the correct answer. The top passages (according to the LLM) serve as positive examples to fine-tune the dense retriever. This LLM-guided weak labeling substantially improved retrieval performance and end-to-end QA accuracy on multiple benchmarks, compared to training the dense retriever with no or limited human-provided labels. It demonstrates a clever synergy: using an LLM’s reasoning ability to teach a smaller retriever model where to look for evidence.
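To make the bi-encoder setup concrete, here is a minimal DPR-style retrieval sketch using the Hugging Face transformers DPR classes with a brute-force dot-product search. The checkpoints are the standard NQ-trained DPR models, and the corpus is a toy in-memory list rather than a real ANN index; treat it as a sketch of the scoring scheme, not a production pipeline.

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

passages = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "HNSW is a graph-based algorithm for approximate nearest neighbor search.",
    "Dense Passage Retrieval encodes questions and passages with two BERT encoders.",
]

with torch.no_grad():
    # Offline step: embed the corpus once (in practice, push these vectors into an ANN index).
    p_inputs = ctx_tok(passages, padding=True, truncation=True, return_tensors="pt")
    p_emb = ctx_enc(**p_inputs).pooler_output            # (num_passages, 768)

    # Online step: embed the query and rank passages by inner product, as in DPR.
    q_inputs = q_tok("What does DPR use to encode passages?", return_tensors="pt")
    q_emb = q_enc(**q_inputs).pooler_output               # (1, 768)

scores = q_emb @ p_emb.T                                   # dot-product relevance scores
best = scores.argmax(dim=1).item()
print(passages[best])
```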
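The W-RAG labeling step described above can likewise be sketched in a few lines: score each BM25 candidate by the log-likelihood an LLM assigns to the gold answer given that passage, then keep the top-scoring passages as weak positives for retriever fine-tuning. The scoring function below is a generic log-likelihood computation with a small causal LM from transformers, with a made-up prompt format; it is not the authors' exact prompt or model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")               # small stand-in LLM for illustration
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_loglik(passage: str, question: str, answer: str) -> float:
    """Log-likelihood of the answer tokens given passage + question (higher = better evidence)."""
    prompt = f"Passage: {passage}\nQuestion: {question}\nAnswer:"
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = lm(input_ids).logits
    # Score only the answer positions: the logit at position t predicts token t+1.
    log_probs = torch.log_softmax(logits[0, prompt_ids.size(1) - 1 : -1], dim=-1)
    return log_probs.gather(1, answer_ids[0].unsqueeze(1)).sum().item()

# Re-rank BM25 candidates; the top passages become weak positives for the dense retriever.
candidates = ["Paris is the capital of France.", "BM25 is a ranking function."]
scored = sorted(candidates,
                key=lambda p: answer_loglik(p, "What is the capital of France?", "Paris"),
                reverse=True)
print(scored[0])
```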
Hybrid Dense+Lexical Retrieval Methods
Hybrid retrieval combines lexical keyword matching (e.g. BM25) with dense semantic matching, aiming to get the precision of keyword search and the recall of embeddings. Several 2024 studies highlight the value of hybrid approaches:
Lexical + Dense Complements – Mandikal & Mooney, 2024 showed that for domain-specific corpora (e.g. scientific literature), dense embeddings alone may not outperform classical methods. In their experiments on medical paper retrieval, a SOTA dense model (SPECTER2) gave no significant boost over BM25. But a hybrid model integrating sparse and dense signals “yielded significantly better results” than either alone (Sparse Meets Dense: A Hybrid Approach to Enhance Scientific Document Retrieval). This underscores that lexical and neural methods often retrieve different relevant documents, so combining them improves overall recall/precision.
LexBoost (Dense-guided Lexical) – Mitra et al., 2024 proposed LexBoost, a novel hybrid that retains the speed of lexical search while leveraging dense relationships (LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors). Instead of merging dense and sparse scores at query time, LexBoost uses a dense retriever offline to identify clusters of similar documents (each document’s nearest neighbors in embedding space). At query time, only a standard lexical search is run – but when scoring a document, LexBoost also considers the BM25 scores of its neighboring documents (under the intuition that if related docs have high scores, the document is likely relevant too). Because the neighbor mapping is precomputed, this method adds negligible latency at query time. Experiments showed statistically significant gains over strong lexical baselines (BM25, QL, etc.), essentially “boosting” lexical retrieval effectiveness by harnessing dense similarity, without the typical speed penalty of neural methods.
Unified Dense-Sparse Index (Graph ANNS) – Zhang et al., 2024 addressed hybrid retrieval at the systems level, by building a single ANN index for combined dense+sparse vectors. Merging these representations is challenging (different scales and dimensions) and can be slow. They proposed a graph-based ANN algorithm with two key innovations: (1) a distribution alignment technique to normalize and balance contributions of sparse vs. dense components (improving hybrid retrieval accuracy by up to 9%) (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search), and (2) an adaptive two-stage search that first uses only dense vector distances (fast) and then gradually incorporates sparse dimensions, pruning insignificant sparse features to save time. This approach yielded a remarkable 8.9×–11.7× end-to-end throughput improvement at the same accuracy compared to naive hybrid search implementations. In essence, by intelligently fusing indexing and search for lexical and semantic signals, they achieve the best of both worlds: robust hybrid retrieval that is still efficient and scalable.
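As a concrete illustration of the basic dense+lexical combination that these papers build on, the sketch below fuses min-max-normalized BM25 scores with dense cosine scores via a weighted sum. It uses the rank_bm25 package and random vectors as stand-in embeddings; the fusion weight and normalization are generic choices for illustration, not the specific schemes from LexBoost or the graph-based hybrid index.

```python
import numpy as np
from rank_bm25 import BM25Okapi   # pip install rank-bm25

docs = [
    "hybrid retrieval combines bm25 with dense embeddings",
    "dense passage retrieval uses neural encoders",
    "bm25 is a classical lexical ranking function",
]
tokenized = [d.split() for d in docs]
bm25 = BM25Okapi(tokenized)

query = "lexical and dense hybrid search"
lex_scores = np.array(bm25.get_scores(query.split()))

# Stand-in dense scores: in practice, cosine similarity between query and document embeddings.
rng = np.random.default_rng(0)
doc_emb = rng.standard_normal((len(docs), 384))
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)
q_emb = rng.standard_normal(384)
q_emb /= np.linalg.norm(q_emb)
dense_scores = doc_emb @ q_emb

def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5   # weight on the lexical signal; tuned per corpus in practice
hybrid = alpha * minmax(lex_scores) + (1 - alpha) * minmax(dense_scores)
print(np.argsort(-hybrid))   # document ranking under the fused score
```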
Efficiency Optimizations: Indexing, Latency, Accuracy, Scalability
A recurring theme across recent works is optimizing retrieval along multiple axes simultaneously. Key areas of focus and advancements include:
Indexing Efficiency & Scalability: Building indexes faster and handling larger corpora has seen progress. The CrackIVF approach (above) drastically cuts indexing delay by incremental index construction, allowing immediate query service on new data (Cracking Vector Search Indexes). In terms of scale, methods like SOAR demonstrated that even billion-vector datasets can be indexed with low memory and reasonable build times while achieving leading accuracy (SOAR: Improved Indexing for Approximate Nearest Neighbor Search). The introduction of unified indexes for hybrid data (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search) also improves scalability, as a single index can handle both sparse and dense features, avoiding separate systems for each.
Query Latency Reduction: Several innovations explicitly target lower retrieval times. MRQ’s efficient distance computation and higher compression cut down per-query work, yielding up to 3× faster retrieval with no accuracy loss (Fast High-dimensional Approximate Nearest Neighbor Search with Efficient Index Time and Space). On the algorithm side, LexBoost achieves improvements without any neural network calls at query time, so it adds virtually no latency overhead to standard BM25 retrieval (LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors). Likewise, the graph-based hybrid search uses a two-phase strategy to avoid costly sparse distance calculations until necessary, accelerating queries by an order of magnitude. Overall, a trend is to shift work offline (preprocessing, indexing, caching) so that answering queries can be both fast and accurate.
Retrieval Accuracy: To improve the quality of results (especially for complex information needs), researchers combine complementary signals and better training data. The multi-view indexing of long documents (described below) boosted top-k recall by over 40% in some cases (Multi-view Content-aware Indexing for Long Document Retrieval), directly translating to better LLM answers. Hybrid and LLM-augmented retrievers are consistently reporting new state-of-the-art retrieval accuracy on benchmarks (LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding). Additionally, using LLMs to guide retriever training (e.g. in W-RAG) has improved the relevance of retrieved evidence, thereby increasing end-task performance in QA settings (W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering).
Dynamic Resource Trade-offs: Scalability isn’t just about data size, but also making retrieval adapt to different query needs and cost budgets. For example, one 2025 study introduced a user-controllable RAG framework that can dial retrieval up or down per query (Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control). It uses two classifiers – one favoring high accuracy (more retrieval) and one favoring efficiency (minimal retrieval) – and a knob to balance them. This way, simple questions can be answered quickly with little or no retrieval, while hard questions trigger multi-step retrieval, all under a flexible budget constraint. Such adaptive systems ensure that adding retrieval helps LLMs answer better, yet avoid wasting computation when it’s not needed.
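The dial between the two classifiers can be pictured with a toy sketch: interpolate an "accuracy" score and an "efficiency" score with a user-set weight and map the result to a retrieval depth. The classifier probabilities, thresholds, and retrieval depths below are hypothetical placeholders, not the models or settings from the cited paper.

```python
from dataclasses import dataclass

@dataclass
class RetrievalPlan:
    rounds: int   # 0 = answer from parametric memory, 1 = single-pass RAG, 2+ = iterative retrieval
    top_k: int

def plan_retrieval(p_needs_context: float, p_cheap_enough: float, alpha: float) -> RetrievalPlan:
    """Blend a high-accuracy classifier (favors retrieving) with an efficiency classifier
    (favors skipping retrieval). alpha in [0, 1]: 1.0 = accuracy-first, 0.0 = cost-first.
    The probabilities and thresholds here are illustrative placeholders."""
    score = alpha * p_needs_context + (1 - alpha) * (1 - p_cheap_enough)
    if score < 0.3:
        return RetrievalPlan(rounds=0, top_k=0)      # simple query: no retrieval
    if score < 0.7:
        return RetrievalPlan(rounds=1, top_k=5)      # standard single-step RAG
    return RetrievalPlan(rounds=3, top_k=10)         # hard query: iterative retrieval

# Same query, different user budgets:
print(plan_retrieval(p_needs_context=0.8, p_cheap_enough=0.6, alpha=0.9))  # accuracy-first
print(plan_retrieval(p_needs_context=0.8, p_cheap_enough=0.6, alpha=0.2))  # cost-first
```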
Document Chunking and Preprocessing Strategies
How documents are segmented and preprocessed can dramatically affect retrieval quality for LLMs (especially for long documents that exceed context windows). Recent research has explored smarter chunking beyond the naive fixed-length approach:
Content-aware Chunking (MC-indexing) – Dong et al., 2024 proposed Multi-view Content-aware Indexing (MC-indexing) for long documents. Instead of breaking text into arbitrary 200-token chunks, they leverage the document’s structure (sections, subsections, etc.) to create semantically coherent chunks (Multi-view Content-aware Indexing for Long Document Retrieval). Each chunk is then indexed in three forms: the raw text, a keyword list, and a summary of that chunk. This multi-view representation enriches what the retriever “sees” for each section (filling in context that might be missing in raw text alone). Notably, MC-indexing is unsupervised and plug-and-play – it doesn’t require training, and works with any existing dense or sparse retriever to improve its results. In experiments on long-document QA, MC-indexing yielded massive gains in retrieval recall – for example, +42.8% improvement in recall@1 on average (across 8 different retrievers) compared to standard fixed-length chunking. By segmenting by logical content and augmenting chunks with summaries/keywords, the retriever is much more likely to grab the truly relevant sections for answering the question.
Hierarchical Summarization & Chunking – Other works (late 2023 and 2024) have explored building hierarchical representations of long texts to guide retrieval. For instance, researchers have experimented with creating a tree of summaries: first splitting a long document into sections, summarizing each, and even summarizing those summaries, allowing a retriever or LLM to navigate at multiple levels of granularity. Such techniques aim to let the LLM find relevant information by first retrieving a high-level summary, then drilling down into the document only if needed. While details vary by approach (e.g., AttnWalker and RAPTOR are cited as methods for recursive document summarization in prior work), they share the goal of coping with long, structured content more effectively than flat chunk lists. These hierarchical chunking strategies, together with MC-indexing’s multi-view chunks, represent the state-of-the-art preprocessing for long-context retrieval. They improve the chances that an LLM actually gets the needed context (and not irrelevant text) in its input window, thereby boosting answer correctness and reducing hallucinations.
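A minimal version of structure-aware, multi-view chunking might look like the sketch below: split on Markdown-style headings, then store three views per chunk (raw text, keywords, and a summary) that all point back to the same section. The keyword and summary functions are deliberately naive placeholders; MC-indexing generates its views with much stronger models.

```python
import re
from collections import Counter

def split_by_headings(doc: str):
    """Split a document into (heading, body) chunks at Markdown-style headings."""
    parts = re.split(r"^(#+ .+)$", doc, flags=re.MULTILINE)
    chunks, i = [], 1
    while i < len(parts) - 1:
        chunks.append({"heading": parts[i].strip(), "text": parts[i + 1].strip()})
        i += 2
    return chunks

def keywords(text: str, k: int = 5):
    """Crude keyword view: most frequent non-trivial tokens (a real system would use an LLM)."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) > 3]
    return [w for w, _ in Counter(tokens).most_common(k)]

def summary(text: str):
    """Placeholder summary view: first sentence; in practice, an LLM-generated abstract."""
    return text.split(". ")[0]

doc = """# Methods
We index each section separately and attach keywords and a summary to every chunk.
# Results
Multi-view chunks improved recall@1 substantially over fixed-length chunking."""

index_entries = []
for chunk in split_by_headings(doc):
    # Each chunk contributes three retrievable views that resolve to the same section.
    index_entries += [
        {"view": "text", "content": chunk["text"], "section": chunk["heading"]},
        {"view": "keywords", "content": " ".join(keywords(chunk["text"])), "section": chunk["heading"]},
        {"view": "summary", "content": summary(chunk["text"]), "section": chunk["heading"]},
    ]

for e in index_entries:
    print(e["section"], "|", e["view"], "|", e["content"][:60])
```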
Efficient Integration of Retrieval-Augmented Generation (RAG)
Finally, a number of 2024–2025 papers address how retrieval connects with generation in LLM workflows, aiming to enhance overall system quality and cost-efficiency. Key insights include:
Long Context vs. RAG – When to Use Which: As LLMs with very long context windows (tens of thousands of tokens) emerge, a natural question is whether they still need external retrieval. Li et al., 2024 conducted a comprehensive comparison of Retrieval-Augmented Generation (RAG) vs. Long-Context LLMs (Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach). They found that a sufficiently powerful long-context model (e.g., GPT-4 or Google’s Gemini) can outperform RAG pipelines in accuracy if it can fit all relevant text in its context. However, RAG remains far more cost-efficient, since retrieving a handful of passages is cheaper than prompting a giant LLM with an entire long document. To get the best of both worlds, they propose SELF-ROUTE, a hybrid system that lets the model self-reflect on the query and decide whether to use RAG or long-context processing. Easy or narrow questions can be answered via RAG (cheaper), while complex ones leverage the long-context model. This dynamic routing preserved the strong performance of the long-context model while cutting computation costs by a large margin. It illustrates a practical approach to reducing inference cost: use expensive LLM capacity only when necessary, otherwise fall back on lighter retrieval+generation.
Cost-Constrained Retrieval Optimization: Instead of treating retrieval size as fixed, Wang et al., 2024 introduced CORAG, an optimization framework that chooses which and how many chunks to retrieve under a budget constraint (CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation). CORAG uses Monte Carlo Tree Search (MCTS) to evaluate different combinations of candidate passages, considering interactions between chunks (since some pieces of text are redundant when seen together). Importantly, it recognizes that adding more context to the prompt has non-monotonic utility – beyond a point, extra chunks can confuse the model or waste tokens. By treating the number of chunks as part of a constrained optimization (not just maximizing recall), CORAG finds an optimal subset of passages that yields the best answer for the least cost. It improved answer accuracy by up to 30% over baseline RAG methods at given budget levels. This work aligns with a broader theme: smarter selection and ranking of retrieved passages (not just top-k by similarity) can boost performance and avoid overloading the LLM with unnecessary text.
Adaptive Multi-step Retrieval: Complex queries often require iterative retrieval (where the LLM retrieves, reads, then formulates a new query). Running multiple retrieval rounds obviously increases cost. A 2025 study addressed this with a flexible RAG pipeline that can toggle between single-step and multi-step retrieval based on query complexity (Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control). They train a classifier to detect if a question likely needs deeper reasoning, and only then allow iterative retrieval loops; simpler queries are answered with one-pass retrieval or even no retrieval. Furthermore, they expose a user parameter to adjust the accuracy-vs-cost trade-off on the fly. This means an application could choose to favor speed (at slight risk to accuracy) or vice versa, depending on context. Such adaptive systems ensure latency is minimized for easy queries, while harder questions still get the benefit of thorough retrieval and reasoning when needed. It’s an important step toward making RAG more efficient and user-aligned in practical deployments.
Robust RAG with Smaller Models: Another angle of recent research is using retrieval to boost the performance of smaller, cheaper LLMs, thereby saving cost. Instead of always using a 175B-parameter model, one can use a 7B or 13B model augmented with a strong retriever to achieve comparable results on knowledge-intensive tasks. For example, recent work on robust RAG with small-scale LLMs reports that optimizing the retrieval component (through better negative sampling and answer-conditioned retriever training) can let a 7B LLM + RAG pipeline approach the accuracy of a much larger model on open QA at a fraction of the inference cost (Robust Retrieval-augmented Generation with Small-scale LLMs via ...). This line of work highlights that RAG can be a great “force multiplier” for smaller LMs, making them viable in settings where using GPT-4 or other giant models would be too slow or expensive.
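The budget-constrained selection idea from CORAG can be illustrated with a much simpler stand-in than MCTS: a greedy loop that adds the chunk with the best marginal relevance per token until the token budget is exhausted, penalizing redundancy with already-selected chunks. Everything here (the scores, the redundancy measure, the budget, the penalty weight) is a toy approximation of the cost-aware selection problem, not the paper's algorithm.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    relevance: float   # e.g. retriever or reranker score
    tokens: int

def redundancy(a: str, b: str) -> float:
    """Toy lexical-overlap measure; a real system would compare embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def select_under_budget(chunks: list[Chunk], budget_tokens: int, lam: float = 0.5) -> list[Chunk]:
    """Greedy cost-aware selection: maximize marginal (relevance - redundancy) per token spent."""
    selected, remaining, spent = [], list(chunks), 0
    while remaining:
        def marginal(c: Chunk) -> float:
            penalty = max((redundancy(c.text, s.text) for s in selected), default=0.0)
            return (c.relevance - lam * penalty) / c.tokens
        best = max(remaining, key=marginal)
        if spent + best.tokens > budget_tokens or marginal(best) <= 0:
            break
        selected.append(best)
        spent += best.tokens
        remaining.remove(best)
    return selected

chunks = [
    Chunk("Paris is the capital and largest city of France.", 0.9, 60),
    Chunk("The capital of France is Paris.", 0.85, 40),          # largely redundant with the first
    Chunk("France borders Spain, Germany, and Italy.", 0.4, 50),
]
for c in select_under_budget(chunks, budget_tokens=120):
    print(c.relevance, c.text)
```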
In summary, advanced search algorithms for LLMs are rapidly evolving. Vector search innovations are making it faster and cheaper to index and query huge embedding collections. Dense semantic retrieval is being enhanced through LLM guidance and better training regimes. Hybrid retrieval methods are marrying the strengths of neural and lexical approaches to great effect. Across the board, there’s a strong focus on optimization – reducing latency, memory, and cost – without sacrificing accuracy. Better chunking of documents and clever RAG pipeline designs further ensure that LLMs get the right information at the right time, improving reliability and efficiency. The 2024–2025 research landscape shows a clear trend: integrating retrieval more intelligently with large language models is key to unlocking their full potential in real-world applications.
Sources:
Z. Li et al., “Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach,” arXiv preprint, 2024.
P. Sun et al., “SOAR: Improved Indexing for Approximate Nearest Neighbor Search,” arXiv preprint, 2024.
M. Yang et al., “Fast High-dimensional Approximate Nearest Neighbor Search with Efficient Index Time and Space,” arXiv preprint, 2024.
S. Zampetakis et al., “Cracking Vector Search Indexes,” arXiv preprint, 2025.
B. Reichman and L. Heck, “Retrieval-Augmented Generation: Is Dense Passage Retrieval Retrieving?” arXiv preprint, 2024.
M. Wu and S. Cao, “LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding,” arXiv preprint, 2024.
J. Nian et al., “W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering,” arXiv preprint, 2024.
P. Mandikal and R. Mooney, “Sparse Meets Dense: A Hybrid Approach to Enhance Scientific Document Retrieval,” arXiv preprint, 2024.
B. Mitra et al., “LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors,” arXiv preprint, 2024.
H. Zhang et al., “Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search,” arXiv preprint, 2024.
K. Dong et al., “Multi-view Content-aware Indexing for Long Document Retrieval,” arXiv preprint, 2024.
Z. Wang et al., “CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation,” arXiv preprint, 2024.
J. Su et al., “Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control,” arXiv preprint, 2025.