Table of Contents
In the context of Advanced Search Algorithms in LLMs, explain the keyword-based retrieval method
Introduction
Keyword-Based Retrieval in LLM Systems
Chunking Strategies for Retrieval
Dense Vector Retrieval
Hybrid Retrieval Models
Performance Benchmarks and Analysis
Conclusion
Introduction
Large Language Models (LLMs) are often enhanced with retrieval components to ground their responses in external documents, a technique known as Retrieval-Augmented Generation (RAG). A critical design choice in RAG systems is how to retrieve relevant information: keyword-based (sparse) retrieval versus dense vector retrieval, or a combination of both (hybrid). Keyword-based methods rely on lexical matching (e.g. overlapping words), while dense retrieval uses neural embeddings to capture semantic similarity. Each approach has distinct advantages, and recent research (2024–2025) has intensively compared their performance and best practices. Another key consideration is how to chunk documents into retrievable units, since LLM contexts are limited. This review surveys state-of-the-art methods for keyword-based retrieval in LLM applications – covering search algorithms and chunking strategies – and contrasts them with dense and hybrid retrieval models. We also discuss implementation details (indexing, ranking, querying) and examine benchmarks and real-world use cases, including document digitization scenarios.
Keyword-Based Retrieval in LLM Systems
Lexical search algorithms: Keyword-based retrieval typically uses inverted indexes and ranking functions like BM25. BM25 remains a widely used sparse retrieval algorithm, integral to search engines like Lucene and Elasticsearch (HERE). Its efficiency and strong generalization on diverse text make it a common baseline even in the LLM era. BM25 scores documents based on term frequency and inverse document frequency (IDF), favoring documents that share many query terms. However, it has well-known limitations: it treats query terms independently and lacks semantic understanding of language. In practice this means BM25 cannot capture synonyms or paraphrases – a query “purchase car” won’t match a document using “buy an automobile” unless those exact words appear (Introducing cascading retrieval: Unifying dense and sparse with reranking | Pinecone). This gap can hurt recall for natural language queries. Recent work has proposed enhancements to classical lexical search. For example, BMX (Li et al., 2024) extends BM25 with an entropy-weighted similarity measure and semantic query expansion; BMX consistently outperformed standard BM25 and even surpassed dense retrievers on certain long-document benchmarks. Such advances bridge classical keyword search with modern semantic techniques, showing that lexical methods remain highly competitive in the LLM era.
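To make the scoring concrete, the sketch below implements BM25 from scratch over a small tokenized corpus. It is a minimal illustration: the k1/b defaults and the Lucene-style smoothed IDF are common conventions rather than details taken from the papers above, and production systems would rely on Lucene, Elasticsearch, or a library such as rank_bm25 instead. It also reproduces the lexical-gap example: a query with no overlapping terms scores zero everywhere.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with BM25.

    corpus_tokens: list of documents, each a list of lowercase tokens.
    Returns a list of scores aligned with corpus_tokens.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency for each distinct query term.
    df = {t: sum(1 for doc in corpus_tokens if t in doc) for t in set(query_tokens)}
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            # Lucene-style smoothed IDF keeps scores non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

corpus = [doc.lower().split() for doc in [
    "buy an automobile from a local dealer",
    "train schedules for the city",
]]
print(bm25_scores("purchase car".lower().split(), corpus))  # no term overlap -> all zeros
```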
Indexing and querying: Implementing a keyword-based retriever involves building an inverted index mapping terms to document IDs (or chunk IDs). Best practices include text preprocessing (tokenization, lowercasing, possibly removing very common stopwords) and storing metadata with each chunk (e.g. document title or section) for context. Queries can be the raw user question or a processed form – for instance, extraneous phrasing or stopwords may be dropped to focus on key terms. Some pipelines use LLMs to aid keyword retrieval by extracting salient keywords from a question or generating synonym expansions. Maintaining an index is efficient: new documents can be indexed incrementally, and lookups are fast on CPUs. Ranking is typically done by BM25 or related scoring; this first-stage ranker can also feed into a second-stage re-ranker (like an LLM or a learned model) for improved accuracy. Recent research demonstrates that even zero-shot, an LLM can act as a ranker when given a query and some candidate passages (PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval), though applying an LLM to every document is usually impractical. In practice, keyword retrieval provides a strong initial candidate set with high precision for exact matches (e.g. technical terms, names).
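The following is a minimal sketch of this indexing workflow, assuming a plain in-memory Python index rather than a real engine: chunks are tokenized, posting lists map terms to chunk IDs, metadata travels with each chunk, new chunks can be added incrementally, and lookups return a candidate set for BM25 scoring and optional re-ranking.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}  # illustrative subset

def tokenize(text):
    """Lowercase, split on non-word characters, drop very common stopwords."""
    return [t for t in re.split(r"\W+", text.lower()) if t and t not in STOPWORDS]

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of chunk ids
        self.metadata = {}                 # chunk id -> {"doc": ..., "section": ..., "text": ...}

    def add_chunk(self, chunk_id, text, doc, section=None):
        """Index a chunk incrementally and keep its source metadata."""
        self.metadata[chunk_id] = {"doc": doc, "section": section, "text": text}
        for term in set(tokenize(text)):
            self.postings[term].add(chunk_id)

    def candidates(self, query):
        """Return chunk ids sharing at least one query term (first-stage candidates).

        These would then be ranked by BM25 and optionally re-ranked by a
        cross-encoder or an LLM.
        """
        ids = set()
        for term in tokenize(query):
            ids |= self.postings.get(term, set())
        return ids

index = InvertedIndex()
index.add_chunk("c1", "BM25 ranks documents by term frequency and IDF.", doc="ir_notes.md")
print(index.candidates("How does BM25 rank documents?"))  # -> {'c1'}
```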
Chunking Strategies for Retrieval
In LLM applications, documents (which may be long) are split into chunks before indexing. Chunking is essential to ensure that retrieved texts fit within LLM input limits and are topically coherent (Breaking up is hard to do: Chunking in RAG applications - Stack Overflow). The choice of chunk size and boundaries significantly impacts retrieval performance. If chunks are too large, a single embedding or BM25 representation of the chunk may dilute relevant content with unrelated text; if chunks are too small, important context might be lost. Recent studies and industry best practices emphasize finding a balance. Common chunking strategies include:
Fixed-size chunks: Splitting text into uniform blocks of a certain token or word length (e.g. every 200 tokens). This approach is simple and works well for homogeneous text (news articles, etc.). However, it may cut off semantic units (sentences or paragraphs) arbitrarily.
Overlapping sliding windows: Using fixed-length chunks with overlap between consecutive chunks. Overlap retains context at the boundaries (mitigating the chance that a relevant sentence lies split between chunks), improving recall at the cost of indexing some text redundantly. This increases index size and requires careful handling to avoid returning near-duplicate chunks. A sketch combining fixed-size splitting with overlap appears after this list.
Semantic or context-aware chunks: Splitting on natural boundaries like paragraph breaks, sentence delimiters, or XML/HTML tags. By using punctuation and document structure, this yields chunks that are semantically coherent (each chunk is a self-contained topic or section). For example, a Q&A webpage might be chunked by question, answer, and comments as separate units. This method preserves meaning but requires more preprocessing logic.
Adaptive or ML-guided chunking: Dynamically determining chunk boundaries using machine learning or heuristics that consider the content. For instance, an algorithm might merge or split chunks based on whether a segment is likely to answer a query. Advanced techniques use LLMs or classifiers to decide how to chunk each document optimally. This can produce highly tailored chunks (especially for documents with varying structure), but is compute-intensive.
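The sketch below (referenced in the sliding-window item above) illustrates the first two strategies together: fixed-size chunks of whitespace tokens with a configurable overlap. Real pipelines would usually count model tokens with a tokenizer rather than words, and the sizes here are illustrative. Each chunk carries its source document ID and token offset as metadata.

```python
def chunk_document(doc_id, text, chunk_size=200, overlap=50):
    """Split a document into fixed-size, overlapping chunks of whitespace tokens.

    Returns a list of dicts carrying the chunk text plus metadata (source
    document and token offset) for later filtering and context reconstruction.
    """
    assert 0 <= overlap < chunk_size
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "doc_id": doc_id,
            "start_token": start,
            "text": " ".join(window),
        })
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail of the document
    return chunks

chunks = chunk_document("report_2024.pdf", "some long extracted text " * 500,
                        chunk_size=200, overlap=50)
print(len(chunks), chunks[0]["start_token"], chunks[1]["start_token"])  # offsets 0 and 150
```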
In practice, a combination of approaches may be used (e.g. hierarchical chunking: index entire documents for coarse retrieval and individual paragraphs for fine retrieval). Many practitioners favor smaller, semantically focused chunks as a default. “Smaller semantically coherent units that correspond to potential user queries” tend to yield more accurate matches (Breaking up is hard to do: Chunking in RAG applications - Stack Overflow). Regardless of method, it’s crucial to store metadata linking each chunk to its source document and position. Such metadata allows reconstructing the context when an LLM forms a final answer, and enables filtering search results by document, section, or other attributes.
To validate a chunking strategy, one should empirically test retrieval quality. A recommended practice is to run sample queries against the index, then evaluate retrieved results with human judgment or LLM-based scoring. This helps tune the chunk size/overlap for a given use case. After choosing a strategy, further refinements like dropping very low-similarity results (to reduce noise) can improve the final grounded answers. Recent research also proposes LLM-based filtering at the chunk level: Aggarwal et al. (2024) introduce ChunkRAG, which uses an LLM to evaluate each retrieved chunk’s relevance, discarding irrelevant chunks before generation (ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems). This chunk-level reranking significantly reduced hallucinations and improved factual accuracy in RAG, by ensuring only the most pertinent text is passed into the LLM. Such techniques highlight the importance of fine-grained control over chunks in LLM retrieval pipelines.
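As an illustration of the idea behind chunk-level filtering (a generic sketch, not the actual ChunkRAG implementation), one can ask an LLM to score each retrieved chunk and drop those below a threshold. The `llm` callable below is a hypothetical stand-in for whatever completion API a pipeline uses.

```python
def filter_chunks_with_llm(llm, query, chunks, threshold=0.5):
    """Keep only chunks an LLM judges relevant to the query.

    `llm` is assumed to be any callable that takes a prompt string and returns
    the model's text completion (hypothetical interface, not a specific API).
    `chunks` are dicts with a "text" field, as produced by the chunker above.
    """
    kept = []
    for chunk in chunks:
        prompt = (
            "On a scale from 0 to 1, how relevant is the following passage to the "
            f"question?\nQuestion: {query}\nPassage: {chunk['text']}\n"
            "Answer with a single number."
        )
        try:
            score = float(llm(prompt).strip())
        except ValueError:
            score = 0.0  # treat unparseable answers as irrelevant
        if score >= threshold:
            kept.append(chunk)
    return kept
```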
Dense Vector Retrieval
Semantic embedding search: Dense retrieval methods encode documents and queries into high-dimensional vectors such that semantically similar text maps to nearby vectors. This approach, often using bi-encoder transformers (e.g. SBERT, or fine-tuned LLM-based encoders), can match query and document even if they have no words in common, by capturing synonyms and concepts. For instance, a query “buy a car” can retrieve a passage about “purchasing an automobile” via embedding similarity (Introducing cascading retrieval: Unifying dense and sparse with reranking | Pinecone). Dense retrieval has become a foundation of modern AI search systems for its ability to handle natural language variability. Many 2024 works leverage LLMs for improved dense retrieval. Wang et al. (2024) introduce E5, using LLM-generated synthetic queries to train powerful embedding models. Zhuang et al. (2024) propose PromptReps, prompting GPT-4 to generate hybrid dense and sparse representations for documents, achieving competitive zero-shot retrieval performance without explicit training (PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval). These innovations show dense retrieval’s flexibility – representations can even be produced by LLMs themselves.
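A minimal sketch of bi-encoder retrieval with the open-source sentence-transformers library is shown below; the model name is one common choice, not one prescribed by the papers cited. Query and chunks are encoded by the same model and compared by cosine similarity, which lets "buy a car" match a passage about "purchasing an automobile".

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small bi-encoder; any ST model works

chunks = [
    "Purchasing an automobile usually requires financing or a cash payment.",
    "Train schedules for the city are published every quarter.",
]
chunk_emb = model.encode(chunks, normalize_embeddings=True)

query_emb = model.encode("how do I buy a car", normalize_embeddings=True)
scores = util.cos_sim(query_emb, chunk_emb)[0]   # cosine similarity to each chunk
best = int(scores.argmax())
print(chunks[best], float(scores[best]))         # matches despite no shared keywords
```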
Indexing and search: Implementing dense retrieval requires computing vector embeddings for all chunks/documents and building an index to support fast nearest-neighbor search (often approximate). Popular vector indices (FAISS, HNSW, etc.) enable sub-linear search over millions of embeddings. Best practices include normalizing embeddings and choosing an appropriate distance metric (cosine similarity or inner product). Because embeddings are typically ~384–1024 dimensions (floats), the index can be memory-intensive; techniques like product quantization or clustering are used to compress and speed up search. At query time, the user query is encoded by the same model into a vector, and the index returns the top-K closest chunk vectors. These are then re-ranked or directly fed to the LLM. One challenge is that dense models can sometimes retrieve items that are topically related but not specifically answering the query (semantic drift). A common safeguard is to apply a secondary ranker or filter – e.g. ensure that at least one important keyword appears in the retrieved text, or use an LLM to double-check relevance.
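The indexing step might look like the following FAISS sketch, assuming L2-normalized embeddings so that inner product equals cosine similarity. An exact flat index is shown for brevity; at larger scale one would swap in an approximate structure such as HNSW or IVF-PQ, as noted in the comments.

```python
import numpy as np
import faiss

dim = 384                                    # must match the embedding model's output size
chunk_vectors = np.random.rand(10_000, dim).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(chunk_vectors)            # normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)               # exact inner-product search; at larger scale,
index.add(chunk_vectors)                     # use IndexHNSWFlat or IndexIVFPQ for approximate,
                                             # compressed nearest-neighbor search

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)         # top-5 most similar chunk ids and their scores
print(ids[0], scores[0])
```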
Performance characteristics: Dense retrieval models often outperform lexical methods on in-domain benchmarks that require understanding context beyond exact wording. A new benchmark for scientific literature search (LitSearch 2024) found that a state-of-the-art dense model (GritLM) achieved 74.8% recall@5, vastly outperforming BM25 by ~24.8 points on challenging research questions (HERE). Moreover, augmenting dense retrieval with an LLM re-ranker (GPT-4) further boosted recall. These results indicate the strength of dense semantic matching in specialized domains. However, dense methods can struggle to generalize to out-of-distribution topics without re-training. Interestingly, very recent work suggests that lexical retrieval can be more robust in zero-shot settings: Zeng et al. (2024) trained large Llama-based retrievers (1B–8B parameters) and observed that learned sparse retrievers consistently outperformed dense retrievers on both in-domain and out-of-domain evaluations (Scaling Sparse and Dense Retrieval in Decoder-Only LLMs). The gap was especially pronounced on the BEIR benchmark (a collection of diverse tasks), where the sparse model had significantly better zero-shot accuracy. They found sparse models generalized better as they grew larger, whereas dense models sometimes overfit to training data. These findings echo the intuition that explicit keyword signals (which sparse methods rely on) remain valuable for capturing exact relevance, factual entities, or rare terms that neural embeddings might miss (Introducing cascading retrieval: Unifying dense and sparse with reranking | Pinecone). In practice, the choice may depend on the use case: if queries require understanding nuanced language or synonyms (e.g. customer support questions), dense retrieval excels; if queries demand pinpoint accuracy on specific jargon, numbers, or names (e.g. legal case lookup), lexical retrieval might be more reliable. This trade-off has led to growing interest in hybrid retrieval systems that leverage both.
Hybrid Retrieval Models
Hybrid retrieval combines sparse and dense approaches to get the “best of both worlds”. In a hybrid setup, a query is processed through both a keyword index (e.g. BM25) and a vector index, and the results are fused. Fusion can be done by score normalization and weighted blending, or by simply taking the union of top results from each and re-ranking them together. Pinecone (2024) introduced a cascading retrieval system that supports such seamless combination: dense embeddings capture broad semantic context, and a complementary sparse index (using BM25 or learned sparse vectors) ensures precise keyword and entity matches. The merged candidates can then be re-ranked by a trained model or an LLM to assign a final relevance order. This cascade approach reported improved search quality on complex queries, compared to using either method alone. Academic work also confirms hybrid benefits. Zhang et al. (2025) developed LevelRAG, a multi-searcher RAG framework that uses a dense retriever, a web search API, and a Lucene-based sparse retriever in concert. It decomposes complex questions into sub-queries and routes them to the appropriate searcher – leveraging lexical retrieval for precise facts (via Lucene keywords) and dense/web search for broader context (LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers). LevelRAG significantly outperformed prior RAG methods on multi-hop QA tasks, even beating a GPT-4 based closed-book model on accuracy. Similarly, Nguyen et al. (2024) built a hybrid LLM-powered search for digital archives, using a weighted combination of BM25 and vector scores. By tuning the mix, they achieved higher precision than keyword search alone, with an optimal trade-off around an even contribution of dense and sparse (A Proposed Large Language Model-Based Smart Search for Archive System). In their experiments, the hybrid retriever (paired with a Mistral LLM) reached about 80.5% precision, a marked improvement over the purely sparse baseline.
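A weighted blend of the kind Nguyen et al. describe can be sketched roughly as follows (an illustrative approximation, not their exact formulation): both score sets are min-max normalized and combined with a tunable weight, where alpha = 0.5 corresponds to an even dense/sparse contribution.

```python
def normalize(scores):
    """Min-max scale a {doc_id: raw_score} dict into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def hybrid_scores(bm25_scores, dense_scores, alpha=0.5):
    """Blend sparse and dense scores; alpha weights the dense contribution.

    Both inputs are {doc_id: raw_score}; docs missing from one retriever
    contribute 0 for that component. Returns (doc_id, score) pairs, best first.
    """
    sparse, dense = normalize(bm25_scores), normalize(dense_scores)
    doc_ids = set(sparse) | set(dense)
    return sorted(
        ((d, alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)) for d in doc_ids),
        key=lambda pair: pair[1],
        reverse=True,
    )

print(hybrid_scores({"c1": 12.3, "c2": 4.1}, {"c2": 0.82, "c3": 0.77}, alpha=0.5))
```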
Implementation notes: A hybrid system typically maintains two indexes. At query time, one can retrieve e.g. top-N results from the dense index and top-M from the sparse index, then merge. Because dense and BM25 scores aren’t directly comparable, a common strategy is to rescale them (for instance, some convert BM25 scores to a 0–1 range or use rank-based fusion). An alternative is to train a lightweight model to take features from both (e.g. BM25 score, cosine similarity, etc.) and output a combined relevance score. Simpler yet, some teams use a two-stage cascade: run one retrieval method first to narrow the candidate set, then apply the second method on that subset. For example, one might use an embedding search to fetch 100 candidates with semantic diversity, then use BM25 to re-rank those emphasizing exact term overlap, or vice-versa (Introducing cascading retrieval: Unifying dense and sparse with reranking | Pinecone). In practice, hybrid retrieval often yields the best recall, especially for question-answering tasks: many open-source RAG pipelines default to using BM25 + embeddings together. A final re-ranking step (using a cross-attention model or the LLM itself) can further boost accuracy by resolving any remaining subtle relevance differences. Notably, the LitSearch study found that adding GPT-4 re-ranking on top of a hybrid retrieval improved results, suggesting that even after dense+lexical fusion, an LLM can refine the ranking with deeper understanding (HERE).
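Rank-based fusion, mentioned above as an alternative to score rescaling, can be as simple as Reciprocal Rank Fusion over the two result lists; the constant k = 60 below is the conventional default rather than a value taken from the sources.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list is ordered best-first (e.g. top-N from the dense index and
    top-M from the sparse index). Returns (doc_id, fused_score) pairs, best first.
    """
    fused = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

dense_top = ["c7", "c2", "c9"]      # from the vector index
sparse_top = ["c2", "c4", "c7"]     # from the BM25 index
print(reciprocal_rank_fusion([dense_top, sparse_top]))  # "c2" and "c7" rise to the top
```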
Performance Benchmarks and Analysis
Effectiveness: Recent benchmarks underscore that no single method (dense or sparse) dominates universally; performance is data-dependent. Dense retrievers tend to excel on average semantic similarity tasks, while lexical methods shine on robust, zero-shot scenarios or when exact matching is crucial (Scaling Sparse and Dense Retrieval in Decoder-Only LLMs). The strongest results are often achieved by ensembles or hybrid models. For example, learned sparse retrievers (which use transformer models to generate sparse term-weight vectors) have set state-of-the-art results on several benchmarks, essentially combining neural modeling with a keyword index. A 2024 study showed an 8B-parameter learned sparse model outperformed prior dense models like ColBERTv2 on the BEIR benchmark. On the other hand, new dense models like GritLM (2024) outperform BM25 by large margins on specialized tasks, and multi-vector models (e.g. ColBERT) that capture multiple aspects of text can greatly improve over single-vector dense baselines. Overall, hybrid approaches are emerging as the strongest in many contexts, as they can retrieve what dense alone or sparse alone would miss. Importantly, using LLMs in the loop (either to generate better indexes, or to re-rank results, or even to decide between retrieval modes) is a theme in current research. Li et al. (2024) propose a self-reflective routing where an LLM decides per query whether to use RAG or to rely on its own long-context reading, optimizing both accuracy and cost (HERE). Such meta-reasoning hints at future systems that dynamically choose retrieval strategies. In terms of speed, sparse retrieval with inverted indexes is extremely fast and scales to very large corpora on commodity hardware. Dense retrieval often requires more memory and sometimes GPU acceleration for fast query encoding, but approximate nearest neighbor algorithms keep query latency low (milliseconds for millions of vectors). Many real-world systems now achieve interactive speeds with vectors, especially with hardware-optimized libraries.
Real-world applications in document digitization: Organizations digitizing large document collections (scanned PDFs, archives, etc.) are adopting these retrieval techniques to enable LLM-based querying and analysis. A 2025 case study described a “smart search” system for digital archives that integrates LLMs with a hybrid retriever (A Proposed Large Language Model-Based Smart Search for Archive System). In archival collections (which include diverse media and OCR-extracted text), purely keyword search often falls short, missing nuances of user queries. The proposed system uses metadata extraction and hybrid RAG to allow natural language queries over scanned documents, with the LLM providing conversational answers grounded in the retrieved text. The result is a more intuitive search experience in domains like historical archives and enterprise document management, where users can ask questions and get answers with sources from their digitized files. However, digitized documents bring challenges: OCR errors and formatting artifacts can degrade retrieval performance. Zhang et al. (2024) introduced OHRBench, a benchmark assessing the impact of OCR noise on RAG systems (OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation). They found that current OCR pipelines introduce both formatting noise (broken layout, errors in tables/formulas) and semantic noise (misrecognized words) that significantly hurt retrieval and QA accuracy. No existing OCR was fully adequate for high-quality knowledge bases in their evaluation. This highlights a need for robust retrieval in digitization workflows. Keyword-based search can fail if critical terms are mis-spelled by OCR, and dense embeddings might map noisy text to wrong meanings. To mitigate this, some systems use spell-correction or synonym expansion on OCR text, or employ vision-language models that bypass OCR by directly reading scanned images. Despite these issues, many real-world document processing solutions successfully use chunked OCR text with hybrid retrieval. Industries such as finance and legal are using RAG LLMs on corporate document repositories (contracts, invoices, case files), often starting with a BM25 search over OCR text, then re-ranking results with an LLM to answer user queries. Hybrid search improves recall of relevant info even if wording differs between query and document (common in FAQ or policy documents where a user’s phrasing may differ from the text). In sum, retrieval-augmented LLMs are transforming document digitization: scanned files become queryable knowledge sources. By indexing chunks of text and combining lexical precision with semantic reach, these systems can efficiently unlock information from piles of digitized documents that would otherwise require manual search.
Conclusion
In the 2024–2025 landscape, keyword-based retrieval has proven to be a resilient and even resurgent technology for LLM grounding. Modern lexical algorithms (BM25 and its upgrades) deliver robust performance and remain a staple for indexing and first-stage retrieval (HERE). Dense vector retrieval provides powerful semantic matching that complements lexical search, and the trend is towards hybrid retrieval architectures that capitalize on both exact term overlap and neural similarity (Introducing cascading retrieval: Unifying dense and sparse with reranking | Pinecone). Effective hybrid systems carefully chunk content, index it in multiple forms, and use learned strategies to rank and filter results. Implementing such systems requires attention to indexing structures (inverted files, ANN graphs), proper query handling, and sometimes multi-step reranking. When done well, the reward is significantly improved LLM responses: grounded in relevant documents, less prone to hallucination, and capable of handling both explicit and implicit query semantics. Ongoing research continues to refine these methods – from LLM-driven indexing and chunk filtering (ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems) to adaptive retrieval routing (HERE) – but the core insight is clear: combining keywords and vectors, along with smart chunking, is key to unlocking the full potential of large language models in real-world information retrieval and document understanding tasks.
References: The answer is based on a review of recent literature (2024–2025) on information retrieval for LLMs, including findings from academic papers and industry reports. Key sources include Li et al. (2024) on BM25/BMX, Zeng et al. (2024) (Scaling Sparse and Dense Retrieval in Decoder-Only LLMs) on sparse vs dense scaling, Zhuang et al. (2024) (PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval) on prompt-based representations, Zhang et al. (2024) (OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation) on OCR noise, Zhang et al. (2025) (LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers) on LevelRAG hybrid search, Nguyen et al. (2024) (A Proposed Large Language Model-Based Smart Search for Archive System) on archive search, the StackOverflow/Pinecone discussions on chunking (Breaking up is hard to do: Chunking in RAG applications - Stack Overflow) and hybrid search best practices, and the LitSearch 2024 benchmark results (HERE), among others. These sources are cited in-text to support the analysis.