Table of Contents
What Are Architecture Patterns for Information Retrieval and Semantic Search?
Vector Indexing and Document Chunking
Retrieval-Augmented Generation (RAG)
Hybrid Retrieval Models
A review of document digitization and chunking for feeding data to LLMs, with an emphasis on vector search architectures, retrieval-augmented generation (RAG), and hybrid search models.
Introduction: Large Language Models (LLMs) have limited context windows, making it infeasible to feed entire documents into a prompt directly. Instead, pipelines digitize and chunk documents, then retrieve only the most relevant pieces as additional context. This retrieval-augmented generation (RAG) approach augments an LLM’s prompts with external knowledge, letting the model draw on information beyond its training data. Grounding responses in retrieved documents improves factual accuracy, reduces hallucinations, and makes outputs more transparent and verifiable. We review recent (2024–2025) advances in document digitization and chunking for LLM pipelines, focusing on vector search architectures, RAG frameworks, and hybrid search models.
Vector Indexing and Document Chunking
LLM pipelines begin by digitizing documents – extracting text from raw sources – and then splitting that text into chunks. PDF files often have complex layouts (multi-column text, headers, footers, tables, images) or exist as scans; these factors complicate text extraction, requiring careful preprocessing to obtain clean text. Once text is extracted, it is segmented into smaller chunks for retrieval. Typically, each chunk is encoded as a vector embedding and indexed in a similarity search database (e.g. FAISS).
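To make the indexing step concrete, the sketch below splits extracted text into overlapping fixed-size chunks, embeds each chunk, and stores the vectors in a FAISS index. The embedding model, chunk size, and input file name are illustrative assumptions, not choices prescribed by the work discussed here.

```python
# Minimal indexing sketch: fixed-size chunking + embeddings + FAISS.
# Assumes sentence-transformers and faiss-cpu are installed; the model name,
# chunk size, and input file are placeholder choices.
import faiss
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model works here
chunks = chunk_text(open("report.txt", encoding="utf-8").read())
embeddings = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])    # inner product = cosine on normalized vectors
index.add(embeddings)
```

Later snippets in this article reuse the `model`, `chunks`, and `index` objects defined here.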
Choosing how to chunk the documents is critical. A naive strategy is fixed-length or paragraph chunks, which may ignore the document’s logical structure. Jimeno-Yepes et al. (2024) instead chunk documents by structural elements (sections, tables, lists) identified via layout analysis. They report that structure-based chunking finds effective granularity without manual tuning and improves QA accuracy on financial reports. Beyond static rules, chunking can be adaptive. Zhong et al. (2024) propose Mix-of-Granularity (MoG), which precomputes multiple chunk sizes and trains a router network to pick the best size per query. Fine-grained chunks help when a question asks for specifics, whereas coarser chunks work better for broad queries – MoG learns to balance these automatically. They also extend this with MoG-Graph to retrieve information spread across distant sections by connecting chunks via a graph representation.
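As a simplified illustration of structure-aware chunking (not the layout-analysis pipeline of Jimeno-Yepes et al.), the function below splits on blank lines as a stand-in for detected sections and merges consecutive elements under a size budget, so chunk boundaries follow the document’s structure rather than arbitrary character offsets.

```python
# Simplified structure-aware chunking sketch: blank-line splitting stands in
# for real layout analysis; elements are merged until a size budget is hit.
def structural_chunks(text: str, max_chars: int = 1000) -> list[str]:
    elements = [e.strip() for e in text.split("\n\n") if e.strip()]
    merged, current = [], ""
    for element in elements:
        if current and len(current) + len(element) > max_chars:
            merged.append(current)
            current = element
        else:
            current = f"{current}\n\n{element}" if current else element
    if current:
        merged.append(current)
    return merged
```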
Retrieval-Augmented Generation (RAG)
At query time, the user’s question is encoded and the top-k nearest chunk embeddings are retrieved, then prepended to the LLM’s prompt. This RAG paradigm lets LLMs draw on up-to-date evidence without model retraining. However, success depends on retrieving relevant context. If irrelevant text is pulled in, the LLM may still produce incorrect or hallucinated answers.
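A minimal query-time sketch, reusing the `model`, `chunks`, and `index` objects from the indexing example above; the prompt template and the value of k are illustrative.

```python
# Query-time RAG sketch: embed the question, retrieve top-k chunks, build a prompt.
import numpy as np

def retrieve(query: str, k: int = 5) -> list[str]:
    query_vec = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query_vec, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]

def build_prompt(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```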
To improve reliability, researchers have added filtering and control mechanisms. Singh et al. (2024) introduce ChunkRAG, which uses an LLM-based relevance scorer to filter out retrieved chunks that are loosely related to the query. By discarding non-pertinent context before generation, ChunkRAG reduces hallucinations and improves factual accuracy, outperforming baseline RAG models on precision-critical QA tasks. Another line of work compared RAG with using extremely long context windows. Li et al. (2024) found that a sufficiently large-context model (e.g. GPT-4 32K) can surpass RAG in answer quality when given an entire document, but at much higher compute cost. They propose Self-Route, where the system decides per query whether to use a standard RAG pipeline or a long-context LLM, maintaining high accuracy while cutting costs.
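The filter below captures the spirit of ChunkRAG’s LLM-based relevance scoring in a few lines; it is not the authors’ prompt or threshold. Here `llm` is an assumed callable that takes a prompt string and returns the model’s text reply.

```python
# Hedged sketch of LLM-based chunk filtering: score each retrieved chunk for
# relevance and drop low-scoring ones before generation.
def filter_chunks(query: str, retrieved: list[str], llm, threshold: float = 0.5) -> list[str]:
    kept = []
    for chunk in retrieved:
        prompt = (
            "On a scale from 0 to 1, how relevant is this passage to the question?\n"
            f"Question: {query}\nPassage: {chunk}\nAnswer with a single number."
        )
        try:
            score = float(llm(prompt).strip())
        except ValueError:
            score = 0.0  # unparseable reply: treat the chunk as irrelevant
        if score >= threshold:
            kept.append(chunk)
    return kept
```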
Hybrid Retrieval Models
The retrieval component need not rely solely on dense vector search. Hybrid search models combining dense and sparse methods leverage the strengths of each. Dense neural retrievers excel at semantic matching, whereas sparse keyword search (e.g. BM25) ensures exact term recall. Sawarkar et al. (2024) present “Blended RAG”, which fuses dense and sparse results, achieving higher accuracy on QA benchmarks than either alone. Similarly, in enterprise QA, adding BM25 on top of a dense retriever yielded more accurate, grounded answers than a purely dense approach.
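A minimal way to combine the two signals is reciprocal rank fusion (RRF) over a BM25 ranking and a dense ranking, as sketched below. This illustrates the general dense-plus-sparse fusion idea rather than the specific Blended RAG configuration; it assumes the rank_bm25 package and reuses `model`, `chunks`, and `index` from earlier.

```python
# Hybrid retrieval sketch: fuse BM25 (sparse) and embedding (dense) rankings with RRF.
import numpy as np
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c.lower().split() for c in chunks])

def hybrid_search(query: str, k: int = 5, rrf_k: int = 60) -> list[str]:
    # Sparse ranking: chunk ids sorted by descending BM25 score.
    sparse_rank = np.argsort(-bm25.get_scores(query.lower().split()))
    # Dense ranking: chunk ids sorted by embedding similarity.
    q = model.encode([query], normalize_embeddings=True)
    _, dense_ids = index.search(np.asarray(q, dtype="float32"), len(chunks))
    dense_rank = dense_ids[0]
    # Reciprocal rank fusion: sum 1 / (rrf_k + rank) over both rankings.
    scores: dict[int, float] = {}
    for ranking in (sparse_rank, dense_rank):
        for rank, doc_id in enumerate(ranking):
            scores[int(doc_id)] = scores.get(int(doc_id), 0.0) + 1.0 / (rrf_k + rank + 1)
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [chunks[i] for i in top]
```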
Efficiency is a challenge for hybrid retrieval, since running two search engines in parallel is costly. Zhang et al. (2024) address this with a unified graph-based ANN index for dense + sparse vectors. By aligning dense and sparse vector distributions and using a two-stage retrieval (dense-first, then hybrid), their approach achieves an order-of-magnitude speedup over naive hybrid search at the same accuracy. Beyond text, hybrid RAG systems are starting to incorporate structured knowledge. For example, GraphRAG integrates knowledge graphs into retrieval. Graph-based data encodes relational facts that complement text, but also requires specialized retrievers and fusion logic. Early results show that linking LLMs with graph databases can improve multi-hop reasoning on complex queries.
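The two-stage idea can be illustrated even without a graph index: a fast dense pass narrows the corpus to a small candidate pool, and only those candidates receive a combined dense-plus-sparse score. Zhang et al.’s system relies on a unified graph-based ANN index and aligned score distributions; the brute-force sketch below (reusing `model`, `chunks`, `index`, and `bm25` from above, with an assumed mixing weight `alpha`) only shows the control flow.

```python
# Two-stage retrieval sketch: dense-first candidate generation, then hybrid re-scoring.
import numpy as np

def two_stage_search(query: str, k: int = 5, n_candidates: int = 100, alpha: float = 0.5) -> list[str]:
    # Stage 1: dense-only retrieval of a candidate pool.
    q = model.encode([query], normalize_embeddings=True)
    dense_scores, candidate_ids = index.search(
        np.asarray(q, dtype="float32"), min(n_candidates, len(chunks))
    )
    # Stage 2: re-score candidates with a weighted dense + sparse combination.
    sparse = bm25.get_scores(query.lower().split())
    sparse = (sparse - sparse.min()) / (sparse.max() - sparse.min() + 1e-9)  # scale BM25 to [0, 1]
    rescored = sorted(
        ((alpha * d + (1 - alpha) * sparse[i], int(i))
         for d, i in zip(dense_scores[0], candidate_ids[0])),
        reverse=True,
    )
    return [chunks[i] for _, i in rescored[:k]]
```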
Overall, the latest research underscores that effective document chunking and advanced retrieval are key to feeding LLMs with external data. By digitizing documents carefully, chunking them intelligently, and using powerful vector or hybrid search, RAG pipelines can provide LLMs with relevant context at scale. These developments – from adaptive chunking to dense-sparse retrieval techniques – are making LLM-based systems more accurate, efficient, and trustworthy for knowledge-intensive applications.