Table of Contents
Similarity Metrics for Document Chunking in RAG Systems
Semantic vs. Lexical Similarity
Efficiency and Scalability
Robustness to Noise and OCR Errors
Chunking Strategies and Similarity
Choosing an Optimal Similarity Metric: Key Considerations
Sources
Similarity Metrics for Document Chunking in RAG Systems
Retrieval-Augmented Generation (RAG) systems rely on retrieving relevant document chunks to ground the outputs of large language models (LLMs). A critical design choice is the similarity metric used to match queries with document text. Recent literature (2024–2025) examines both lexical and semantic similarity approaches, comparing their efficiency, scalability, and robustness in the context of document digitization (OCR) and chunking strategies. Below, we review key findings and practical considerations from the latest research.
Semantic vs. Lexical Similarity
Lexical similarity metrics (e.g. TF-IDF or BM25) represent documents as sparse term vectors and score the overlap of query and document terms (The Power of Noise: Redefining Retrieval for RAG Systems). This approach excels at exact keyword matching but struggles with semantic paraphrases. Semantic similarity uses dense vector embeddings (typically from neural encoders) and measures distances (e.g. cosine similarity) in embedding space. Dense embeddings capture conceptual relationships beyond exact wording, addressing lexical-gap issues. A 2024 RAG survey notes that pure vector-based semantic search may “miss lexically important matches,” while pure keyword search “could overlook semantic relationships.” Balancing the two is a known challenge (RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems). In practice, semantic retrieval often uses cosine similarity or dot product between query and chunk embeddings, whereas lexical methods use BM25 or related scoring for term overlap.
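To make the two scoring styles concrete, here is a minimal sketch comparing them over a toy set of chunks. It assumes numpy and the third-party rank_bm25 package for the lexical side, and uses placeholder random vectors where a real system would use encoder-produced embeddings.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # third-party package; any BM25 implementation would do

def cosine_scores(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Dense semantic scoring: cosine similarity between one query and each chunk embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ q

chunks = [
    "invoice total due within 30 days",
    "payment terms and late fees",
    "shipping address and delivery details",
]

# Lexical scoring: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([c.split() for c in chunks])
lexical_scores = bm25.get_scores("when is the invoice due".split())

# Semantic scoring: placeholder random vectors stand in for real encoder embeddings.
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(len(chunks), 384))
query_vec = rng.normal(size=384)
semantic_scores = cosine_scores(query_vec, chunk_vecs)

print(lexical_scores)   # rewards exact term overlap ("invoice", "due")
print(semantic_scores)  # would reward conceptual similarity with real embeddings
```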
Hybrid retrieval combines both types, e.g. performing parallel dense and sparse searches and merging the results. This can yield more robust retrieval, as dense methods retrieve conceptually relevant text while lexical matching ensures important keywords aren’t missed. Indeed, multiple studies in 2024 advocate hybrid strategies as best-of-both-worlds solutions (Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?). Empirical results show that hybrid search can significantly improve RAG performance compared to using only one method. Recent work also explores reranking top results with cross-encoder models for finer semantic matching, though this is computationally expensive for large candidate sets.
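One simple way to merge the two ranked lists is reciprocal rank fusion. The studies cited above describe hybrid search in general terms rather than this specific scheme, so treat the following as an illustrative sketch.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Merge ranked lists of chunk IDs (e.g. one from BM25, one from a dense retriever).

    Each list contributes 1 / (k + rank) per chunk; k=60 is a commonly used default.
    Fusing ranks avoids having to calibrate BM25 scores against cosine similarities.
    """
    fused = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            fused[chunk_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Example: union of a BM25 ranking and a dense-embedding ranking.
bm25_top = ["c3", "c1", "c7"]
dense_top = ["c1", "c5", "c3"]
print(reciprocal_rank_fusion([bm25_top, dense_top]))  # c1 and c3 rise to the top
```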
Efficiency and Scalability
A key consideration in choosing a similarity metric is computational efficiency – both at query time and during indexing – and scalability to large corpora. Classic lexical indices (inverted indexes for BM25) are highly optimized and can retrieve results in milliseconds even from millions of documents. Neural semantic search requires computing embeddings and performing nearest-neighbor search in a high-dimensional space, which is more compute- and memory-intensive. Recent empirical evaluations provide insight into these trade-offs:
Throughput (QPS): Lin (2024) compared a dense bi-encoder model (BGE), a learned sparse model (SPLADE), and BM25 on BEIR benchmarks (Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?). The study found no overwhelming winner in retrieval quality alone – learned sparse and dense models had similar effectiveness – but BM25 was much faster, especially on large corpora. On the largest corpus tested (~1M+ documents), BM25 achieved an order-of-magnitude higher queries per second than the neural models. This highlights that in high-throughput or real-time applications, lexical methods still offer a performance advantage.
Indexing and Memory: Dense vector search typically uses approximate nearest neighbor structures (such as HNSW graphs) to scale. Building these indexes can be time-consuming for millions of embeddings, though they enable fast approximate search. Lin’s study advises that for corpora under ~1M documents, a brute-force (flat) index or even exhaustive search may be sufficient, as HNSW adds little benefit. For larger corpora, HNSW indexes drastically improve query latency at the cost of longer indexing time and a slight accuracy loss. Notably, approximate indexes and quantization introduce minor degradation in retrieval effectiveness (e.g. small drops in nDCG), a practical detail often overlooked in research. In contrast, inverted indexes for lexical search are relatively lightweight to build and update incrementally, making them scalable for dynamic knowledge bases. (A flat vs. HNSW index sketch follows this list.)
Embedding Computation: Semantic similarity requires encoding each query (and document) with a neural model. This adds latency per query and scales with model size. However, advances in embedding model efficiency (smaller models, knowledge distillation) and hardware acceleration have made it feasible for many applications. Practitioners often cache document embeddings offline, so the main cost is query encoding at runtime. Still, if extremely low latency is needed, lexical retrieval (which only requires simple text processing on queries) has an edge.
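As a rough illustration of the flat-versus-HNSW trade-off discussed above, the sketch below builds both index types with the faiss library (assumed available; the index parameters and corpus size are illustrative, not recommendations).

```python
import numpy as np
import faiss  # assumed available; any ANN library with flat and HNSW indexes would work

d, n = 384, 100_000                       # embedding dimension and corpus size (illustrative)
xb = np.random.rand(n, d).astype("float32")
faiss.normalize_L2(xb)                    # unit-length vectors: L2 ranking matches cosine ranking

# Flat (exact) index: essentially no build cost, but every query scans all n vectors.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# HNSW (approximate) index: slower to build, much faster queries at scale,
# with a small recall loss controlled by efSearch.
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.hnsw.efSearch = 64
hnsw.add(xb)

xq = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(xq)
print(flat.search(xq, 5)[1])   # exact top-5 chunk ids
print(hnsw.search(xq, 5)[1])   # approximate top-5 chunk ids (usually the same set)
```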
In summary, lexical similarity (BM25) offers speed and scalability, while dense semantic similarity offers richer matching at higher computational cost. Depending on system constraints, a hybrid setup or cascaded approach (fast lexical retrieval to narrow candidates, followed by semantic rerank) may be optimal.
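A cascaded setup of this kind might look like the following sketch, assuming the rank_bm25 package for the first stage and a sentence-transformers cross-encoder for the rerank (the model name is just an example).

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder  # assumed; any reranker could replace it

chunks = [
    "The warranty covers parts and labor for two years.",
    "Returns are accepted within 30 days of purchase.",
    "Shipping is free for orders over 50 dollars.",
]
bm25 = BM25Okapi([c.lower().split() for c in chunks])
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model name

def cascaded_retrieve(query: str, first_stage_k: int = 2, final_k: int = 1):
    """Cheap lexical recall first, then a more expensive semantic rerank of the survivors."""
    candidates = bm25.get_top_n(query.lower().split(), chunks, n=first_stage_k)
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:final_k]]

print(cascaded_retrieve("how long is the warranty"))
```

The first stage keeps the expensive cross-attention model off the full corpus, which is what makes the cascade affordable at scale.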
Robustness to Noise and OCR Errors
Document digitization via OCR introduces noise – misrecognized characters, words, and formatting – which can disrupt both lexical and semantic retrieval. Recent studies have specifically evaluated how different retrievers handle noisy text:
OCR Impact on Retrieval: Zhang et al. (2024) introduced OHRBench, a benchmark to assess OCR noise in RAG pipelines (OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation). They evaluated a sparse BM25 retriever versus a dense embedding model (BGE) under increasing noise. On clean text, BM25 slightly outperformed the dense model, but as noise increased, BM25’s performance dropped sharply, eventually falling below the dense retriever. This indicates lexical similarity is highly sensitive to spelling and formatting errors – if a query term is garbled in OCR, BM25 fails to match it. Dense embeddings showed more robustness to semantic noise (e.g. character swaps or minor errors), likely because the encoder can still capture contextual meaning to some extent. However, dense methods are not immune to noise either; very severe OCR errors degrade any model’s understanding.
Multilingual/OCR QA: A 2025 multilingual QA study found that QA systems “are highly prone to OCR-induced errors” and suffer notable performance degradation on noisy text (MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts). This underscores the importance of robust retrieval when working with digitized documents. Techniques like query expansion or fuzzy matching can help lexical methods handle typos, whereas for semantic retrieval, finetuning embeddings on noisy text or using character-aware models can improve resilience.
Structured Data and Format: Noise isn’t only character errors – formatting differences (tables, formulas, special symbols) also pose challenges. OHRBench identifies formatting noise (like LaTeX artifacts in extracted text) which can confuse retrievers (OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation). The study showed that certain advanced LLM-based retrievers were relatively robust to formatting clutter, but overall retrieval performance dipped when extraneous tokens were present. For instance, table-heavy queries saw up to ~10% retrieval performance drop for BM25 under noisy formatting. This suggests that cleaning OCR output (removing artifacts) or using models trained to ignore formatting tokens is important for robust similarity matching.
Practical takeaway: In scenarios with noise (e.g. scanned documents, user-generated text with typos), semantic similarity metrics tend to be more forgiving of imperfect text than strict lexical matching. A hybrid approach can also help: BM25 can retrieve exact matches for correctly recognized parts, while an embedding-based search can catch semantically relevant text that lexical search misses due to OCR errors. Additionally, pre-processing steps (spell correction, OCR post-processing) improve lexical retrieval robustness.
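As a small example of such pre-processing, the sketch below maps suspected OCR-garbled query tokens onto a corpus vocabulary using Python's standard difflib; the vocabulary and cutoff here are illustrative assumptions.

```python
import difflib

# In practice this vocabulary would be built from the indexed corpus itself.
VOCAB = {"invoice", "payment", "shipping", "address", "total", "due", "terms"}

def correct_ocr_tokens(text: str, cutoff: float = 0.8) -> str:
    """Replace out-of-vocabulary tokens with their closest in-vocabulary match.

    BM25 only matches exact terms, so mapping 'lnvoice' -> 'invoice' recovers matches
    that OCR noise would otherwise break. Dense retrievers need no such step but can
    also benefit from cleaner input.
    """
    fixed = []
    for tok in text.lower().split():
        if tok in VOCAB:
            fixed.append(tok)
            continue
        match = difflib.get_close_matches(tok, VOCAB, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else tok)  # leave the token alone if nothing is close
    return " ".join(fixed)

print(correct_ocr_tokens("lnvoice tota1 due"))  # garbled tokens snap back to vocabulary terms
```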
Chunking Strategies and Similarity
Large documents must be split into chunks for retrieval, but how to chunk can influence retrieval success. Fixed-size chunking (splitting text into equal-length segments) is simple and efficient, whereas semantic chunking aims to break documents at semantically coherent boundaries (e.g. topic shifts) by using similarity metrics. This is directly related to similarity measures: semantic chunking algorithms often use an embedding model to decide chunk boundaries (for example, splitting where adjacent sentences have low cosine similarity).
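A minimal version of that boundary rule, assuming a sentence-transformers encoder (the model name and threshold are illustrative), could look like this:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed; any sentence encoder works

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model name

def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    """Greedy boundary detection: start a new chunk where adjacent sentences are dissimilar."""
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(np.dot(embs[i - 1], embs[i]))  # cosine, since embeddings are normalized
        if sim < threshold:                        # similarity drop => likely topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```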
A comprehensive study by Qu et al. (2024) questioned the value of semantic chunking. They evaluated retrieval and QA performance using semantic-based chunks vs. fixed-size chunks across tasks. The surprising result: the benefits of semantic chunking were inconsistent and often not enough to justify its higher computational cost. In some cases semantic chunks improved retrieval of relevant passages, but often a simple fixed window (possibly with slight overlap) worked as well or better. The advantages of semantic segmentation were “highly task-dependent and often insufficient to justify the added computational costs.” In other words, using embedding-based similarity to create chunks (which requires encoding and clustering sentences) didn’t consistently boost downstream RAG performance.
On the other hand, other researchers still see promise in smarter chunking for complex queries. A technique called ChunkRAG (2024) proposed forming “semantically coherent and non-overlapping chunks” to better align with information needs. This method groups consecutive sentences until a drop in cosine similarity (below a threshold) triggers a new chunk, ensuring each chunk is topically unified. The ChunkRAG pipeline then applied hybrid retrieval (BM25 + embedding ensemble) on these chunks, and additional filtering to remove redundancy (by eliminating chunks with very high mutual similarity). Such a pipeline showed reduced irrelevance and redundancy in the retrieved context, which can help mitigate LLM hallucinations. The mixed findings suggest that while naive semantic chunking alone may not always pay off, domain-specific chunking combined with robust retrieval/filtering can still improve RAG results in certain settings.
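The redundancy-filtering step can be sketched generically as follows; this is not ChunkRAG's exact procedure, just a simple cosine-based near-duplicate filter over normalized chunk embeddings with an assumed threshold.

```python
import numpy as np

def drop_near_duplicates(chunk_embs: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Keep a chunk only if its similarity to every already-kept chunk is below threshold.

    Embeddings are assumed L2-normalized, so the dot product equals cosine similarity.
    This trims highly redundant chunks from the retrieved set before prompting the LLM.
    """
    kept = []
    for i, emb in enumerate(chunk_embs):
        if all(float(np.dot(emb, chunk_embs[j])) < threshold for j in kept):
            kept.append(i)
    return kept
```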
Chunk size also affects similarity retrieval: smaller chunks (fine-grained) increase the chances that a relevant piece is retrieved but also risk losing context. Larger chunks carry more context but may dilute relevance scoring if they contain mixed content. The optimal balance can depend on the retrieval metric – lexical BM25 might favor smaller chunks (so query terms aren’t diluted by unrelated text), whereas embeddings can handle larger chunks since they encode broader context. Researchers often use overlap between fixed chunks to maintain context continuity. In practice, starting with a moderate fixed length (e.g. 200–300 tokens) and using overlap has been a robust baseline, with semantic-based chunking considered if a particular task shows a benefit.
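For reference, a fixed-size baseline with overlap is only a few lines; the window and overlap sizes below mirror the ballpark figures mentioned above and should be tuned per task.

```python
def fixed_chunks(tokens: list[str], size: int = 250, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size windows with overlap to preserve context at boundaries."""
    assert 0 <= overlap < size
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Example with word-level "tokens"; in practice use the embedding model's tokenizer.
words = ("RAG systems split long documents into chunks before indexing " * 60).split()
chunks = fixed_chunks(words, size=250, overlap=50)
print(len(words), [len(c) for c in chunks])  # total tokens and per-chunk lengths
```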
Choosing an Optimal Similarity Metric: Key Considerations
Recent studies converge on a few practical guidelines for selecting similarity metrics in RAG and search systems:
Task and Content Characteristics: If exact terminology or precision is crucial (e.g. legal or technical documents, structured fields), lexical similarity may be necessary to hit exact matches. If queries are more conceptual or the corpus uses varied language (synonyms, paraphrases), semantic embeddings will dramatically improve recall of relevant information (The Power of Noise: Redefining Retrieval for RAG Systems). For heterogeneous information needs, a hybrid approach is safest (Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?).
Scale and Latency Requirements: For large-scale search with millions of documents or strict latency constraints, efficient sparse methods (BM25 or learned sparse models) are attractive due to their speed. Dense retrieval can be scaled with ANN indexes and hardware, but requires more resources and careful tuning. If using dense retrieval at scale, investing in index optimization (HNSW, quantization) is important, and one should account for a small loss in retrieval accuracy from approximate search. Smaller deployments (e.g. enterprise collections of up to a few hundred thousand documents) can comfortably use dense embeddings with flat indexes or hybrid search for better accuracy.
Robustness Needs: In settings with noisy data (OCR-digitized archives, user text with typos, multilingual mixtures), embedding-based similarity is generally more robust to imperfect text (OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation). Lexical metrics can be augmented with pre-processing (spell correction, synonym expansion) to partially mitigate this. If the knowledge base text is generated via OCR, consider using an OCR-specific benchmark or testing retrieval efficacy under various error rates (MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts). For highly structured text (tables, code, forms), no single similarity metric may suffice – specialized parsing or treating structure separately might be needed, as both BM25 and vanilla embeddings struggle with non-linear text layouts.
Resource Constraints: Computing embeddings for every document and query introduces overhead. If computational budget is limited, one might use lexical search as a first-stage filter (cheaply narrowing down candidates) then apply a semantic re-rank on the top results. This two-stage setup often yields a good balance: BM25 ensures relevant keyword matches are not missed, and the reranker (using a more powerful semantic metric or cross-attention model) ensures the final ranking prioritizes truly relevant, on-topic chunks.
Hybrid and Ensemble Methods: The consensus in late-2024 literature is that hybrid retrieval is a strong default (Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?). By combining cosine similarity of embeddings with lexical scoring (sometimes via a weighted sum or by simply merging result lists), systems can cover each method’s blind spots. For example, one can retrieve the top-k by BM25 and the top-k by a dense model, then union these sets and re-rank them (possibly by an LLM or a learned ranker). This approach was shown to improve answer recall and downstream QA accuracy in several studies. The only downside is added complexity and the need to maintain two index types, but frameworks are emerging to support this seamlessly. (A weighted-sum fusion sketch follows this list.)
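The weighted-sum variant mentioned in the last point can be sketched as below; the min-max normalization and the alpha default are assumptions, since raw BM25 and cosine scores are not on comparable scales and the balance is typically tuned on a validation set.

```python
import numpy as np

def weighted_hybrid(bm25_scores: np.ndarray, dense_scores: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Weighted-sum fusion of lexical and semantic scores over the same candidate chunks.

    Each score vector is min-max normalized so the two systems contribute on the same scale;
    alpha controls the lexical/semantic balance (alpha=1.0 is pure BM25, 0.0 is pure dense).
    """
    def norm(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    return alpha * norm(bm25_scores) + (1 - alpha) * norm(dense_scores)

# Example: three candidate chunks scored by both retrievers.
print(weighted_hybrid(np.array([12.0, 3.5, 0.0]), np.array([0.62, 0.71, 0.40]), alpha=0.4))
```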
In conclusion, semantic similarity metrics (e.g. embedding cosine) and lexical metrics (e.g. BM25) each have distinct strengths. Lexical methods offer speed, interpretability, and exact matching – valuable for large-scale and precision-critical search. Semantic methods offer superior recall and understanding, crucial for open-ended queries and overcoming vocabulary mismatch. The most robust RAG systems in 2024–2025 tend to use a combination: intelligent chunking to optimize the units of retrieval, hybrid similarity search to retrieve diversely relevant context, and multi-step filtering to ensure the retrieved chunks are relevant and not redundant. As research suggests, one should choose the similarity metric (or mix of metrics) by weighing the domain requirements (speed vs. accuracy vs. noise tolerance) and even consider adaptive strategies that can switch or ensemble methods as needed. This balanced approach is key to building scalable, efficient, and reliable RAG pipelines grounded in the latest findings from the literature.
Sources:
Renyi Qu et al. (2024). “Is Semantic Chunking Worth the Computational Cost?” – Evaluation of semantic vs. fixed-size chunking.
Jimmy Lin (2024). “Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?” – Efficiency and effectiveness comparison of BM25, SPLADE, and dense embeddings.
Junyuan Zhang et al. (2024). “OCR Hinders RAG” – Impact of OCR noise on lexical (BM25) vs. dense retrieval performance (OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation).
RAG Playground (2024). “RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems.” – Notes the challenge of balancing semantic and lexical matching and the benefits of hybrid search.
ChunkRAG (2024). “Mitigating Irrelevance and Hallucinations in RAG.” – Uses semantic chunking + hybrid retrieval; demonstrates redundancy filtering with cosine similarity.
MultiOCR-QA (2025). “Robustness of QA on Noisy OCR Text.” – OCR errors significantly degrade QA performance, highlighting need for robust retrieval (MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts).