Table of Contents
Benchmarking Embedding Models for Document Digitization and Chunking
Introduction
Embedding Model Landscape (2024–2025)
Real-World Corpora and Benchmarks
Retrieval Accuracy and Robustness to Noise
Document Chunking Strategies
Efficiency and Scalability Considerations
Use Cases and Task Performance
Conclusion
Introduction
Document digitization pipelines rely heavily on embedding models to convert text (and sometimes images) into vector representations for tasks like search, classification, and summarization. Recent arXiv works (2024–2025) emphasize transformer-based bi-encoders that produce semantic text embeddings (ColPali: Efficient Document Retrieval with Vision Language Models). These dense vector models have achieved state-of-the-art retrieval accuracy on standard benchmarks, often outperforming traditional lexical methods on semantic queries. However, challenges remain in handling structured documents (PDFs, scanned forms) and unstructured text (contracts, web pages) at scale. Key issues include maintaining high retrieval accuracy under noise (e.g. OCR errors), ensuring computational efficiency and scalability for large corpora, and effective chunking of long documents to fit model context windows.
Embedding Model Landscape (2024–2025)
Transformer-based embedding models from both proprietary (OpenAI, Cohere) and open-source groups (e.g. BAAI's BGE, E5, and other MTEB leaders) have been extensively compared. OpenAI's text embedding models (text-embedding-ada-002 and the newer text-embedding-3 series) achieved strong benchmark scores (up to ~64.6 on MTEB), but open models quickly surpassed this. For example, E5-Mistral (7B), fine-tuned with contrastive instructions, reached an MTEB score of 66.6, outperforming prior bidirectional models. Similarly, BGE (BAAI's "M3-Embedding"), introduced in 2024, is a multilingual model supporting 100+ languages and long inputs (up to 8192 tokens), setting a new state of the art on multilingual and cross-lingual retrieval tasks (BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation). In fact, an enhanced BGE variant with in-context learning (bge-en-icl) achieved an MTEB score above 71, topping the public leaderboard. Cohere's transformer embedding models (e.g. multilingual Embed v3) are also noted for strong performance, often on par with these open models (exact figures vary by task). Overall, transformer encoders from OpenAI, Cohere, and open-source groups (BGE, E5, etc.) all demonstrate high retrieval quality, with open models now matching or exceeding proprietary ones on many benchmarks. The trade-offs often come down to model size and efficiency: larger models (billions of parameters) yield higher accuracy but incur more compute cost.
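As a concrete illustration of how such bi-encoders are used in practice, the sketch below loads a small open-source BGE checkpoint with the sentence-transformers library and ranks two documents against a query by cosine similarity. The model name and example texts are illustrative choices, not the specific configurations benchmarked in the cited papers.

```python
# Minimal sketch: encoding queries and documents with an open-source
# embedding model via sentence-transformers. Any MTEB-listed bi-encoder
# can be swapped in by changing the model name.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # small open BGE checkpoint

docs = [
    "The invoice total is due within 30 days of receipt.",
    "Berlin is the capital and largest city of Germany.",
]
query = "When does the invoice have to be paid?"

# With normalized embeddings, dot product equals cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_vec, doc_vecs)   # shape (1, num_docs)
best = int(scores.argmax())
print(f"Best match (cosine {scores[0, best].item():.3f}): {docs[best]}")
```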
Real-World Corpora and Benchmarks
To evaluate embedding models for document tasks, researchers use diverse real-world corpora. Wikipedia-based open-domain QA sets like Natural Questions (NQ) and HotpotQA are common for unstructured web text, while scientific literature benchmarks like SciDocs and SciFact represent research papers; these were included in BEIR and related evaluation suites. However, many such datasets consist of short passages (~1–2 paragraphs each), insufficient to test long-document chunking. Gao et al. (2024) address this by stitching documents together to create ~100-sentence texts that better simulate real articles. For structured documents, new benchmarks have emerged. Faysse et al. (2024) introduced ViDoRe, a visual document retrieval benchmark with pages from invoices, forms, and scholarly PDFs (ColPali: Efficient Document Retrieval with Vision Language Models). It assesses retrieval when information is split between text and layout elements (figures, tables, etc.), and traditional text-only pipelines struggle on such visually rich pages. In ViDoRe, methods that directly embed the page images (using vision-language models) showed superior accuracy. For example, the ColPali model (a multi-vector vision-language embedding approach) outperformed text-extraction pipelines on page-level retrieval across diverse document types, while also meeting practical latency constraints. These benchmarks highlight how Wikipedia and scientific text remain staples for evaluating unstructured document retrieval, whereas new datasets for forms, invoices, and scans test embedding models' ability to handle structure and noise.
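Most of these benchmarks report nDCG@10 as their headline retrieval metric. Below is a minimal sketch of that computation with hypothetical retrieved document ids and relevance judgments, just to make the metric concrete.

```python
# Illustrative nDCG@10 computation, the metric most of the cited retrieval
# benchmarks (BEIR, ViDoRe) report. Inputs are hypothetical.
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """Normalized discounted cumulative gain at cutoff k."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the most relevant page (grade 2) was retrieved at rank 2.
print(ndcg_at_k(["d7", "d3", "d9"], {"d3": 2, "d5": 1}))  # ~0.48
```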
Retrieval Accuracy and Robustness to Noise
Modern embedding models deliver strong retrieval accuracy in semantic search and question answering, but their robustness varies with noise and document quality. Dense retrievers excel at finding semantically relevant content, though purely keyword-based queries can still favor sparse methods such as BM25, and some studies note that dense retrieval degrades on these query types. A critical concern for document digitization is OCR noise. Philippy et al. (2025) showed that multilingual embeddings struggle on imperfectly digitized historical text, where OCR errors and archaic spelling significantly hurt cross-lingual semantic search. They adapted models with in-domain data to improve accuracy on 19th-century scanned newspapers, reaching 98% retrieval accuracy after fine-tuning. Generally, OCR-induced corruptions cause larger drops in performance than other noise types: one analysis found that LLM-based QA accuracy declines more steeply with OCR errors than with comparable amounts of ASR (speech) errors. This suggests current embeddings are not inherently robust to the spelling and tokenization distortions introduced by OCR. To combat this, researchers have begun integrating OCR confidence scores into embeddings (e.g. in ConfBERT, a BERT-based model; arXiv:2409.04117) and using spelling correction or augmentation to make models resilient to noise. Additionally, irrelevant text (distractor noise) can confuse retrieval systems. Cuconasu et al. (2024) injected random "noise" documents into Retrieval-Augmented Generation (RAG) pipelines and observed that some LLMs' QA accuracy drops significantly when irrelevant chunks are retrieved alongside relevant ones (The Power of Noise: Redefining Retrieval for RAG Systems). Interestingly, the performance degradation was worst when noisy documents appeared closer to the query in the LLM's context. These findings underscore the need for both robust embeddings (to handle OCR errors or typos) and better filtering of retrieved results to avoid misleading the end task.
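To get an intuition for this OCR sensitivity, one can perturb a clean sentence with character-level substitutions and measure how far its embedding drifts. The confusion table and noise rate below are simplified stand-ins for real OCR error distributions, and the model name is an illustrative choice.

```python
# Sketch: probing embedding robustness to OCR-style noise by comparing a
# clean sentence with a character-corrupted copy. The corruption model is a
# toy approximation of OCR confusions, used only for illustration.
import random
from sentence_transformers import SentenceTransformer, util

CONFUSIONS = {"e": "c", "i": "l", "o": "0", "m": "rn", "s": "5"}

def add_ocr_noise(text, rate=0.2, seed=0):
    """Randomly replace characters with common OCR confusions."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.lower() in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch.lower()])
        else:
            out.append(ch)
    return "".join(out)

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
clean = "The committee approved the municipal budget for the next fiscal year."
noisy = add_ocr_noise(clean)

v_clean, v_noisy = model.encode([clean, noisy], normalize_embeddings=True)
print(noisy)
print("cosine similarity clean vs. noisy:", util.cos_sim(v_clean, v_noisy).item())
```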
Document Chunking Strategies
Chunking long documents into manageable pieces is a crucial preprocessing step for embedding-based systems. Without chunking, important details may be lost due to context window limits or "over-compressed" embeddings (Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models). The default in many pipelines is to split text into fixed-size segments (e.g. 512 tokens), encode each independently, and store these vectors for retrieval. While effective, this can break semantic coherence: a coreference like "its capital" may no longer link to "Berlin" if the two end up in different chunks. Recent research has proposed smarter chunking methods. Late Chunking (Günther et al., 2024) delays the split until after the transformer encoding: the model encodes the entire document (leveraging long-context models up to 8K tokens), then mean-pools token embeddings in sliding windows to generate chunk vectors that each reflect the full-document context. This approach yielded consistently higher retrieval performance, about 1.5–1.9% absolute nDCG@10 improvement on average across datasets like FiQA, SciFact, and NFCorpus. Another approach, targeted at structured PDFs, is element-wise chunking. Jimeno-Yepes et al. (2024) showed that splitting financial reports by their logical elements (headings, tables, itemized lists) rather than uniform blocks improves downstream QA: on a finance QA benchmark, element-type chunking boosted accuracy from 50% to ~53.2%, a notable gain over standard 512-token chunks. The structured chunks preserved context (e.g. a table and its caption kept together), leading to more relevant retrieval and answers. Beyond how to chunk, which chunks to use is also being optimized. Singh et al. (2024) introduced a chunk filtering technique ("ChunkRAG") that uses an LLM to score each retrieved chunk's relevance to the query; by discarding low-relevance chunks before generation, they reduced hallucinations and improved factual accuracy in RAG responses. This highlights that both intelligent chunk formation and chunk selection are key to robust performance.
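A minimal sketch of the late-chunking idea follows: the document is encoded once, and chunk vectors are produced afterwards by mean-pooling contextualized token embeddings over windows, so each chunk vector reflects its surrounding document. The model and window size are illustrative stand-ins; the original work relies on long-context embedding models supporting up to 8K tokens.

```python
# Sketch of late chunking: encode the full document once, then pool token
# embeddings per window AFTER encoding. Model choice and window size are
# illustrative, not the paper's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "BAAI/bge-small-en-v1.5"   # stand-in encoder (512-token limit)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def late_chunk_embeddings(text, window=64):
    """Return one embedding per token window, pooled after full encoding."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    chunks = []
    for start in range(0, hidden.size(0), window):
        pooled = hidden[start:start + window].mean(dim=0)
        chunks.append(torch.nn.functional.normalize(pooled, dim=0))
    return torch.stack(chunks)                       # (n_chunks, dim)

doc = "Berlin is the capital of Germany. " * 20 + "Its population is about 3.8 million."
vecs = late_chunk_embeddings(doc)
print(vecs.shape)  # each row carries context from the whole encoded document
```

Because every token attends to the full input before pooling, the chunk containing "Its population" still carries information about "Berlin", which is exactly the coreference problem that naive fixed-size chunking breaks.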
Efficiency and Scalability Considerations
Computational efficiency is paramount when embedding large document collections. Industry-scale retrieval systems impose latency limits for both indexing and query phases (ColPali: Efficient Document Retrieval with Vision Language Models). Many 2024 papers therefore evaluate not just retrieval accuracy but also indexing throughput (documents encoded per second) and query latency. A common finding is that document parsing and chunking can dominate pipeline time: for example, Faysse et al. note that full PDF OCR and parsing is often slower than embedding the extracted text with a model. Embedding methods themselves vary in speed: a lightweight lexical scorer like BM25 can rank a query in ~1–3 ms, whereas a transformer embedding model (e.g. BGE-M3) might take ~8–9 ms per query on similar hardware. To scale to millions of documents, vector index size and search speed become concerns. Multi-vector approaches (splitting a page into multiple embeddings, as ColPali does) improve recall but increase index size. Researchers mitigate this via clustering or compression: grouping similar chunk vectors and storing centroids, or quantizing vectors to fewer bits, can cut storage by two orders of magnitude with minimal accuracy loss. Another angle is designing models that handle longer input in one pass to reduce the number of chunks. The Mamba retriever (Cao et al., 2024) is a 130M-parameter model that can encode entire 100+ page documents in linear time, retrieving answers from the full text instead of many separate chunks (Efficient Full-Context Retrieval for Long Documents, OpenReview). This model matched or exceeded the accuracy of much larger embedding models on 41 long-document QA tasks, while being faster and more memory-efficient. In general, techniques like synthetic data training (to teach models long-range dependencies), batch optimization for multi-task training, and pruning/quantization for smaller model sizes are all being applied to push the frontier of scalable embedding-based retrieval. The result is a new generation of retrievers that approach the quality of LLM reasoning on long texts (GPT-4-level performance on 256k-token inputs, in Mamba's case) at a fraction of the computational cost.
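The sketch below illustrates the kind of vector compression mentioned above, using product quantization in FAISS to shrink per-vector storage from full float32 down to a few dozen bytes. The dimensions, corpus size, and index parameters are illustrative values, not settings taken from the cited papers.

```python
# Sketch: shrinking a chunk-vector index with IVF + product quantization in
# FAISS. Random vectors stand in for real chunk embeddings.
import faiss
import numpy as np

dim, n_vectors = 384, 50_000
rng = np.random.default_rng(0)
vectors = rng.standard_normal((n_vectors, dim)).astype("float32")
faiss.normalize_L2(vectors)                      # cosine similarity via inner product

quantizer = faiss.IndexFlatIP(dim)               # coarse cluster centroids
index = faiss.IndexIVFPQ(quantizer, dim, 256, 48, 8)  # 48 PQ codes x 8 bits = 48 bytes/vector
index.train(vectors)
index.add(vectors)
index.nprobe = 16                                # clusters probed per query

query = vectors[:1]
scores, ids = index.search(query, 5)
print(f"storage: ~48 bytes/vector vs. {dim * 4} bytes for raw float32")
print("top-5 ids:", ids[0])
```

Here each 1,536-byte float32 vector is stored in roughly 48 bytes, a ~32x reduction; more aggressive quantization or centroid-only storage pushes this further, at some cost in recall.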
Use Cases and Task Performance
Embedding models are benchmarked across a spectrum of use cases:
Semantic Search & Retrieval: Dense embeddings are particularly effective for search. They capture semantic similarity, enabling retrieval of relevant passages even when exact keywords differ (Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models). On standard retrieval tasks (FAQ matching, web search, QA), transformer embeddings significantly improve recall of relevant documents compared to lexical baselines. For instance, late chunking improved nDCG on TREC-COVID and SciFact searches by pooling context. Multi-hop QA tasks also benefit: the RAPTOR system (2024) built a tree of chunk embeddings and their abstractive summaries, enabling retrieval of broader context. Coupling this with GPT-4, it achieved a 20% absolute accuracy gain on the QuALITY long-document QA benchmark over previous bests (RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval). This demonstrates that enhanced retrieval (using hierarchical embeddings plus summaries) can dramatically improve complex question answering.
Document Classification: Rather than using task-specific classifiers, one can classify documents via embeddings by comparing them to category prototypes or by training a simple classifier on top of the embeddings (a minimal nearest-prototype sketch follows this list). Some recent embedding models are explicitly trained to perform well on classification tasks alongside retrieval. For example, SFR-Embedding and NV-Embed fine-tune a single model on a blend of retrieval data and labeled topic/intent data, achieving strong accuracy on clustering and classification evaluations. NV-Embed (2024) reported state-of-the-art results on 12 classification tasks in MTEB while still excelling at search. The key is modifying the training loss (e.g. handling in-batch negatives differently) so as not to degrade non-retrieval task performance. In practice, these embeddings enable fast semantic classification, for example tagging legal contracts by type or routing scanned invoices by content, with just a nearest-neighbor or linear layer, avoiding expensive full-text analysis.
Summarization and Content Chunking: While summarization is typically a generative task, embeddings assist it in several ways. First, to summarize very long documents, systems retrieve the most salient or relevant chunks as determined by embedding similarity; this is akin to a semantic search for key points before generating a summary. Chunk-based retrieval has been used to feed LLMs only the most important sections, thereby improving summary focus and factuality. Second, hierarchical methods like RAPTOR use iterative chunk clustering and summarization, guided by embeddings, to condense a document into a tree of summaries (RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval). This allows an LLM to generate a final summary with awareness of the whole document's structure. The improved performance on multi-step reasoning tasks suggests that summarizing sub-parts (using embeddings to identify related content) yields more comprehensive and accurate summaries. Additionally, embeddings can help evaluate summaries, e.g. using embedding-based similarity (BERTScore or cosine similarity) to check that a summary covers the source content. Overall, while embeddings don't produce summaries on their own, they play a supporting role in retrieving, organizing, and verifying content for summarization.
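As a concrete example of the embedding-based classification referenced above, the sketch below routes a document to the nearest class prototype, where each prototype is simply the mean embedding of a few labeled examples. The class labels, example texts, and model name are made up for illustration.

```python
# Sketch: training-free document routing via class prototypes built from
# embeddings. Labels and texts are hypothetical examples.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

examples = {
    "invoice":  ["Amount due: $1,240. Payment terms: net 30.",
                 "Invoice number 00873, billed to ACME Corp."],
    "contract": ["This agreement is entered into by and between the parties.",
                 "Either party may terminate with 60 days written notice."],
}

# One prototype vector per class: the mean of its example embeddings.
prototypes = {
    label: model.encode(texts, normalize_embeddings=True).mean(axis=0)
    for label, texts in examples.items()
}

def classify(text):
    """Assign the class whose prototype has the highest dot-product score."""
    vec = model.encode(text, normalize_embeddings=True)
    return max(prototypes, key=lambda label: float(np.dot(vec, prototypes[label])))

print(classify("Please remit payment of the outstanding balance by June 30."))
```

Swapping the prototype comparison for a small logistic-regression head on top of the same embeddings is the usual next step when a few hundred labeled documents are available.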
Conclusion
In the past two years, benchmarking studies on arXiv have significantly advanced our understanding of embedding model performance in document digitization workflows. Transformer embeddings (OpenAI, Cohere, BGE, etc.) are the backbone of modern semantic retrieval, with accuracy gains demonstrated on Wikipedia QA, scientific literature search, and beyond. At the same time, research has highlighted the importance of efficient chunking and retrieval pipelines: from late chunking that preserves context (Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models), to structure-aware chunking for complex documents, to filtering and multi-scale retrieval that reduce errors. These techniques improve not only retrieval accuracy but also end-task performance in QA and summarization. Moreover, the field is moving towards solutions that are both scalable and robust: capable of indexing millions of diverse documents, handling noisy OCR text, and still returning relevant information quickly. As of 2025, open-source embedding models have reached parity with proprietary ones in many areas, enabling wide access to high-quality embeddings. Ongoing benchmarks using real-world corpora (from legal contracts to scanned invoices) continue to drive innovation in model design and evaluation. The surveyed literature paints an optimistic picture: through careful benchmarking and novel chunking strategies, the field is overcoming the challenges of structured vs. unstructured data and inching closer to retrieval systems that are accurate, efficient, and reliable even in noisy, complex document collections.
Sources:
Chen et al., 2024 – BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Jimeno-Yepes et al., 2024 – Financial Report Chunking for Effective RAG
Günther et al., 2024 – Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
Singh et al., 2024 – ChunkRAG: LLM-Driven Chunk Filtering for RAG
Faysse et al., 2024 – ColPali: Efficient Document Retrieval with Vision Language Models (ViDoRe benchmark)
Philippy et al., 2025 – Adapting Multilingual Embeddings for Noisy Historical Text
Cao et al., 2024 – Efficient Full-Context Retrieval for Long Documents (Mamba retriever, OpenReview)
Sarthi et al., 2024 – RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
NV-Embed (NVIDIA), 2024 – Improved Training for LLM Embeddings (ICLR 2025)
Thakur et al., 2021 – BEIR Benchmark, as analyzed in 2024 arXiv long-document retrieval studies